domingo, 11 de outubro de 2015

3 things you should know when using the DataStax ODBC driver to connect Tableau with Apache Cassandra


This week we will talk about the great connectivity between Cassandra and Tableau.

As many of you already know, there is an ODBC driver that allows Tableau to connect with Cassandra.

The DataStax Developer Blog has a step-by-step instructions that will help you to use the ODBC driver with Tableau.

The intention of this blog is to highlight some details that I believe that are important.

Sometimes we are so exciting to be able to use Tableau to analise the data stored in Cassandra, that we may not remember some very important things:

#1 - Cassandra does not support aggregations (sum(), etc...) 
That means that every query you perform against Cassandra, you will be asking for the detailed data. In other words, your queries will have similar behavior when you extract the data without using aggregation option (see the post Tableau Data Extract #1 - Aggregation), forcing your local machine to perform the aggregations.

So, imagine you have a Sales table where you keep all your sales data, and you want to analize your sales per date or per order.

CREATE TABLE demo.sales (
    order_id int,
    product_id int,
    customer_id int,
    order_date timestamp,
    quantity int,
    sales decimal,
    PRIMARY KEY (order_id, product_id)

);

By default, Tableau will show you an aggregated chart:
In a relational database, the query would be:
select order_date, sum(sales)
from table
group by order_date
But in Cassandra, the query will not have the SUM:

Here is another example:
And the query is SELECT "order_id", "quantity" FROM "demo"."sales":

Now imagine if you are quering against a real world sales table, with millions or maybe billions of records. You will retrieve, to your local machine, all those billions of records,a dn will have to summarize them.


#2 - There is no joins inside Cassandra
We know that for many reasons Cassandra does not support joins, so if you are planning to perform some joins, your machine (desktop or notebook) will have to blend the data (see the post Data Blending #1).
So, you will ask Cassandra to retrieve all the records of both tables (or more), without aggregations, with all the columns you will use for the analises + the columns you will need for the join.

For instance, you have a sales table and a customer table, and you want to see the sales per customer's email.

CREATE TABLE demo.customer (
    customer_id int PRIMARY KEY,
    email text,
    full_name text

);
CREATE TABLE demo.sales (
    order_id int,
    product_id int,
    customer_id int,
    order_date timestamp,
    quantity int,
    sales decimal,
    PRIMARY KEY (order_id, product_id)

);

Your Tableau chart is something like this:



In a relational database, your query would be something like this:
select c.email, sum(s.sales)
from customer c, sales s
where c.customer_id = s.customer_id
group by c.email

But, again, Cassandra does not support neither joins or aggregations.
So, you will perform 2 queries:
1) SELECT "customer_id", "sales" FROM "demo"."sales"
2) SELECT "customer_id", "email" FROM "demo"."customer"





And you will have to do the aggregation and the join (data blending) local in your machine.

In those examples we used a table with 10 records, but again, in a real world your table may have billions of records. 

#3 - You can not use any field you want to filter
Well, it depends on your understanding of "can" and "can not".
Because in practice, you can. Tableau will performance a select field1, fieldN from table and will apply your filter after retrieve the data. So, you "can" filter any field you want to, but keep in mind you will retrieve all the data frrm the table, and will have to filter in you local machine.

But if you filter only one value of the primary key (primary key field = <value>) then Tableau will ask Cassandra for that specific value:

SELECT "customer_id", "email", "full_name" 
FROM "demo"."customer" 
WHERE ( "customer_id" = 20 ) 
ALLOW FILTERING

Be aware of the "allow filtering".

For each primary key column that you filter more than one value, that column will be removed from the where condition.
For instance, if you filter Order id (partition key) = 1 and product id (clustering column) = 100, then the query will be:
SELECT "order_id", "product_id", "customer_id" FROM "demo"."sales" WHERE ( "order_id" = 1 ) AND ( "product_id" = 100 ) ALLOW FILTERING

You also can use >=, <=. > and < to filter the clustering columns:
SELECT "order_id", "product_id", "customer_id" FROM "demo"."sales" WHERE ( "order_id" = 1 ) AND ( "product_id" >= 100 ) ALLOW FILTERING

In other words, when you follow Cassandra's filtering rules, you will be just fine.

But if you try Order id (partition key) = 1 and product id (clustering column) in (100,200) then the query will be:
SELECT "order_id", "product_id", "customer_id" FROM "demo"."sales" WHERE ( "order_id" = 1 ) ALLOW FILTERING

However, if use any other field (even though to filter only one value), Tableau will ask Cassandra for the detailed data (select field1, fieldN from table) without the where condition.

Also, if you create a quick filter, first Tableau will ask Cassandra for all the values of that field.
For instance, if you create a quick filter for the the email field, this is the query that Tableau will perform to populate the quick filter:
SELECT "email" FROM "demo"."customer"
If you were using a relational database, the query would looks like this:
SELECT distinct "email" FROM "demo"."customer"

What about the secondary indexes? Well, if you use them to filter (only one value) they will be used on the where condition, but they are dangerous: avoid them.

So, the more appropriate title for this topic would be "you shouldn't use any field you want to filter".


Coclusion
When planning to use the ODBC driver to connect Tableau to Cassandra, be aware of all those points.
Try to use aggregated tables instead of detailed tables. Aggregated tables can be created by your application or thought Spark.
You also can use Spark SQL connector if you have a Cassandra+Spark cluster or the Hive connector as well.



If you have any suggestion or comments, please feel free to post here.


Um comentário:

  1. How True and relevant is this , Has rules for filtering and aggregation changed at Cassandra level ?, I have a similar scenario , we are using Cassandra as datastore and tableau for reporting , since it is log files the volume is growing day by day and tableau extracts have not reached a point where they are failing due to the tableau default 2hr server limit

    would appreciate If you know any cassandra tuning options or best practices

    ResponderExcluir