Your data model makes a big difference in the performance of
your QlikView application. Of course, if
you have been working with QlikView for any length of time, that is pretty
obvious. Here is a situation you might
come across that I thought I would share.
The client was trying to monitor costs for production items
on a per-unit basis as they travelled through the various cost centers. There are several hundred products and a
few dozen cost centers, with each combination having a per-unit cost. It should then be a fairly easy exercise to
find the extended costs as units are sold, “slicing and dicing” by things like
customers, dates, cost centers and other groupings.
Being fairly rational, I figured there were a couple of ways to do this:
Product | Cost Center | Cost per Unit
------- | ----------- | -------------
A       | Assembly    | 0.0233
A       | Packing     | 0.0324
A       | Q&A         | 0.0133
A       | Storage     | 0.0544
B       | Assembly    | 0.0766
B       | Packing     | 0.0222
B       | Q&A         | 0.0652
B       | Storage     | 0.0143
This is the “thin” table approach. Load all the costs into one column and then
use set analysis in your expressions to isolate each cost center, for example:
SUM({<[Cost Center]={'Assembly'}>} [Cost per Unit] * [Units Sold]).
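To make the thin approach concrete, here is a minimal load-script sketch. The table names (CostRates, Sales) and the inline data are illustrative assumptions, not the client's actual source:

// Thin rate table: one row per Product / Cost Center combination (illustrative names)
CostRates:
LOAD * INLINE [
Product, Cost Center, Cost per Unit
A, Assembly, 0.0233
A, Packing, 0.0324
A, Q&A, 0.0133
A, Storage, 0.0544
B, Assembly, 0.0766
B, Packing, 0.0222
B, Q&A, 0.0652
B, Storage, 0.0143
];

// Sales fact table, assumed to associate with CostRates on Product
// Sales:
// LOAD Product, Customer, [Sale Date], [Units Sold] FROM ... ;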
Product | Assembly | Packing | Q&A    | Storage
------- | -------- | ------- | ------ | -------
A       | 0.0233   | 0.0324  | 0.0133 | 0.0544
B       | 0.0766   | 0.0222  | 0.0652 | 0.0143
Here is the “fat” table approach. Create a cross table with a column for each
cost center. This results in a fairly
wide fact table in your data model, but there is no need for set analysis in your
expressions because you reference each column directly, for example: SUM(Assembly * [Units Sold]).
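If the costs arrive in the thin layout, one way to build the fat version in the script is to pivot the thin table with a loop of joins. This is only a sketch; the table, field and variable names (CostRates, CostRatesWide, vCC) are assumptions for illustration:

// Start the wide table with one row per Product
CostRatesWide:
LOAD DISTINCT Product
RESIDENT CostRates;

// Add one column per cost center (hard-coded list for the sketch)
FOR Each vCC in 'Assembly', 'Packing', 'Q&A', 'Storage'
  LEFT JOIN (CostRatesWide)
  LOAD Product,
       [Cost per Unit] AS [$(vCC)]
  RESIDENT CostRates
  WHERE [Cost Center] = '$(vCC)';
NEXT vCC

DROP TABLE CostRates;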
Now, I have always been told as a rule of thumb, “Create narrow, long fact tables” and “When you have
two like facts, concatenate them together rather than having separate tables
with a link.”
Hence, I originally thought the “thin” table approach
would be best. This was a disaster. The model worked, but it performed mind-numbingly
slowly. How could that be, when I used as
few columns as possible?
I finally realized that the problem here is that these costs are
not facts. They are really
dimensions. Sure, they are numbers that
we will calculate against, but they are still dimensions that either do not
change or change only slowly (with the addition of date fields). I was so used to thinking that any column
full of numbers we calculate with must be a fact. Welcome to database design 101.
I developed a small example application to illustrate both
models and the performance differences between them. There are about 150,000 sales records, 10
cost centers and 80 products, so there
are about 800 costs that needed to be housed either in one long column or in 10
columns of 80 rows each. I ended up with this data
model (so that I could create two tables, each using one of the cost sets).
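For anyone who wants to reproduce a similar test, here is a rough sketch of how comparable random data could be generated in the script. The sizes match the description above, but the field names and value ranges are assumptions, not the actual example application:

// 80 products x 10 cost centers, every combination costed: 800 thin rows
CostRates:
LOAD
    'P' & Ceil(RecNo()/10)              AS Product,
    'CC' & (Mod(RecNo()-1, 10) + 1)     AS [Cost Center],
    Round(Rand(), 0.0001)               AS [Cost per Unit]
AUTOGENERATE 800;

// 150,000 sales rows, each for a random product
Sales:
LOAD
    'P' & (Floor(Rand()*80) + 1)        AS Product,
    Floor(Rand()*100) + 1               AS [Units Sold],
    Date(Today() - Floor(Rand()*365))   AS [Sale Date]
AUTOGENERATE 150000;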
I then created a pivot table from the “thin” table using set
analysis, and a similar pivot table from the “fat” set using the same
expressions without the set analysis, referring to the respective data
columns.
I also created a Memory statistics file to highlight the
difference in calculation time between the two objects:
You can see that the “fat” set is almost 5 times as
efficient in this example. And look at
the memory advantage in bytes!
So maybe next time I will take a more thoughtful look at the
data before deciding on a model, because what appears to be a fact can really be
a dimension. And I should always remember
that, regardless of the type of data, sometimes more columns really are
better.