Tuesday, May 20, 2008

Notes for reading "The Data Warehouse Toolkit," Chapter 2

The following topics are described in this chapter.

1. The “four-step dimensional design process”

According to this book, there are four basic steps in dimensional modeling: first, select the business process; second, define the granularity of the data; third, identify the dimensions; fourth, identify the facts. Each is discussed below.

1.1 Selecting the business process
The basic first step of data modeling is to think about the scope and influence of the new model. One principle is to avoid publishing the same data multiple times. The book argues for having one model per business process rather than per department. The idea is that, with models per department, the same data may have to be published more than once, because different business processes may require similar data from the same department. However, it is not enough to just have one data model per business process; at some point in the future one still has to consolidate these business-process models into a unified model. I do agree that, given that data marts are usually required on a short time frame, focusing on a specific business process is the most practical way to go. But one must later take the serious step of consolidating the “quickly designed” data marts into a larger data warehouse. That step, as far as I can see from my experience, is missing in most industry implementations.

1.2 Defining the grain of the business process
The question is: “what does a single row in the fact table represent?” It is a question of how much detail we should go down to in the data model, and this is where we weigh the scalability of the data against the performance requirements. In many cases, if one keeps the most detailed data in the data model, then when new requirements arrive it is easy to accommodate them, because all the data is already there (otherwise a major rework is needed). On the other hand, keeping the most detailed data brings extra maintenance cost (you have to pay people to “watch” and “fix” your disks to get the performance needed to satisfy the users).
This step also helps people to understand the content of the dimensions.
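To make the grain concrete, here is a minimal sketch of what a single fact-table row looks like once the grain is declared, assuming the retail grain of one row per product per POS transaction line; the field names are illustrative, not taken from the book's actual schema:

```python
from dataclasses import dataclass

# Illustrative sketch: one row per product per POS transaction line.
@dataclass
class RetailSalesFactRow:
    date_key: int                # foreign key to the date dimension
    product_key: int             # foreign key to the product dimension
    store_key: int               # foreign key to the store dimension
    promotion_key: int           # foreign key to the promotion dimension
    pos_transaction_number: str  # degenerate dimension, kept in the fact table
    sales_quantity: int
    sales_dollar_amount: float

row = RetailSalesFactRow(20080520, 101, 7, 3, "T-944812", 2, 4.98)
```

Everything coarser (one row per store per day, say) can be rolled up from this, but not the other way around, which is why the grain decision is hard to revisit later.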

1.3 Defining the dimensions
This is about finding and grouping the descriptive categories. The tricky part is: how do we make the grouping generic enough that the model can easily be scaled to include more and different sets of data in the future?

1.4 Identifying the facts
What are we measuring? Some columns are generated by calculations over other columns. Do we need these extra columns, which seem to be a waste of disk space? It depends on the performance requirements as well as how often the data is used.
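As a small illustration of such a derived column (names are mine, not the book's), consider an extended sales amount; whether to store it or compute it at query time is exactly the disk-space versus usage trade-off above:

```python
# Derived fact: extended amount = quantity * unit price.
# It can be stored as an extra fact column or computed in every query.
def extended_amount(sales_quantity: int, unit_price: float) -> float:
    return sales_quantity * unit_price

print(extended_amount(3, 2.50))  # 7.5
```

Storing it costs one column per row but keeps the measure ready and additive for every report; computing it on the fly saves space but pushes the arithmetic into every query and tool.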

2. A case study on the retail business

I have the following reflections from the case study.
First, it is always necessary to start by understanding, from the business point of view, what exactly is required. There are many business processes in a case, but choosing the right one is the most important thing.
Second, once you have defined the grain of the data, the dimensions seem to be self-evident. Just read and analyze the definition of the grain and use the terms in that definition, for example “product,” “store,” and “promotion.” And remember that date is always a dimension in every data warehouse (otherwise you would not call it a data warehouse).
Third, in the design of fact tables, adding an additive measure is something of an act of faith: you know it will bring storage cost but still decide to do it, so you have to believe in its value. In practice, it comes down to how the value of the measure is used and how that piece of the architecture is designed to let people use the data.
Fourth, as long as you can afford the join operation, it is a good idea to have an explicit date dimension. You cannot rely on built-in date functions to tell you when a public holiday falls in a certain country; such things have to be documented in the date table.
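A minimal sketch of that point, assuming a simple date dimension with a holiday attribute (the holiday list and column names are illustrative):

```python
from datetime import date, timedelta

# Holidays are just pre-computed attributes of each calendar row; no date
# function can be relied on to know them, so they live in the table.
HOLIDAYS = {date(2008, 12, 25): "Christmas Day", date(2009, 1, 1): "New Year's Day"}

def build_date_dimension(start: date, end: date) -> list:
    rows, d, key = [], start, 1
    while d <= end:
        rows.append({
            "date_key": key,
            "full_date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "holiday_indicator": HOLIDAYS.get(d, "Non-holiday"),
        })
        d, key = d + timedelta(days=1), key + 1
    return rows

dim_date = build_date_dimension(date(2008, 12, 24), date(2009, 1, 2))
# Queries can now constrain or group on holiday_indicator directly.
```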

Fifth, I think one misleading impression the book can give is that you should always add as many attributes to the dimension tables as possible; you should not. In the physical world there is a limit on row size, and the more columns you add to each row, the worse query performance gets. One has to seriously consider whether to add a new column or to find a way to do the calculation without adding anything at all.

Sixth, apparently only the facts that “happened” are recorded in the fact table. What about the things that did not happen? For example, which products were on promotion but did not sell at all? There are multiple ways to cope with this request: either you query the products that did sell and subtract them from the promoted products, or you add rows with a lot of “0” values to the Sales fact table. Because the products that did not sell also belong to the same dimension hierarchies, what the book suggests is to add another fact table to carry this information. The difference is that this fact table does not have to be at the same grain as the Sales table. By doing this you can record the information in a way that is easy to query, and save disk cost by rolling up the grain. Note that this fact table may just look like a many-to-many table without any measures (a factless fact table).
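A minimal sketch of the idea, with a made-up promotion-coverage table (the factless fact table), kept at the same daily grain here only to keep the example small:

```python
# Factless fact table: which products were on promotion, in which store,
# on which day -- keys only, no measures.
promotion_coverage = {
    ("2008-05-20", "store-1", "prod-A"),
    ("2008-05-20", "store-1", "prod-B"),
    ("2008-05-20", "store-1", "prod-C"),
}

# The sales fact table only has rows for products that actually sold.
sales_keys = {("2008-05-20", "store-1", "prod-A")}

# "What did not happen" = coverage rows with no matching sales row.
unsold_promoted = promotion_coverage - sales_keys
print(sorted(unsold_promoted))
# [('2008-05-20', 'store-1', 'prod-B'), ('2008-05-20', 'store-1', 'prod-C')]
```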

Seventh, regarding degenerate dimensions, it is OK and quite normal to have them. In the example given by the book, the “transaction number” is simply kept in the fact table (and there could actually be a transaction dimension to record more things if needed).

Eighth, what brings the most difficulty to dimensional design is not the initial requirements but the new requirements that arrive after the initial design has been established. It takes a modeling team much more time just to work out how to fit the new requirements into the existing star-schema model. If new fields are added, what happens to the corresponding values of the records that are already there? Should each of those records simply get a “Not Available” value in the new fields? This is an area where best practices should be collected, and I do not believe there is a single method that solves all the problems; the best practices have to be selected and applied case by case.
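For instance, here is a sketch of one common answer to that question, using an in-memory SQLite table with made-up names: the new attribute is added with an explicit “Not Available” default so that existing rows stay queryable and group predictably.

```python
import sqlite3

# Illustrative dimension table before the new requirement arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, description TEXT)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Clean Fast Toothpaste')")

# New requirement: a package_type attribute. Existing rows get an explicit
# 'Not Available' value rather than NULL.
conn.execute("ALTER TABLE dim_product ADD COLUMN package_type TEXT DEFAULT 'Not Available'")
conn.execute("INSERT INTO dim_product VALUES (2, 'Sparkle Soda 12oz', 'Bottle')")

print(conn.execute("SELECT product_key, package_type FROM dim_product").fetchall())
# [(1, 'Not Available'), (2, 'Bottle')]
```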

Ninth, what should be borne in mind by dimensional modelers is that a dimensional model is by nature there for business users to query, not for systems to update efficiently. So this model is not, and should not be, highly normalized, and snowflaking is in fact a risk for dimensional models. Besides bringing extra cost for joining the different tables, you cannot even create bitmap indexes on the necessary fields. Business users will also find it difficult to see through the snowflaked tables, because there is simply too much snowflake. :)
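To make the snowflaking point concrete, here is a tiny sketch (made-up names and data) contrasting a snowflaked product dimension with the flattened star-schema version:

```python
# Snowflaked: brand and category live in their own tables, so every query
# that rolls sales up to category needs two extra joins.
category = {10: "Frozen Foods"}
brand    = {100: {"brand_name": "PolarTreats", "category_key": 10}}
product_snowflaked = {1: {"description": "Ice Cream Sandwich", "brand_key": 100}}

# Flattened (the star-schema way): the textual attributes are stored
# redundantly in the single product dimension row.
product_flat = {
    1: {"description": "Ice Cream Sandwich",
        "brand_name": "PolarTreats",
        "category_name": "Frozen Foods"},
}

# Same answer, but the flat version needs no extra joins:
p = product_snowflaked[1]
print(category[brand[p["brand_key"]]["category_key"]])  # Frozen Foods
print(product_flat[1]["category_name"])                  # Frozen Foods
```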

Tenth, I do not much like the claim that “most business processes can be represented with less than 15 dimensions in the fact table.” Perhaps Kimball is right, but an exact number brings nothing but difficulty in a modeler’s struggles with managers who do not know much about modeling but just want things done the way the book says.

3. Surrogate Keys

Surrogate keys bring a great deal of benefit to dimensional models, and bear in mind that it is best if one can simply use small integers for the surrogate keys (instead of writing a hashing function). Using surrogate keys also means that data warehouse people do not have to depend on the operational-system people for natural keys. The situation gets even worse when the people running the operational systems decide to recycle account numbers or product numbers that have been inactive for a certain period of time (while the data warehouse’s life is much longer than that period).

Sometimes you simply cannot rely on the natural keys to keep the data clean. For example, a product number plus a date can form a combined natural key, but the value of the “date” field may be unknown for a while. What do we do with such records during that period? How do we identify them?
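A minimal sketch of the usual answer, assuming the ETL layer assigns small sequential integers and reserves a special surrogate (key 0 here) for the “unknown” member; the names and the reserved value are my assumptions, not the book's:

```python
UNKNOWN_KEY = 0  # reserved surrogate for "unknown / not yet available"

class SurrogateKeyGenerator:
    def __init__(self):
        self._map = {}       # natural key -> small integer surrogate
        self._next_key = 1

    def key_for(self, natural_key):
        if natural_key is None:
            return UNKNOWN_KEY
        if natural_key not in self._map:
            self._map[natural_key] = self._next_key
            self._next_key += 1
        return self._map[natural_key]

dates = SurrogateKeyGenerator()
print(dates.key_for("2008-05-20"))  # 1
print(dates.key_for(None))          # 0 -> the row is kept, identifiable, fixable later
print(dates.key_for("2008-05-20"))  # 1 -> stable mapping, independent of source systems
```

A real pipeline also needs effective dates or versioning to cope with recycled natural keys and slowly changing dimensions; the map above only shows the basic mechanism.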

One deep performance problem with a star-schema model is the cost of join operations. So it is a wise idea to use single-column keys, such as surrogate keys, instead of compound or concatenated keys, because with the latter you have to join on multiple columns, which is much more complicated and expensive.

4. Market Basket Analysis

What is interesting in this part is that the book shows how to use the star-schema model for analysis purposes. The idea is to generate regular reports based on a join between the fact table and several dimension tables. Actually, I would think an OLAP tool would make this work much easier.
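Here is a small sketch of the basic market-basket question (“which products are bought together?”), computed straight from made-up fact rows grouped by the degenerate transaction number; in practice this would be a SQL report or an OLAP query over the star schema:

```python
from collections import Counter
from itertools import combinations

fact_rows = [
    {"pos_transaction_number": "T1", "product": "bread"},
    {"pos_transaction_number": "T1", "product": "butter"},
    {"pos_transaction_number": "T2", "product": "bread"},
    {"pos_transaction_number": "T2", "product": "butter"},
    {"pos_transaction_number": "T2", "product": "milk"},
]

# Group line items into baskets using the degenerate transaction number.
baskets = {}
for row in fact_rows:
    baskets.setdefault(row["pos_transaction_number"], set()).add(row["product"])

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for products in baskets.values():
    pair_counts.update(combinations(sorted(products), 2))

print(pair_counts.most_common(3))
# [(('bread', 'butter'), 2), (('bread', 'milk'), 1), (('butter', 'milk'), 1)]
```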
