Tuesday, March 18, 2008

Notes for reading "The Data Warehouse Toolkit"

The second edition of "'The Data Warehouse Toolkit" by Ralph Kimball and Margy Ross is a great book about dimensional modeling. It is presumed as an industry-must-read material by most data warehouse people.

Here comes the reading notes for Ch. 1

Chapter 1 Dimensional Modeling Primer

Do you need a background to work with data warehouse? Yes, definitely. This whole chapter is talking about what you have to understand to be able to understand the data warehouse.

The first thing to clarify is the difference between "operational systems" and the data warehouse. In general, an operational system deals with one record at a time. It takes orders, register customers, or log complaints. It does not do a lot of summary or aggregations or dash board things like data warehouse does. It is a rigid system. It works with transactions if necessary. It repeats the same procedures over and over to finish the same business processes. Operational systems also keep historical records of data, but for a different purpose from the data warehouse. These historical records are only used for validation, recovery, but not for summarization. Data warehouse is where the historical data is re-organized, conformed, and summarized. Data warehouse is used in a more dynamic way. I admit that, there are quite a lot of similar reports that a data warehouse must present. But the power the data warehouse is that, when users come up with new questions, it is able to answer the questions easily. When the same thing happens to the operation systems, that requires a re-development of the platform which is very expensive.

So, what is the role of a data warehouse in an enterprise? There are following answers.

a. It should make all the data "easily accessible" which means a data warehouse is not another castle in the enterprise that everybody must try hard to learn and understand. It should be easily understood and used by most business folks.

b. It is a place to keep all the data consistent. Nowadays, people are talking about "one version of the truth." The data warehouse is definitely the best place to keep the truth of all the data.

c. The data warehouse should be ready for any changes. That is also one big difference between the data warehouse and operational systems.

d. As I quote from the book. "An organization's informational crown jewels are stored in data warehouse." The data warehouse must have effective control on the confidential information of the enterprise.

e. The data warehouse must ensure the correctness and completeness of the information it contains in order to serve the decision-making of the enterprise.

f. Once created, the data warehouse must be used. So the data warehouse must be widely accepted by the business community in the beginning of its life cycle.

So, to make a successful data warehouse, it is important that your team must take knowledge and skills from both DBA side and MBA side.

We can try to think data-warehouse-publishing-data as the process of an editor trying to publish a magazine. Apparently you need to talk to the audience, understand their attention, and provides the right information at the right time. Usually you have to come to people to collect new requests for improvement. You need to find new source of information. You need to develop and manage a good network of people working with the magazine. You need to keep all relevant people (esp. those at business sides) happy. Data warehousing is about publishing the right data at the right time to the right people.

So, if we begin to talk about the data warehouse (I mean, seriously, not things around the data warehouse but the things very nearby and inside the data warehouse). What are the parts that make up a complete data warehouse environment? There are four distinctive types of components, source systems, data staging area, data presentation area, and data access tools.

Let us discuss each in the following.

1. Source systems. What can be source systems? I would say any applications that deliver data to the data warehouse. Is source system equal to operational system? The answer may be “no,” because nowadays people are talking about collecting data from Internet or certain ERP, CRM systems (which can be assumed to be a kind of BI appliance) into the data warehouse. But most of the source systems are operational ones. If an enterprise takes efforts to do EAI (enterprise application integration) which means these systems are re-engineered to have a consistent view on processes or functionalities or data or all of them, the data warehousing task will become much easier.

2. Data staging area. I like very much the example provided in the book. Just imagine the data staging area as the kitchen of a restaurant. One thing to keep in mind is that this is a both storage area and an area of many processes. By “processes” I mean those that do the extract-transform-load (ETL) jobs. It is, as described in the book, everything between the source systems and the data presentation area. One thing that normally should not happen is to let the customers come into the kitchen or eat directly in the kitchen. That’s a rule-of-thumb. As to the book’s point of view, those in industry that talks about “enterprise data warehouse” are actually talking about something in the staging area. A more general meaning of enterprise data warehouse includes both the staging area and the presentation area and perhaps also the source systems.

3. Data presentation area. The usage of dimensional model is a bit kind of nature because people tend think of this as a very simple way of understanding things. So you have to accept that dimensional modeling is successful and has to be the only modeling way used at the data presentation area. Regarding the data marts in the presentation area, I would agree that the “bus” structure is a kind of natural choices when it comes to different BI applications with great challenges to conform everything into single dimensions and facts. However, there has to be a place where the “one version of the truth” of the data is kept. If you cannot hold them in the data presentation area, then they have to be at the data staging area. That is perhaps why a lot of people nowadays are talking about establishing a RDM 3NF model at an EDW before everything is mapped into dimensional models. As we all know that Start Schema is used very widely to support dimensional model in relational database. Actually another way of making the presentation area is the multidimensional database or OLAP. I believe that, not in the far future, OLAP applications will become more and more mature so that the “star-schema” time will be over very soon.

4. Data access tools. This is normally called BI applications nowadays. Excel and the front-end query tools like SQL Server- Query analyzer are also part of the data access tool.

Are there other possible components? Yes according to the current trend. Examples include metadata repository and ODS and data quality tools.

Regarding metadata, it has been reaching a stage that the whole data warehousing industry realizes its importance and quite a few tools are emerging to support centralized, unified metadata framework. Most people knew that metadata is of great importance but did not realize how bad it could be without doing anything about it. That was why a lot of enterprises are eagerly looking for a metadata solution. To work out metadata management, the first thing is the scope, i.e., how much you want the metadata framework to be. You need a complete conceptual framework to support the implementation of the metadata system. How to define the scope of metadata framework? That depends on your ambition and budget. Do you want to have a framework that is scalable for the next 20 years? Or just something that is able to help you in the next 5 years? Not a lot of people are mentally strong enough to support a very long-term metadata framework. At least, you need to figure out the relevant metadata around the data warehouse, and to corral, catalog, integrate these varieties of metadata. And this is like a resource library. Compared to the dimensional data modeling, metadata model is much more complicated.

Next, ODS, operational data store. ODS normally stays at the data staging area. It is kept frequently updated and it is a place where the data from source systems are somewhat integrated. It is a database with 3NF design. Sometime ago, I was confused by the ODS and the “enterprise-data-warehouse” that I described above. In fact, if you have an 3NF relational data model at the EDW, this EDW is actually part of the data staging area and is actually equally to the ODS to some extent. The difference between ODS and the “EDW” is where you put a lot of business calculations defined by the BI applications. There should not be so many such calculations between the source systems and the ODS. People need ODS normally because they need some kinds of immediate report which needs to integrate a lot of data from difference source systems. Real-time BI may be a requirement to have ODS. There are trends that people put ODS as a specially-administrated part of the conventional data warehouse.

Vocabulary of dimensional modeling

A fact table is where the numerical performance measurements of the business are stored. Normally measurement data is the largest portion of any data mart. What is important for defining a fact table is the grain of the data. It must comply to the same grain of all the relevant dimension tables. What can become difficult to build a fact table is to decide the columns in the fact table. There are additive facts and semi-additive facts that seem to be difficult to put, either in a fact table or a dimension table. It is also possible that a fact can be textual type. But normally developers would try to put it to dimensional tables.

Dimension tables contain the description of business. Each dimension table may have more than 50 attributes. These attributes are used to divide, group, merge different business entities which is then joined with the fact tables to have a business-oriented calculation. The result of such calculation may lead to a report, a performance report, a business decision, or a new business product.

There are few things that always stay around the dimension table design. a) Operational codes (suffix, etc.) b) The hierarchy behind a business area, such as family relations, organization relations, etc., is normally stored by the dimension table (although it brings redundancy) c) A snow flake design may let the business understanding a bit clearer, but it is definitely a performance disaster. d) The granularity of data must be clarified before the data model is designed. Normally people come to the finest grain as much as possible. e) There is also an issue about time and updates, i.e., the so-called slowly-changing-dimension (SCD) problem.

The dimensional model, compared to an ER diagram, may be simpler. Normally ER diagrams combines and conforms to a lot of business scenarios together. When making dimensional model based on the ER diagrams, those many-to-many relationships in the ER diagrams that contain numeric and additive non-key fields are mostly facts and should be put into fact tables. The remaining tables then should be de-normalized into flat tables like the dimension tables.

There are quite a few things to be careful when doing dimensional modeling in the industry.

1. The grain of data in the dimensional model should be the finest, not for summary. And the dimensional model should have data of the whole history.
2. One should be very careful with how to scope the dimensional models. It can be departmental or process-oriented (i.e., one model for one business process). Some people support the approach of spide-web (or Hub-spoke) approach that you always have a central place to feed data to all the data marts, but Kimbal seems to disagree with this idea and support the idea of multiple feeds. I would agree with a hub-spoke way because, if we look at the data from a senior management point of view, it is very important that you have an easy way of managing the data flow and the “bus” architecture seems to bring a lot more correlations and connections than the “hub-spoke” one. And it is better to use the “hub-spoke” if you are serious about the “one-version-of-the-truth.”
3. Dimensional model can be applied to many different industries.
4. It is better to have the usage pattern when you are designing a dimensional model. But it is not compulsory. A dimensional model can still be successful without knowing how it is going to be used exactly in the beginning of the design. Anyways, the designer can coach the user.
5. Be focused more on business requirements and goals when developing a data warehouse as they are more important than technologies.
6. Make sure that you include influential, accessible, and reasonable business sponsors in the project.
7. It may be more useful to consider an iterative project process to develop a data mart.
8. Make sure that the data presentation area is considered equally important to the staging area by th project, and allocate time on the data presentation area as much as you can.
9. Be aware that, many requirements and analytics around the data warehouse is changing over the time.
10. The success of a data warehouse is ultimately decided by the users.

This the end of my note on Ch. 1.

No comments: