Tuesday, 2008-05-20, Copenhagen
A new competitor to Google in the search-engine market? Yes, and it may happen any day. Recently, Powerset launched a semantic search engine that lets users query in natural language. Although the new engine has not indexed the whole internet (only Wikipedia is in the index), using natural language is definitely a good way to attract users' attention.
HP does have a big dream of becoming the next IBM. In very recent news, we have heard that HP has agreed to acquire EDS, a big IT services provider. With services, outsourcing, and new hardware (as well as software) technology under one roof, HP is becoming a great challenger to IBM.
The big idea of cloud computing has been attracting more business people's minds. The column-oriented database vendor Vertica has joined the cloud, alongside a number of database vendors like EnterpriseDB, MS, Amazon (SimpleDB), and Google (Bigtable).
Friday, 2008-05-23, Copenhagen
Microsoft and Yahoo are in ANOTHER round of talks after Yahoo's rejection of MS's takeover offer. Now MS is looking for a way to work with the search brand without acquiring it outright. Yahoo's board has received complaints after the rejection that it did not consider the deal in a rational way. Let's just wait and see what Google says about this.
Google is entering another arena in the search-engine market. Quite recently the search giant unveiled its health-related website, Google Health. This can be very good news for people who do not really know how to find good medical information on Google. What can make it difficult for Google is that people may just want to get a good medical answer from the search engine instead of going to their doctors. Oh, one more thing: there is a privacy issue if people start to put their health information on their Google Health profiles.
Tuesday, May 20, 2008
Notes for reading "The Data Warehouse Toolkit," Chapter 2
The following topics are described in this chapter.
1. The “four step dimensional design process”
According to this book, there are four basic steps in dimensional modeling: first, select the business process; second, define the granularity of the data; third, identify the dimensions; fourth, identify the facts. We discuss each in the following.
1.1 To select the business process
The first step of data modeling is to think about the scope and influence of the new model. One principle is to avoid publishing the same data multiple times. The book argues for having a model per business process rather than per department. The idea is that, with models per department, the same data may have to be published more than once, as different business processes may require similar data from the same department. However, it is not enough to have one data model per business process; one still has to consolidate these business processes into a unified model at some point in the future. I agree that, given that data marts are usually required within a short time frame, focusing on a specific business process is the best way to go. However, one must take a serious step later to consolidate the "quickly-designed" data marts into a large data warehouse. That step, as I can see from my experience, is missing in most industry implementations.
1.2 To define the grain of the business process
The question is "what does a single row in the fact table represent?" It is a question of how much detail we should go into in the data model. This is where we consider the balance between the scalability of the data and the performance requirements. In many cases, if one stores the most detailed data in the data model, new requirements are very easy to accommodate because all the data is already there (otherwise, a major rework is required). On the other hand, keeping the most detailed data brings extra maintenance cost (you have to pay people to "watch" and "fix" your disks to yield the best performance and satisfy the users).
This step also helps people to understand the content of the dimensions.
1.3 To define the dimensions
This step is about finding and grouping the descriptive categories. The tricky part is: how can we make the grouping generic enough that it is easy to scale the model to include more, different sets of data in the future?
1.4 To identify the facts
What are we measuring? Some columns are generated by calculations over other columns. Do we need these extra columns, which seem to be a waste of disk space? It depends on the performance requirements as well as how often the data is used.
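To make the four steps above concrete, here is a minimal sketch of my own (not from the book) using Python's built-in sqlite3 module; all table and column names are invented for the example. The business process is retail sales, the grain is one row per product per store per day, the dimensions are date, product and store, and the facts are quantity and amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 3: the dimensions, each with a small integer surrogate key.
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, full_date TEXT, weekday TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);

-- Steps 1, 2 and 4: the retail-sales fact table at the grain of one row
-- per product per store per day, carrying the measured facts.
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim(date_key),
    product_key   INTEGER REFERENCES product_dim(product_key),
    store_key     INTEGER REFERENCES store_dim(store_key),
    quantity_sold INTEGER,
    sales_amount  REAL
);
""")
```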
2. A case study on the retail business
I have the following reflections from the case study.
First, it is always necessary to start by understanding, from the business point of view, what exactly is required. There are many business processes in a case, and choosing the right one is the most important thing.
Second, once you have defined the grain of the data, the dimensions seem to be self-evident. Just read and analyze the definition of the grain and use the terms in that definition, for example, "product," "store," and "promotion." And remember that date is always a dimension in every data warehouse (otherwise you would not call it a data warehouse).
Third, in the design of fact tables, adding an additive measure is partly an act of faith: you know it will bring storage cost but decide to do it anyway. In practice, it comes down to how the value of the measure is used and how the architecture is designed to let people use the data.
Fourth, as long as you can afford the join operation, it is a good idea to have a date dimension. You cannot rely on built-in date functions to tell you when a public holiday falls in a certain country; such things have to be recorded in the date table.
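As a small sketch of this point (my own example, reusing the hypothetical date_dim from the sketch above): the holidays are simply loaded into the date dimension as data, so no query ever has to compute them. The holiday list here is made up for illustration.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical holiday list for the example; a real table would be maintained per country.
HOLIDAYS = {date(2008, 1, 1): "New Year's Day", date(2008, 12, 25): "Christmas Day"}

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,   -- surrogate key in yyyymmdd form, e.g. 20080101
    full_date TEXT, weekday TEXT, is_holiday INTEGER, holiday_name TEXT)""")

day = date(2008, 1, 1)
while day <= date(2008, 12, 31):
    conn.execute("INSERT INTO date_dim VALUES (?, ?, ?, ?, ?)",
                 (int(day.strftime("%Y%m%d")), day.isoformat(),
                  day.strftime("%A"), int(day in HOLIDAYS), HOLIDAYS.get(day)))
    day += timedelta(days=1)

# A report can now filter on is_holiday = 1 with a plain join, no date arithmetic needed.
```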
Fifth, I think one point where the book risks being misleading is that you should not always add as many attributes to the dimension tables as possible. In the physical world there is a limit on row size, and adding more columns to each row makes query performance worse. One has to seriously consider whether to add a new column or to find a way to do the calculation without adding anything at all.
Sixth, apparently only the facts that "happened" are recorded in the fact table. What about those that did not "happen?" For example, which products were on promotion but did not sell at all? There are multiple ways to cope with this request: either you compare what was on promotion against what was actually sold and take the difference, or you add rows to the Sales fact table with a lot of "0" values in them. Because the "did not happen" side also rolls up through hierarchies, what the book suggests is to add another fact table to carry this information. The difference is that this fact table does not have to be at the same grain as the Sales table. By doing this, you record the information in a way that is easy to query, and you save disk cost by rolling up the grain. Note that this fact table may just look like a many-to-many table without measures (a factless fact table).
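Here is a sketch of that coverage-table idea as I understand it (my own illustration with hypothetical table names, not the book's code): a factless promotion_coverage table records which products were on promotion in which store on which date, and a left join against the sales fact finds the ones that never sold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Coverage table: one row per product on promotion in a store on a date (no measures).
CREATE TABLE promotion_coverage (date_key INTEGER, product_key INTEGER,
                                 store_key INTEGER, promotion_key INTEGER);
-- Regular sales fact at the same or a finer grain.
CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER,
                         store_key INTEGER, quantity_sold INTEGER);
""")

# "What was promoted but did not sell?" = coverage rows with no matching sales row.
unsold = conn.execute("""
    SELECT c.date_key, c.product_key, c.store_key
    FROM promotion_coverage c
    LEFT JOIN sales_fact s
      ON s.date_key = c.date_key AND s.product_key = c.product_key
     AND s.store_key = c.store_key
    WHERE s.product_key IS NULL
""").fetchall()
```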
Seventh, regarding degenerate dimensions, it is OK and quite normal to have them. In the book's example, the "transaction number" is simply kept in the fact table (and a transaction dimension could actually be added to record more attributes if needed).
Eighth, what brings the most difficulty to dimensional design is not the initial requirements but the new requirements that arrive after the initial design has been established. It takes a modeling team much more time just to work out how to fit the new requirements into the existing star-schema model. If new fields are added, what happens to the corresponding values of the records that are already there? Should those records always carry a "not available" value in the new fields? This is a place where best practices should be collected, and I do not believe there is a single method that solves all the problems. The best practices have to be selected and applied case by case.
Ninth, what dimensional modelers should bear in mind is that the nature of a dimensional model is to let business users query it, not to let systems update it efficiently. So this model is not, and should not be, heavily normalized, which means snowflaking is in fact a risk for dimensional models. Besides bringing extra cost for joining the different tables, you may not even be able to create bitmap indexes on the necessary fields. Business users will also find it difficult to see through the snowflake tables because there is simply too much snowflake. :)
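A small sketch of the cost I mean (my own example, hypothetical tables): in a flattened dimension the category sits directly on the product row, while a snowflaked design pushes it into a separate category table and forces every report through an extra join.

```python
# Flattened dimension: category is an attribute of product_dim, one join suffices.
flat_query = """
SELECT p.category, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
GROUP BY p.category
"""

# Snowflaked dimension: the same report needs one more join through category_dim.
snowflake_query = """
SELECT c.category_name, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim  p ON p.product_key  = f.product_key
JOIN category_dim c ON c.category_key = p.category_key
GROUP BY c.category_name
"""
```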
Tenth, I do not quite like the claim that "most business processes can be represented with less than 15 dimensions in the fact table." Perhaps Kimball is right, but an exact number brings nothing but difficulty in a modeler's struggle with managers who do not know much about modeling but just want things done the way the book says.
3. Surrogate Keys
Surrogate keys bring a lot of benefit in dimensional models, and bear in mind that it is best if one can just use small integers for the surrogate keys (instead of writing a hashing function). Using surrogate keys also means that data warehouse people do not have to rely on the operational-system people and their natural keys. The situation gets even worse when the operational-system people decide to recycle account numbers or product numbers that have been inactive for a certain period of time (while the data warehouse's life is much longer than that period).
Sometimes you simply cannot rely on the natural keys to keep the data clean. For example, a product number plus a date can form a compound natural key, but the value of the "date" field may be unknown for a while. What do we do with such records during that period? How do we identify them?
One deep performance problem with a star-schema model is the cost of the join operations. So one wise idea is to use single-column keys, like surrogate keys, instead of compound or concatenated keys, because with the latter you have to join on multiple columns, which is much more complicated.
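Here is a minimal sketch of how I picture the surrogate-key assignment (my own example, not the book's code): the warehouse hands out small integers in load order, keeps its own map from natural keys to surrogates, and can reserve a special surrogate for the "date not yet known" case mentioned above.

```python
import itertools

# Assign small integer surrogate keys independently of the source system's natural keys.
_next_key = itertools.count(1)
_key_map = {}          # natural key -> surrogate key, owned by the warehouse
UNKNOWN_DATE_KEY = -1  # reserved surrogate for rows whose date is not yet known

def surrogate_key(natural_key):
    """Return the surrogate key for a natural key, assigning a new one if unseen."""
    if natural_key not in _key_map:
        _key_map[natural_key] = next(_next_key)
    return _key_map[natural_key]

# The fact table then joins on one small integer column instead of a compound
# (product number, date) natural key, and recycled source keys can be remapped here.
print(surrogate_key("PROD-0042"))   # 1
print(surrogate_key("PROD-0042"))   # 1 (same product, same surrogate)
print(surrogate_key("PROD-0099"))   # 2
```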
4. Market Basket Analysis
What is interesting in this part is that the book shows how to use the star-schema model for analysis purposes. The idea is to generate regular reports based on joins between the fact table and several dimension tables. Actually, I would think an OLAP tool makes this work much easier.
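As a sketch of what such an analysis query could look like (my own illustration, assuming the hypothetical sales_fact also carries the transaction_number degenerate dimension from point seven): a self-join on the transaction number counts how often two products end up in the same basket.

```python
# Pairs of products appearing on the same transaction (hypothetical schema):
# the transaction_number degenerate dimension ties the line items of one basket together.
basket_sql = """
SELECT a.product_key AS product_a, b.product_key AS product_b, COUNT(*) AS times_together
FROM sales_fact a
JOIN sales_fact b
  ON a.transaction_number = b.transaction_number
 AND a.product_key < b.product_key      -- avoid counting each pair twice
GROUP BY a.product_key, b.product_key
ORDER BY times_together DESC
"""
```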
What's new in week 20, 2008
Wednesday, 2008-05-14, Copenhagen
What can we guess from Microsoft's failure to acquire Yahoo? One possible answer is that Microsoft has to try harder on cloud computing to protect its success in the enterprise market. Let us be frank: MS is not a strong player in the online-services market, while Google started in that market (and is among the leaders now). Given the assumption that online services (ideas like SaaS and cloud computing) will be the next market trend for enterprise IT, MS has to make a solid move in order to take the lead in that direction (which is always viewed as a great piece of the cake). Without Yahoo's reputation, MS has to find another way to attract more online users' eyeballs.
Online virus protection is one of the biggest trends in the security market. Given that most virus infections (and/or security breaches) happen over internet protocols, this market is indeed a great place for innovation. McAfee and Yahoo just announced a beta launch of a new technology, called SearchScan, to make Yahoo search much safer by warning of dangerous sites before the user even clicks on them. Wait a minute: isn't it true that when a site shows up in Google's results, there is already a warning if the site is believed to contain malicious code? So there is not just one player in this arena.
Friday, 2008-05-16, Copenhagen
Quite a few major Internet players, like Microsoft and Yahoo, are trying to put profile sharing into their service portfolios. The idea is that once a user has created an ID and profile information on one of their websites, the information can be shared with other sites with one mouse click. The convenience is straightforward, but many people are concerned that this will cause a certain amount of privacy leakage (just like what happened to Facebook.com).
Good news for VW fans! Volkswagen and Sanyo are working on a joint venture to deliver batteries for hybrid and electric cars. According to their plan, VW aims to start producing cars with the batteries by 2012 (which is not far away). It's a bit difficult to imagine what such a battery will look like, given that VW cars are normally heavier than the competition.
Friday, May 9, 2008
Reading notes for "Universal Metadata Models" Part II
Part II Universal Meta Models for the Meta Data Repository Environment
The rest of this book is dedicated to explaining, in detail, a very detailed meta model for a large enterprise. The meta model covers not only enterprise system data but also IT portfolio management, XML, and business metadata.
For example, the meta model for enterprise system data includes six subject areas: data package, data relationships to structures, data movement and transformation, expression transformation, transformation operational history, and data profiling.
What's new in week 19, 2008
Monday, 2008-05-05, Copenhagen
A slightly funny but interesting piece of news is that Intel, Texas Instruments, Panasonic and Infineon have formed an alliance to promote home networks for movies, music and pictures over domestic wiring. One question that comes straight out of my mouth is: why not "WiFi?" It is true that telephone lines are almost everywhere in a house, but you still need to consider a little extra construction work (like drilling holes and adding a few extra nails). Why not just go for wireless technology?
Friday, 2008-05-09, Copenhagen
WiMax will come to us (I mean the US) very soon and will hopefully bring new development opportunities to the country's awkward economic situation. Intel and Google are two big proponents of WiMax technology. The traditional ISPs and telephone companies will start to worry about their existence in the next few years.