Monday, 2008-06-30, Copenhagen
Launching location-based services has been a big stream for many techies and academics, yet we have not seen a real success for LBS since the idea was born. Now Nokia has just said that it has agreed to buy the social networking company Plazes, which provides location-aware services that people can use to plan, record, and share social activities. If Nokia is starting to push this idea, what will the ISPs do? Allow the protocols or not?
Facebook is coming to China. The big social networking site has just launched a mainland Chinese version, zh-cn.facebook.com. Well, knowing that there have been quite a lot of Facebook-like sites within China for the past years (and some of them are very successful), let us cross our fingers and see whether Facebook will have the same destiny as eBay.
Wednesday, 2008-07-02, Copenhagen
We have been dreaming of having the mobile phone as the only medium for doing ANYTHING, such as paying for a ticket, buying food, reserving an airline ticket, etc. Now, with the ISPs' and vendors' help, this is becoming true. T-Systems and Nokia have started offering mobile ticketing in Germany. This is especially useful for countries like Germany, where people tend to stay outside of what's happening in the world and only trust what TV says.
Skype has been out of our sight for a while. What has happened since the company was acquired by eBay? Well, the CEO recently said that Skype has been receiving full help and support from its parent eBay and that both are working on integrating Skype with PayPal. We can see that this is a good idea in the sense that people can actually negotiate when they are selling or buying things (like what we did 50 years ago in bazaars). But isn't it true that after bundling Skype with PayPal, Skype itself becomes more or less a sub-product in eBay's brand, and that will change people's image of Skype as a tool independent of any platform?
Monday, June 30, 2008
Friday, June 27, 2008
What's new in week 26, 2008
Monday, 2008-06-23, Copenhagen
A very funny piece of news: two Belgian developers designed a video game concept called Place to Pee that relies on players hitting sensors in urinals to control the gameplay. Apparently the great idea came from the country with the best beers, because people there have to spend a lot of time in the bathroom. The games are for both men and women. Now bathrooms will be even more filled up…
We have seen Google challenging MS on collaboration suites. Now another player has joined the game: Adobe. Adobe has recently formally launched its free online collaboration tool suite, which includes word processing, file sharing, and web conferencing tools. Let's wait and see how Adobe can get into the market already occupied by Google and MS.
How do you receive a BI report while traveling with your mobile? In most cases, if you have BO, MicroStrategy, or Cognos tools in the company, these tools are integrated with BlackBerry very well. There are also other possibilities, such as Windows smartphones, given that your company also uses the Microsoft BI tool suites. Another BI vendor, Information Builders, can actually work with any mobile device with browser support. It seems that the BlackBerry is indeed a tool well integrated with the business world. It will take a very long time for MS to enter the same market with their smartphones.
Tuesday, 2008-06-24, Copenhagen
Is SOA the ultimate way to go for all enterprises? Well, it depends pretty much on how you work with services. Sometimes services can be very difficult to design and package. If you package basic functionalities into services, they do not have any business meaning, which makes the re-usability pointless. If you try to package things into more business-oriented services, then it is very possible that many services overlap with each other, meaning that they utilize similar basic functionalities, which may bring ripple effects when the performance is tuned according to one specific need or when changes are made to the basic functionalities. Especially in legacy systems, packing legacy applications into services makes the picture even more complicated. One can hardly find easy ways to separate hard-coded functionalities according to different business needs.
So, is SOA the right way to go?
What is the best way to grow for social networking sites like Facebook? Well, opening up can be the answer. Facebook has been opening its application development extensions to all developers around the world, and people are adding more than 100 new applications to Facebook per day. It is as if there are always a lot of parties or other fun stuff coming up every day. How can Facebook lose its audience? Never! Well, Google is looking at the same initiative and has started open-sourcing its social network protocols for developers. We'll wait and see what happens next.
Friday, 2008-06-27, Copenhagen
Here is some news about cloud computing. Data center operator Terremark just launched its “Enterprise Cloud” platform, a completely managed platform for full-time operation of online business infrastructure. As we can see from this news, hardware and system providers are now thinking of platforms that can support global cloud computing. There is not just one computing center as the main server, but quite a few centers located globally, such that 24/7 cloud computing is definitely not a hardware problem.
Friday, June 20, 2008
What's new in week 25, 2008
Friday, 2008-06-20, Copenhagen
Allen Systems Group (ASG) is in negotiations over another major acquisition. ASG is one of the largest privately owned software companies in the world. There is no doubt that this acquisition will lead to an improvement of its flagship business process management technology in the market.
High-performance computing is definitely a continuously growing market in the enterprise world. As an IDC expert said recently, the HPC market is worth over 10 billion USD nowadays and is still growing over 10 percent per year. One interesting observation is that Microsoft also has over 1000 developers working at its HPC lab. So even though the current market leaders are HP and IBM, or perhaps Sun, other vendors are eagerly joining the competition.
More enterprise vendors are looking into how their products can have a Google version. For example, BI vendor Panorama has launched its BI tools on top of Google's spreadsheet application. It seems that Google is also becoming the name of a certain kind of OS.
Tuesday, June 17, 2008
Notes for Chp. 4 of "Beyond Software Architecture"
Chapter 4. Business and License Model Symbiosis
Software vendors normally have one or more license models for each of their products. Why do we need a license? Because software is different from solid objects: you cannot stop people from copying it, redistributing it, or even reverse-engineering it. A license is a legal way of protecting the product from abuse while still letting people enjoy not only the product but also the services around it.
There are several common software business models. For example, users can buy access to or use of an application for a period of time; they can be charged a percentage of the cost saved by using the software (enterprises tend to be resistant to this model); they can be charged per transaction; or the vendors can be more precise by metering the users' access to different resources (for example, the number of concurrent users). Users can also be charged for the hardware that the software runs on instead of for the software itself. Another way is used by the open source tool vendors: an open source tool is free, but if you want specialized services on top of the tool, they come with a price.
I have two reflections on the pricing models.
First, modern software tends to provide more features with online access. One example is anti-virus software, which normally needs to update its virus library regularly. By using the Internet, one can find certain ways of protecting the software from abuse. In fact, in the architecture design process, one must try to include this business model in the technical design.
Second, software vendors must design their architecture to have specialized support for more precise pricing models, such as metering. Not all software can be priced by metering, but it is preferred by some customers.
There can be different rights and restrictions (most of them very technical) associated with each type of business model. Again, the tarchitecture is very important in embedding such rights and restrictions in the technical design.
So, how can tarchitecture help the business model? I am sure that many technical architects have quite a lot of ideas in mind. The following are things one should always be careful with.
1. The most important thing is to capture the necessary data for the business pricing. This has to happen within the architecture.
2. Provide the necessary reports so that both the vendors and the customers can be aware of the details of the cost.
3. Put mechanisms in place to enforce the business model in case the license is violated.
4. Sometimes the price can also be linked with the "-ilities," such as scalability and reliability. The architecture should be able to support this.
5. A successful tarchitecture must also try to help customers save their budget on other things, such as the hardware investment.
6. Vendors should regularly adjust the parameters used in the pricing models. Everything changes over time, especially in the IT world.
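Points 1 to 3 above can be sketched in a few lines of code. The following is a minimal, hypothetical metering sketch (the resource names, rates, and class are made up for illustration, not taken from the book): it captures usage data inside the architecture, produces an itemized cost report, and enforces a concurrent-user limit from the license terms.

```python
from collections import defaultdict

class UsageMeter:
    """Minimal sketch of a metered pricing model (all names and rates are made up).

    Captures usage data (point 1), reports itemized cost (point 2), and
    enforces a concurrent-user limit from the license terms (point 3).
    """

    def __init__(self, rates_cents, concurrent_limit):
        self.rates = rates_cents        # price per unit of each resource, in cents
        self.usage = defaultdict(int)   # captured usage data, keyed by resource
        self.active_users = set()
        self.concurrent_limit = concurrent_limit

    def login(self, user):
        # License enforcement built into the architecture itself.
        if len(self.active_users) >= self.concurrent_limit:
            raise PermissionError("concurrent-user limit reached")
        self.active_users.add(user)

    def record(self, resource, units=1):
        self.usage[resource] += units

    def report_cents(self):
        # Itemized cost, visible to both the vendor and the customer.
        return {r: n * self.rates.get(r, 0) for r, n in self.usage.items()}

meter = UsageMeter({"api_call": 1, "report": 50}, concurrent_limit=2)
meter.login("alice")
meter.record("api_call", 120)
meter.record("report", 3)
print(meter.report_cents())  # {'api_call': 120, 'report': 150}
```

The rates are kept in integer cents so the report is exact; a real billing system would of course add persistence and tamper protection.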
There are many ways of enforcing the licensing models. One can always try to link the architecture with online access to ensure that people are serious about using and paying for the software. However, up to now, I have not heard of any 100 percent secure licensing model if there is no law to protect the copyright of software.
Finishing with the tarchitecture, let us get back to the marchitecture. In fact, in the beginning of a specific software market, vendors can make the business models very simple so that customers can easily understand them and decide to buy. The idea is, of course, to occupy the market first.
The maturity of the market decides on the business model, indeed.
Monday, June 16, 2008
What's new in week 24, 2008
Thursday, 2008-06-12, Copenhagen
Another high-performance, parallel processing database? Aster Data Systems, started three years ago by three Stanford students, is now entering a period of growing its business with a brilliant product. One of its well-known customers is MySpace. The parallel processing database has incorporated quite a few innovations, like the data partitioning algorithm POD. We will see how this database engine goes if one of the giant database players has the intention to acquire it. So Teradata may not always be the leading player in parallel databases.
Google has recently announced the availability of a new Google application, called Google Sites, which lets users create their own web pages in a WYSIWYG style. This is another cloud computing application from Google. It seems that Google will attract quite a lot of SOHOs' attention.
Friday, May 23, 2008
What's new in week 21, 2008
Tuesday, 2008-05-20, Copenhagen
A new competitor to Google in the search engine market? Yes, and this may happen every day. Recently, Powerset launched a semantic search engine so that users can input natural language to query for things. Although not all of the Internet's content has been indexed by this new engine (only Wikipedia is in the index), using natural language is definitely a good idea to attract users' attention.
HP does have a big dream of becoming the next IBM. In very recent news, we have heard that HP has successfully acquired EDS, a big IT services provider. With services, outsourcing, and new hardware (as well as software) technology alliances, HP is becoming the great challenger to IBM.
The big idea of cloud computing has been attracting more business people's minds. The column-based database vendor Vertica has joined the cloud along with a number of database vendors like EnterpriseDB, MS, Amazon (SimpleDB), and Google (Bigtable).
Friday, 2008-05-23, Copenhagen
Microsoft and Yahoo are in ANOTHER talk after Yahoo's rejection of MS's proposed deal. Now MS is looking for a way to not totally acquire the search brand and still find a better way of making things work. Yahoo's board has received complaints after the rejection that they were not considering the deal in a rational way. Let's just wait and see what Google says about this.
Google is reaching another scenario in the search-engine market. Quite recently the search giant unveiled its health-related website, Google Health. This can be very good news for most people who do not know much about how to find good medical information on Google. What can make it difficult for Google is that people may just want to find a good medical answer via the search engine instead of going to their doctors. Oh, one more thing: there is a privacy issue if people start to put their health information on their Google Health profiles.
Tuesday, May 20, 2008
Notes for reading "The Data Warehouse Toolkit," Chapter 2
The following topics are described in this chapter.
1. The “four step dimensional design process”
According to this book, there are four basic steps in dimensional modeling: first, select the business process; second, define the granularity of the data; third, find the dimensions; and fourth, find the facts. We discuss each in the following.
1.1. To select the business process
The basic step of data modeling is to think about the scope and influence of the new model. One principle is to avoid publishing the same data multiple times. This book discusses having the model per business process rather than per department. The idea behind this is that, if you have models per department, the same data may have to be published more than once, as different business processes may require similar data from the same department. However, it is not enough to just have one data model per business process; one still has to reach the step of consolidating these business processes into a unified model at a certain point in the future. But I agree that, given most situations where data marts are required in a short time frame, focusing on a specific business process is the optimal way to go. However, one must take a serious step later to consolidate the “quickly-designed” data marts into a large data warehouse. That step, as I can see from my experience, is missing in most industry implementations.
1.2. To define the grain of the business process
The question is: “How do we define a single row in the fact table?” It is a question of how much detail we should delve into in the data model. And here is where we consider the balance between the scalability of the data and the performance requirements. In many cases, if one uses the most detailed data in the data model, then when there are new requirements for the model, it is very easy to adapt to them, as we already have all the data (otherwise, a major re-work has to be carried out). On the other hand, keeping the most detailed data brings extra maintenance cost (you have to pay people to “watch” and “fix” your disks to yield the best performance in order to satisfy the users).
This step also helps people to understand the content of the dimensions.
1.3. To define the dimensions
It is about finding and grouping the categories. The tricky part is: how can we make the grouping generic enough that it is very easy to scale the model to include more and different sets of data in the future?
1.4. To identify the facts
What are we measuring? There are some columns that are generated based on calculations over other columns. Do we need these extra columns that seem to be a waste of disk space? It depends on the performance requirements as well as on how often the data is used.
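The four steps above can be sketched for a hypothetical retail case as a tiny star schema in SQLite. The table and column names here are illustrative, not taken from the book: the chosen grain is one fact row per product, per store, per day, the dimensions are date, product, and store, and the facts are the additive measures.

```python
import sqlite3

# Hypothetical retail sketch. Grain: one fact row per product, store, and day.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, weekday TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT);
-- Fact table: surrogate keys into the dimensions plus additive measures.
CREATE TABLE fact_sales (
    date_key INTEGER, product_key INTEGER, store_key INTEGER,
    quantity_sold INTEGER, sales_amount REAL
);
""")

cur.execute("INSERT INTO dim_date VALUES (20080623, '2008-06-23', 'Monday')")
cur.execute("INSERT INTO dim_product VALUES (1, 'Beer', 'Beverage')")
cur.execute("INSERT INTO dim_store VALUES (1, 'Copenhagen')")
cur.execute("INSERT INTO fact_sales VALUES (20080623, 1, 1, 12, 48.0)")

# A typical dimensional query: total sales amount per product category.
cur.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
""")
print(cur.fetchall())  # [('Beverage', 48.0)]
```

Notice that every business question becomes a join from the fact table out to one or more dimension tables, which is exactly why the choice of grain in step 2 is so important.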
2. A case study on the retail business
I have the following reflections from the case study.
First, it is always necessary to start by understanding, from the business point of view, what exactly is required. There are many business processes in a case, but choosing the right one is the most important thing.
Second, when you have defined the grain of the data, the dimensions seem to be self-clarifying. Just read and analyze the definition of the grain and use the terms in that definition, for example, “product,” “store,” “promotion.” And remember that date is always a dimension in all data warehouses (otherwise you would not call it a data warehouse).
Third, in the design of fact tables, adding an additive measure is a kind of act of belief: you know it will bring storage cost but still decide to do it. In fact, it depends on how the value of the measure is used and how that piece of the architecture is designed to let people use the data.
Fourth, as long as you can do the join operation, it is a good idea to have the date dimension. You cannot rely on special functions to tell you when a particular public holiday is in a certain country. They have to be documented in the date table.
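A date dimension is usually generated once as a simple table, one row per calendar day, carrying attributes that no built-in date function can supply (such as country-specific holidays). Here is a minimal sketch; the holiday list and column names are made-up examples:

```python
from datetime import date, timedelta

# Hypothetical holiday list; in practice this is maintained per country.
HOLIDAYS = {date(2008, 12, 25): "Christmas Day"}

def build_date_dimension(start, end):
    """One row per calendar day, with attributes a SQL function cannot supply."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # e.g. 20081225
            "full_date": d.isoformat(),
            "weekday": d.strftime("%A"),
            "holiday": HOLIDAYS.get(d, "Non-holiday"),
        })
        d += timedelta(days=1)
    return rows

rows = build_date_dimension(date(2008, 12, 24), date(2008, 12, 26))
print(rows[1]["holiday"])  # Christmas Day
```

The table stays small even over decades (a few tens of thousands of rows), so documenting the holidays explicitly costs almost nothing.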
Fifth, I think what has been misleading in the book is that you should not always add as many attributes to the dimension tables as possible. In the physical world, there is a limitation on row size, and by adding more columns to each row, query performance gets worse. One has to consider seriously whether to add a new column or to find ways of doing the calculation without adding anything at all.
Sixth, apparently only the facts that “happened” are recorded in the fact table. What about those that did not “happen”? For example, what are the products that are in promotion but are not sold in any amount? There can be multiple ways to cope with this request. Either you run a query to find those that are sold and then filter out the unsold, or you add a column to the Sales fact table with a lot of “0” values in it. Because the things that did not “happen” also have hierarchies, what the book suggests is to add another fact table to carry such information. The difference with this fact table is that it does not have to be on the same grain as the Sales table. By doing this, you can record the information in a way that is easy to query and save disk cost by rolling up the grain. Note that this fact table may just look like a many-to-many table without facts (a factless fact table).
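The factless-fact-table idea can be shown in a few lines. In this made-up sketch, each tuple in the coverage table just records that a product was on promotion on a given day (no measures at all), and the “promoted but unsold” question becomes a simple anti-join against the sales facts:

```python
# Rows of a factless "promotion coverage" fact table: (product_key, date_key).
# The keys are hypothetical; a real table would reference real dimension rows.
on_promotion = {("P1", 20080623), ("P2", 20080623), ("P3", 20080624)}

# Rows of the sales fact table, keyed the same way, with quantity as the measure.
sales = {("P1", 20080623): 5, ("P3", 20080624): 2}

# Promoted but unsold: present in the coverage table, absent from the sales facts.
unsold_promoted = sorted(k for k in on_promotion if k not in sales)
print(unsold_promoted)  # [('P2', 20080623)]
```

In SQL this would be a LEFT JOIN from the coverage table to the sales fact table, keeping the rows where no sale matched.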
Seventh, regarding degenerate dimensions, it is OK and normally allowed to have them. In the example given by the book, the “transaction number” is just added to the fact table (and there can actually be a transaction dimension to record more things if needed).
Eighth, what brings the most difficulty to dimensional design is not the initial requirements, but the new requirements after the initial design has been established. It takes a modeling team much more time just to consider how to add the new requirements to the existing star-schema model. If new fields are added, what happens to the corresponding values of those records that are already there? Should each of those records always have a “not available” value in the new fields? This is a place where best practices should be collected. And I do not believe that there is a single method that can solve all the problems. The best practices have to be selected and used case by case.
Ninth, what should be borne in mind by dimensional modelers is that the nature of the dimensional model is for business users to be able to query, not for systems to update efficiently. So this model is not and should not be highly normalized, and snowflaking is in fact a risk for dimensional models. In addition to bringing extra cost for joining the different tables, you cannot even create bitmap indexes on the necessary fields. Business users will also find it difficult to see through the snowflake tables because there is too much snow.
Tenth, I do not quite like the claim that “most business processes can be represented with less than 15 dimensions in the fact table.” Perhaps Kimball is right, but an exact value brings nothing but difficulty in a modeler's struggle with managers who do not know much about modeling but just want things done the way the book says.
3. Surrogate Keys
Surrogate keys bring many benefits to dimensional models, and please bear in mind that it is best if one can just use small integers for the surrogate keys (instead of writing a hashing function). Using surrogate keys also means that data warehouse people do not have to rely on the operational system people for the natural keys. The situation gets even worse when the people running the operational systems decide to recycle account numbers or product numbers that have been inactive for a certain period of time (but the data warehouse's life is much longer than this period).
Sometimes you just cannot rely on the natural keys to keep the data clean. For example, a product number plus a date can be a combined natural key. But the value of the “date” field may be unknown for a period; what do we do with the records during this period? How do we identify them?
One deep performance problem with a star-schema model is the cost of join operations. So one wise idea is to use single keys, like surrogate keys, instead of compound or concatenated keys, because with the latter you have to join on multiple columns, and that is much more complicated.
4. Market Basket Analysis
What is interesting in this part, is the book shows how to use the star-schema model for analysis purposes. The idea is to generate regular reports based on a join operation between the fact table and several dimensional tables. Actually I would think an OLAP tool will make this work much easier.
1. The “four step dimensional design process”
According to this book, there are four basic steps in dimensional modeling: first, select the business process; second, declare the grain of the data; third, identify the dimensions; fourth, identify the facts. We discuss each in the following.
1.1, to select the business process
The basic step of data modeling is to think about the scope and influence of the new model. One principle is to avoid publishing the same data multiple times. The book argues for building the model per business process rather than per department: if models are built per department, the same data may have to be published more than once, because different departments often require similar data from the same business processes. However, one data model per business process is not enough; at some point in the future, one still has to consolidate these business-process models into a unified model. I agree that, given that data marts are usually required on a short time frame, focusing on a specific business process is the optimal way to start. But one must take the serious later step of consolidating the "quickly-designed" data marts into a large data warehouse, and that step, as far as I can see from my experience, is missing in most industry implementations.
1.2, to define the grain of the business process
The question is: "what does a single row in the fact table represent?" It is a question of how much detail we should go into in the data model, and this is where we balance the scalability of the data against the performance requirements. In many cases, if one keeps the most detailed data in the data model, new requirements are easy to accommodate because all the data is already there (otherwise, a major re-work has to be undertaken). On the other hand, keeping the most detailed data brings extra maintenance cost (you have to pay people to "watch" and "tune" your disks to yield the performance that satisfies the users).
This step also helps people to understand the content of the dimensions.
1.3, to define the dimensions
This step is about finding and grouping the categories by which the facts will be described. The tricky part is: how can we make the grouping generic enough that it is easy to scale the model to include different sets of data in the future?
1.4, to identify the facts
What are we measuring? Some columns are derived by calculations over other columns. Do we need these extra columns, which seem to be a waste of disk space? It depends on the performance requirements as well as on how often the derived data is used.
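The trade-off can be made concrete with a tiny sketch. This is not from the book; the column names are illustrative: a derived measure such as an extended sales amount can either be precomputed once by the ETL process (costing disk) or computed at query time (costing CPU on every query).

```python
# Sketch: a derived fact can be stored in the fact row or computed at
# query time. Column names here are illustrative, not from the book.

def extended_amount(quantity, unit_price):
    """Compute the derived measure on the fly (query-time variant)."""
    return quantity * unit_price

# Stored variant: the ETL process precomputes the column once.
fact_row = {"quantity": 3, "unit_price": 2.50}
fact_row["extended_amount"] = extended_amount(
    fact_row["quantity"], fact_row["unit_price"]
)

print(fact_row["extended_amount"])  # 7.5
```

If the measure appears in almost every report, precomputing it once is usually cheaper than recomputing it on every query.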
2. A case study on the retail business
I have the following reflections from the case study.
First, it is always necessary to start by understanding, from the business point of view, what exactly is required. There are many business processes in any case, and choosing the right one is the most important thing.
Second, once you have defined the grain of the data, the dimensions seem almost self-evident. Just read and analyze the definition of the grain and use the terms in that definition, for example, "product," "store," "promotion." And remember that date is always a dimension in a data warehouse (otherwise you would not call it a data warehouse).
Third, in the design of fact tables, adding an additive measure is something of an act of faith: you know it will bring storage cost but decide to do it anyway. In practice, the decision depends on how the value of the measure is used and on how that piece of the architecture is designed to let people use the data.
Fourth, as long as you can afford the join operation, it is a good idea to have a date dimension. You cannot rely on special functions to tell you when a public holiday falls in a certain country; holidays have to be documented in the date table.
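A minimal sketch of such a date dimension, with holidays recorded as data rather than computed by a function (the attribute names and sample holidays are my own illustration, not the book's):

```python
import datetime

# Holidays cannot reliably be derived by a function; they are documented
# as an attribute of each row in the date dimension.
HOLIDAYS = {
    datetime.date(2008, 6, 5): "Constitution Day (Denmark)",
    datetime.date(2008, 12, 25): "Christmas Day",
}

def build_date_dimension(start, days):
    """Generate one row per calendar day, each with a surrogate key."""
    rows = []
    for offset in range(days):
        d = start + datetime.timedelta(days=offset)
        rows.append({
            "date_key": offset + 1,          # small-integer surrogate key
            "full_date": d,
            "day_of_week": d.strftime("%A"),
            "month": d.strftime("%B"),
            "holiday_name": HOLIDAYS.get(d),  # None on ordinary days
            "is_weekend": d.weekday() >= 5,
        })
    return rows

dim_date = build_date_dimension(datetime.date(2008, 6, 1), 30)
print(dim_date[4]["holiday_name"])  # Constitution Day (Denmark)
```

In a real warehouse this table is populated once, far into the future, and every fact row joins to it through the `date_key` column.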
Fifth, I think one misleading point in the book is the suggestion that you should add as many attributes to the dimension tables as possible. In the physical world, there is a limit on row size, and adding more columns to each row degrades query performance. One has to seriously consider whether to add a new column or to find a way to do the calculation without adding anything at all.
Sixth, apparently only the facts that "happened" are recorded in the fact table. What about those that did not happen? For example, which products were on promotion but did not sell a single unit? There are multiple ways to answer such a request: you can run a query to find the products that were sold and then filter them out of the promoted set, or you can add rows to the Sales fact table carrying a lot of "0" values. Because the non-events also have hierarchies, what the book suggests instead is to add another fact table to carry this information. The difference is that this table does not have to be at the same grain as the Sales table, so you can record the information in a way that is easy to query while saving disk cost by rolling up the grain. Note that this table may look like a many-to-many table without any facts (a "factless" fact table).
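A sketch of the idea using in-memory SQLite. The table and column names are my own illustration: a promotion-coverage "factless" table records which products were on promotion, and an outer join against the sales fact reveals the promoted-but-unsold ones.

```python
import sqlite3

# Promotion coverage (factless) next to a sales fact table.
# Schema names are illustrative, not taken from the book.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE promotion_coverage (product_key INT, promo_key INT);
    CREATE TABLE sales_fact (product_key INT, promo_key INT, quantity INT);

    -- Products 1, 2 and 3 were covered by promotion 10 ...
    INSERT INTO promotion_coverage VALUES (1, 10), (2, 10), (3, 10);
    -- ... but only products 1 and 2 actually sold.
    INSERT INTO sales_fact VALUES (1, 10, 5), (2, 10, 2);
""")

# Promoted-but-unsold products: coverage rows with no matching sale.
unsold = con.execute("""
    SELECT c.product_key
    FROM promotion_coverage c
    LEFT JOIN sales_fact s
      ON s.product_key = c.product_key AND s.promo_key = c.promo_key
    WHERE s.product_key IS NULL
""").fetchall()
print(unsold)  # [(3,)]
```

The coverage table carries no measures at all; its rows exist purely so that the non-events become queryable.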
Seventh, regarding degenerate dimensions, it is fine and quite normal to have them. In the example given by the book, the "transaction number" is simply added to the fact table (and, if needed, a transaction dimension can be created to record more details).
Eighth, what brings the most difficulty to dimensional design is not the initial requirements, but the new requirements that arrive after the initial design has been established. It takes a modeling team much more time just to work out how to fit the new requirements into the existing star-schema model. If new fields are added, what happens to the corresponding values of the records that are already there? Should each of those records simply carry a "not available" value in the new fields? This is an area where best practices should be collected, and I do not believe there is a single method that solves all the problems; the best practices have to be selected and applied case by case.
Ninth, what should be borne in mind by dimensional modelers is that the dimensional model exists for business users to query, not for systems to update efficiently. So this model is not, and should not be, highly normalized, which means snowflaking is in fact a risk for dimensional models. Besides the extra cost of joining the additional tables, you cannot even create bitmap indexes on the necessary fields, and business users will find it difficult to see through the model when there are too many snowflake tables.
Tenth, I do not much like the claim that "most business processes can be represented with less than 15 dimensions in the fact table." Perhaps Kimball is right, but an exact number brings nothing but difficulty in a modeler's struggle with managers who do not know much about modeling and just want things done the way the book says.
3. Surrogate Keys
Surrogate keys bring great benefit to dimensional models, and bear in mind that it is best to simply use small integers for them (instead of writing a hashing function). Using surrogate keys also means that the data warehouse people do not have to rely on the operational-system people for the natural keys. The situation gets even worse when the operational-system people decide to recycle account numbers or product numbers that have been inactive for a certain period (while the data warehouse's life is much longer than that period).
Sometimes you simply cannot rely on the natural keys to keep the data clean. For example, a product number plus a date can form a compound natural key, but the value of the "date" field may be unknown for a period. What do we do with such records during this period, and how do we identify them?
One deep performance problem with a star-schema model is the cost of the join operations. So one wise idea is to use single-column keys, like surrogate keys, instead of compound or concatenated keys, because with the latter you have to join on multiple columns, which is much more complicated.
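The points above can be sketched as a small key-lookup step in the ETL process (a hypothetical helper of my own, not an API from any tool): each natural key, however messy or compound, is mapped once to a small integer, so that the fact table joins on a single clean column.

```python
# Sketch of a surrogate-key lookup as an ETL step: operational (natural)
# keys are mapped to small integers once. Names are illustrative.

class SurrogateKeyGenerator:
    """Assigns a new small integer the first time a natural key is seen."""
    def __init__(self):
        self._keys = {}

    def lookup(self, natural_key):
        if natural_key not in self._keys:
            self._keys[natural_key] = len(self._keys) + 1
        return self._keys[natural_key]

products = SurrogateKeyGenerator()

# Even a messy compound natural key (product number + possibly unknown
# date) collapses to one clean integer column in the fact table.
k1 = products.lookup(("P-1001", "2008-06-30"))
k2 = products.lookup(("P-1001", None))       # date unknown: still keyed
k3 = products.lookup(("P-1001", "2008-06-30"))

print(k1, k2, k3)  # 1 2 1
```

Because the mapping lives in the warehouse, recycled operational keys can simply be given fresh surrogate keys without disturbing the history already loaded.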
4. Market Basket Analysis
What is interesting in this part is that the book shows how to use the star-schema model for analysis purposes. The idea is to generate regular reports based on a join between the fact table and several dimension tables. Actually, I would think an OLAP tool makes this work much easier.
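The report-generating join can be sketched with in-memory SQLite (the schema and data are my own toy example, not the book's): the fact table joins to two dimension tables and the measures are aggregated by the dimension attributes.

```python
import sqlite3

# A classic star join behind a regular report. Schema names are
# illustrative, not taken from the book.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store (store_key INT PRIMARY KEY, city TEXT);
    CREATE TABLE sales_fact (product_key INT, store_key INT, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Coffee'), (2, 'Tea');
    INSERT INTO dim_store VALUES (1, 'Copenhagen'), (2, 'Aarhus');
    INSERT INTO sales_fact VALUES (1, 1, 9.0), (2, 1, 4.0), (1, 2, 3.0);
""")

# Revenue by product and city: fact table joined to both dimensions.
report = con.execute("""
    SELECT p.name, s.city, SUM(f.amount) AS revenue
    FROM sales_fact f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(report)
```

An OLAP tool essentially automates exactly this pattern, caching the aggregates so analysts can slice by any dimension attribute without writing the join by hand.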
What's new in week 20, 2008
Wednesday, 2008-05-14, Copenhagen
What can we guess from Microsoft's failure to acquire Yahoo? One possible answer is that Microsoft has to try harder on cloud computing to preserve its success in the enterprise market. Let us be frank: MS is not a good player in the online-services market, while Google started in this market (and is among the leaders now). Given the assumption that online services (ideas like SaaS and cloud computing) will be the next market trend for enterprise IT, MS has to make a solid move in order to take the lead in that direction (which is widely viewed as a great piece of cake). Without Yahoo's reputation, MS has to find another way to attract more online users' eyeballs.
Online virus protection is one of the biggest trends in the security market. Given that most virus infections (and security breaches) happen through Internet protocols, this market is indeed a great place for innovation. McAfee and Yahoo just announced the beta launch of a new technology, termed SearchScan, to make Yahoo search much safer by warning of dangerous sites before the user even clicks on them. Wait a minute: isn't it true that when a site comes up in a Google search, there is already a warning if the site is believed to contain malicious code? So there is more than one player in this scenario.
Friday, 2008-05-16, Copenhagen
Quite a few major Internet players, like Microsoft and Yahoo, are trying to add profile sharing to their service portfolios. The idea is that once a user has an ID and profile created on one of their websites, the information can be shared with other sites with one mouse click. The convenience is straightforward, but many people are concerned that this will cause a certain amount of privacy leakage (just like what happened to Facebook.com).
Good news for VW fans! Volkswagen and Sanyo are working on a joint venture to deliver batteries for hybrid and electric cars. According to their plan, VW aims to start producing cars with the batteries by 2012 (which is not far away). It is a bit difficult to imagine what the battery will be like, given that VW cars are normally heavier than those of their competitors.