SQL, BI, and Information Management: 2011

Friday, December 30, 2011

Resizing Calendar webpart in MOSS 2003 and MOSS 2007

I ran into this problem when trying to drag a calendar webpart to a MOSS 2003 page (yeah, it's MOSS2011 now, but someone still prefer to use MOSS2003). To resize the Calendar webpart, I found a very excellent post at : http://blog.pathtosharepoint.com/2008/10/06/tiny-sharepoint-calendar-1/.

The key part of the solution is to use the Content-Editor webpart and try to overwrite the default style-sheet for the calendar webpart.

Running WordCount example in Hadoop installation at cygwin

I couldn't find a spare PC to install Ubuntu. So a cygwin on my Windows PC seems to be the only solution for running Hadoop (I did try to install Ubuntu desktop on the same PC, and then there is a wireless adapter error). I followed most of the steps that one can find out on Google. To me, the most helpful notes come from http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/

My setup: Windows 7 professional (32 bit), Dell Studio. And I have a single node Hadoop installation on the cygwin environment.

I think most of the steps described in http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/ are still quite correct. But when setting up $JAVA_HOME, I would suggest to use the Environment Variables settings in the Windows environment. The setup in conf\hadoop-env.sh did not work for me. And I also set the $PATH variable to point to the Java/jdk1_..../bin folder in the Windows environment.The reason is that the javac and java commands running in cygwin are actually based on the installed Java SDK in windows.

After the setup, my next issue was to run the WordCount example. There were two parts. Running the WordCount.jar seems to be easy and successful. However, when I tried to use javac to compile the source code of WordCount.java, I met many errors, such as "package cannot be found...".

After a few hours of struggle, I found out the answer. The issues come from the classpath for Windows environment (when you run the command in cygwin). Here is the final solution for running the javac command in cygwin bash.

javac -classpath "C:\cygwin\hadoop-0.20.2\hadoop-0.20.2-core.jar;C:\cygwin\hadoop-0.20.2\lib\commons-cli-1.2.jar" -d playground/classes playground/src/WordCount.java

Another thing to remember is to make sure that you do have java 1.6.x installed in Windows (at the time of this note, the hadoop version I used is 0.20.2). Anything in Java1.7 will not work.

Thanks to the many discussion about this issue even though they did not give direct help but just some ideas. Hopefully this note can be helpful for someone else on the same road :)

Monday, April 4, 2011

Does SCRUM fit for data warehousing?

There have been a broad discussion on how agile concept fits into the landscape of data warehousing. Considering the very special nature (well, many things are special, as I agree) of data warehousing, concepts like SCRUM seem to be beneficial for running projects in the data warehousing context. However, can all data warehousing activities fit into a sprint?

Well, I do agree that there are ways to break-down the activities into smaller steps so that they fit into sprints. However, the different life cycle statuses of a data warehouse may imply different tactics of implementing SCRUM concept.

When you work with a matured platform, where more than 80% percent of the major data subject areas, such as people, organization, employee, customer, products & services, have been populated in DW and the need to adding new data sources has been disappearing over the last 2 years, it is time to consider using SCRUM to manage the control the development activities around the DW. Why? Because the data model, the ETL, and the different rules, guidelines are getting matured. People are used to the way that things are supposed to be. So it is very easy to estimate what activities should fit into each sprint.

If we are in a situation where less than 20% of the data warehouse data is populated from source systems, it seems very challenging to consider using SCRUM. In such a status, the DW team are still struggling on the rules and ways of working. Substantial re-works are appearing on weekly or daily basis. In such a case, trying out any agile methods can be risky unless the SCRUM team has all key technical developers enrolled and gets the full management support (in case of re-works).

What if you are in-between these two states (21%-79% of data is populated)? I would be very careful with what has been populated in DW. If the majority of the enterprise master data, such as Customer, Products, Organization, Arrangements, has been ready, it is safe to consider SCRUM by including key technical developers during the process. Otherwise, consider a more classical DW approach.

By the way, this book may look interesting to read.

Saturday, April 2, 2011

Is data valuable for an enterprise?

There is a recent post on Information Management where people questioned and debated about the value of data itself to an organization.

As a general assumption, most enterprises value data as "valuable asset" for themselves. Whatever is kept in their IT systems is useful for creating business benefits. However, people always have to put data into certain context, such as a business process like "sales," "marketing," and "credit profiling", in order to create the value based on these data. On that sense, the data itself does not seem to be a valuable asset until the moment it is dragged into the context.

Well, this is an interesting observation or argument. But I would definitely question what is "data" to an enterprise. When a customer record is written in the database table at the IT system, this piece of data is already put into a context, i.e., the business definition of a customer to this business organization, and the business rules applied to this customer. And since this moment, this piece of data is creating value for the business. How? There can be few reasons.

1. The customer record is in fact the hard evidence that a business transaction happened. This transaction information lets the business organization have a legal way to protect its business. When a customer comes and ask for a refund or modification of certain product or services, the data kept in the system is a legal protection of what should or should not be considered. One could argue that I am using a certain context in this example. However, I would then ask when and how can you find any piece of data which has nothing to do with the business on any context? If that happens, why would you need data into the IT system?

2. In certain business domains, such as banking and health-care, keeping the data (I mean archiving) is a necessary need from a regulatory and compliance perspective. If an organization in these domains cannot fulfill this requirement, its business has to stop.

So, data in an enterprise is definitely a valuable asset. Why? Because you will lose your business if you do not keep data.

Friday, April 1, 2011

What is "One Version of Truth” for a Data Warehouse

One version of truth or single version of truth (svot) has been a popular vision for data warehouse developers in many years. Large organizations tend to put one version of truth as a major milestone in the implementation of a centralized data warehouse. So, what does "one version of truth" mean? Is it achievable?

First of all, one version of truth means that all enterprise data are consolidated into a single data warehouse. The data is kept in a consistent and non-redundant way such that all data coming out from the data warehouse should be understood as the enterprise's common view of information. For example, if there are minor differences in the organization hierarchy data in different business area, the data from the the data warehouse should be considered as the correct, commonly-accepted, and enterprise-wide agreed organization hierarchy.

In many cases, the interpretation of single version of truth is extended so that one can have a "federated" view of different versions of truth in the single data warehouse. This implies that the data warehouse tends to give the governance of certain business logic away in order to maintain the view of “centralization" in a technical level. I think Malcolm Chrisholm's article "there is no single version of the truth" is in fact quite valid in the real data warehouse world. The only way to achieve a "single version of the truth" is to have the agreement and governance process on the business level.

However, when you manage to let all business to agree on almost everything of the organization's data, are we losing the power of being different and being able to think out of the box? The basic nature of a successful business is that people try to work out innovative ways and new concepts or new views of old things.

In my point of view, one version of truth is achievable in certain sectors, such as the military section and the public sector. In an enterprise where it is important to make business innovations and improvements, one version of truth sounds more like a thing in the road map... ...

Thursday, March 31, 2011

SharePoint, Google Sites, and Enterprise 2.0

After a while of struggling, I have to admit that SharePoint is going to take the lead in Enterprise 2.0 world. The challenges are not just about having a good collaboration platform, but a platform that integrates email, IM, calendar&schedule, and, most important of all, business intelligence. By the way, I would expect something more promising when Windows Mobile is tightly integrated with SharePoint.... The future enterprises will go mobile and MS is definitely the closest one to provide a fully-integrated solution.

On that sense, Google is the runner-up in the game at this moment. I believe that Google will take on the Cloud-computing as its main architecture. The mind-set or belief of enterprise users will basically set the business future for Google in this market. And I am not quite optimistic for this. Only SMEs and private users have the courage to take on the challenges. An incident such as information leakage can basically kill the belief of cloud computing technology in the market.

Where is IBM? Well, let's cross our fingers and see when Lotus Notes'time is completely over. I do believe that IBM is not able to take on the game if they do not have a serious re-consideration of their extensive use of java.

Any one else in the game? Well, we constantly see so many different tool-sets coming up (especially in the BI world :) that brings forward an organization in certain perspective. But everyone will soon find out that the cost of maintaining these different tools and their integrations will basically makes up their mind to migrate to a "single platform" strategy.

Wednesday, January 19, 2011

Code Canvas improves knowledge transfer in codes and models

Software developers, programmers, and architects share the same habit of diagramming on white-boards. Lots of actual codes are developed based on the ideas implied in these diagrams or graphs. In fact, these visual elements become an essential part when the code or system is going to be made available to other people, users, new developers, customers, etc. Traditionally, software development involves the modelling step before and around the coding in order to capture the meta-knowledge. How about some efforts to make these steps easier and faster?

The Code Canvas prototype in MS research lab shows an interesting step toward an automation of creating code maps which eases the understanding of complex code structures and class diagrams. As shown in Kael Rowan's blog entry (here), the Code Canvas seems to be a future add-on to Visual Studio. As mentioned in an article at ACM COMMUNICATION ("Software Development with Code Maps," Vol. 53, No. 08, 2010), having a decent code map makes it much faster for newcomers to adapt into the project and complete their tasks.