Tuesday, April 17, 2012

Initial Impression of Metadata Tools in Information Server 8.7

Just got my hands on a sandbox installation of Information Server 8.7. After going through the InfoCenter and the Redbook on metadata management, I have formed an initial view of the new metadata offerings in Information Server 8.7.

Just to add a bit of background: I have been using Information Server 8.1.x for the past four years.
  • In InfoSphere Business Glossary, the BG "Editor" and BG "Browser" interfaces have been merged into a single user interface, which is a good improvement on the user-experience side. The new UI looks more "modern" and gives an easy overview of terms, categories, and IT assets. In addition, Business Glossary has added a workflow feature that enables an "approval workflow" for managing BG terms. This workflow is disabled in the default installation. 
  • The meta-model in the Metadata Server has been extended with a few new elements. Among others, the most exciting one to me is the Logical Data Model. In 8.1.x, only physical data models (database table schemas) could be imported into the Metadata Server. Now a logical data model created in InfoSphere Data Architect, including the documentation written in IDA, can be imported into the Metadata Server. This is a very good step towards making BG and Metadata Workbench more widely accepted and used by both business users and IT developers in enterprises. 
  • Metadata Workbench looks similar to the previous versions. The "Administration" tab in Metadata Workbench has an improved interface. And as I have heard (not tried myself yet), the data lineage and business lineage functionality has been improved a lot, as have the automated metadata services (which were a pain in 8.1.x).
  • A new tool, "Metadata Asset Management," seems to provide an excellent way of administering metadata in an enterprise setup. It allows users to import and compare different types of metadata, such as data models, data files, and BI reports. As described in the Redbook (link), one can use this tool as the foundation for designing an enterprise metadata environment covering all kinds of metadata related to the data integration process. 
  • istool seems to keep the same functionality as before and adds "reporting assets" to the types that can be imported/exported. 
I have also heard that Business Glossary provides more ways of integrating with other tools. And the "Blueprint Director" is a very interesting architecture tool. Hopefully I will post more findings as I do more work with this sandbox in the coming days :)

Monday, April 9, 2012

The Need for an Overview of Enterprise Data Lineage

The concept of data lineage has proliferated in the Information Management world due to the need for more detailed, quality-focused metadata about the data flows and data deliveries across enterprise IT systems.

In a large enterprise where regulatory reporting, dashboards, scorecards, and analytics are widely used, different IT systems need to deliver their data to fulfill the various purposes of business intelligence applications. As a best practice, such an enterprise will typically have established a data warehouse team to maintain a central repository of all kinds of data. The use of a data warehouse makes the architecture of enterprise data flows quite simple: the warehouse is a centralized hub that collects all kinds of input and delivers outputs to various parties. In such a case, managing data flows and data deliveries across the enterprise seems simple (although still very challenging) because all management focus can be put around the data warehouse. Data lineage in this kind of architecture is focused on the lineage of data before and after the data warehouse.

A very good example is the data lineage tool inside IBM InfoSphere Metadata Workbench. In the data lineage tool, a source delivery file can be connected to the ETL job that takes this file into a data warehouse table. That table can then be connected to the ETL job that loads the data warehouse table into the data mart tables, which are in turn extracted and loaded into BI reports. With a naive (yes, not native but naive) support for defining external applications and extension mappings, the data lineage can be (manually) extended to external systems and tools to cover the whole data life cycle, i.e., starting from the front-end system where the data was initially created, through the deliveries, the data warehouse, the data mart, and the BI applications, not to mention the various ETL jobs, data quality checks, etc., along the way of this data-to-information flow. 

The data lineage functionality in InfoSphere Metadata Workbench gives a good example of how enterprise users can benefit from having such a view across different architectural elements. Here is a short list:
  1. The traceability of technical elements brought by the data lineage functionality can reduce the cost of system maintenance and support. For example, a developer in the system management group can easily find all the tables and jobs that are related to a piece of "problematic" data. 
  2. The cost of doing impact analysis can be dramatically reduced if data lineage information is available. Whenever a change is to be applied to an architectural element, being able to estimate the impact, in terms of cost or even feasibility, can save a great deal of budget (see the sketch after this list). 
  3. The capability of aligning architecture designs and considering optimal refactoring options can be further strengthened if a detailed data lineage map is available. It is much easier for an IT architect to weigh design options if the relevant data flows, together with performance and throughput data, are presented on a diagram (which makes the IT architect's job more or less similar to a construction architect's work).
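To make the impact-analysis point (item 2) concrete, here is a minimal sketch in Java: a breadth-first walk over a lineage graph that collects every asset downstream of a changed one. The asset names and the in-memory map are invented for illustration; in practice the lineage would come from a metadata repository such as the Metadata Server, not be hard-coded.

    import java.util.*;

    public class ImpactAnalysis {

        // Walk the lineage graph breadth-first and collect every asset
        // that sits downstream of the changed one.
        public static Set<String> downstreamOf(String asset,
                                               Map<String, List<String>> lineage) {
            Set<String> impacted = new LinkedHashSet<String>();
            Deque<String> queue = new ArrayDeque<String>();
            queue.add(asset);
            while (!queue.isEmpty()) {
                String current = queue.remove();
                List<String> targets = lineage.get(current);
                if (targets == null) continue;   // leaf asset, e.g. a BI report
                for (String next : targets) {
                    if (impacted.add(next)) {    // visit each asset only once
                        queue.add(next);
                    }
                }
            }
            return impacted;
        }

        public static void main(String[] args) {
            // Hypothetical lineage: delivery file -> ETL job -> DW table -> mart -> report
            Map<String, List<String>> lineage = new HashMap<String, List<String>>();
            lineage.put("CUSTOMER.csv",      Arrays.asList("LoadCustomerJob"));
            lineage.put("LoadCustomerJob",   Arrays.asList("DW.CUSTOMER"));
            lineage.put("DW.CUSTOMER",       Arrays.asList("MART.CUSTOMER_DIM"));
            lineage.put("MART.CUSTOMER_DIM", Arrays.asList("CustomerChurnReport"));

            // Everything that must be checked if the delivery file changes its layout:
            System.out.println(downstreamOf("CUSTOMER.csv", lineage));
        }
    }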
After seeing the benefits of data lineage, we need to come back and ask whether we need an overview of the whole data lineage across the enterprise IT systems, instead of only the lineage around the data warehouse. 

Typically, most data needs can be fulfilled by requesting a data delivery from the data warehouse to a business intelligence application. But in many enterprises, lots of IT systems act as a "hub" for a certain group of enterprise data and generate outputs by adding "value" to the data. For example, a CRM system or an accounting system normally needs to take data such as customers, branches, and agreements from other systems and then create output data with more "corrected," "aligned," or "calculated" information. From an enterprise point of view, the output from such "intermediate" systems is also a main data source for the data warehouse. And the data flows from and to these systems are equally or even more "mission-critical" than those around the data warehouse.

Another example is the case where there is a strong need for "operational reporting" or "operational business intelligence" in a real-time or "near real-time" manner. In such cases, an event-based architecture is used to send data directly to the business intelligence applications. No data warehouse is involved here (except that an end-of-day summary may be delivered to the data warehouse through a batch window). Such data flows often require great architectural care due to their mission-critical purpose.

So, it does make a lot of sense to maintain an overview of the whole enterprise data lineage on top of all IT systems, not just around the data warehouse. As of now, and to my knowledge, there is no software product that can fulfill this need completely. We need to cross our fingers and see what the vendors (I expect IBM, ASG, and maybe some start-ups) can bring to the table in the future.

Friday, April 6, 2012

How to start the implementation of InfoSphere Business Glossary

InfoSphere Business Glossary (IBG or BG) is a very useful "data concept management tool" for connecting all users of data in an enterprise. BG is part of the IBM Information Server tool suite and is intended to be used as an enterprise glossary of business concepts and terms. With the metadata interchange support and the Information Server Metadata Server, a business term described in BG can be connected to a technical asset such as a column in a table, an ETL job that populates a table, a field in a Business Intelligence report, or a flat file contained in a delivery between enterprise applications. Another key feature of BG is the use of data stewards. A data steward (a person, role, or group) can be assigned as the owner of a business term defined in BG. This provides an opportunity to create an enterprise data governance structure: the ownership concept creates enterprise-wide awareness of data and data quality.
 
Implementing BG normally requires a team of allocated resources (e.g., a project team or a task force) to run an initial "enabling" phase: defining the work processes and actually implementing a substantial number of key business terms and concepts. The following is a list of key issues or topics that such a team must work on in order to create a successful initial implementation of Business Glossary.

1. Creating a data owner/steward role in the enterprise
    Apparently, defining terms without knowing whether they are correct or wrong does not make any sense at all. The enterprise must have employees who care about and use the data that a term describes. Designing a role like data steward does not need to cost the employee a huge amount of extra working hours. From a long-term perspective, the data owner can guard the data and its quality, which in return helps the enterprise make the right business decisions based on the reports, dashboards, and mining results that are built on that data.

2. Designing a basic category structure in BG
   Business Glossary is a very simple tool. All business terms are located in categories, and the categories form a tree structure. A term physically located in one category can be "referred to" by other categories, and a term can be connected to other terms as synonyms.
   Generally, a good structure in BG has top categories that classify the business concepts. For example, there can be a category called "Customer" which holds all the business terms and concepts regarding customers of the enterprise. If all the top categories in BG cover such major business conceptual areas, for example "Customer," "Branch," "Product," "Agreement," etc., any user in the enterprise will find it easy to browse around and understand the key business areas and business concepts. Besides, the search functionality in BG is good enough for any newcomer to look around. The sketch below illustrates this structure.
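Here is a minimal mental model of the structure just described, written as plain Java. The class and instance names are invented for illustration; this is not the actual BG data model, just a way to picture how categories, terms, references, and synonyms relate.

    import java.util.*;

    // Categories form a tree; each term lives in exactly one category,
    // other categories may refer to it, and terms can be linked as synonyms.
    class Term {
        String name;
        Set<Term> synonyms = new HashSet<Term>();
        Term(String name) { this.name = name; }
    }

    class Category {
        String name;
        List<Category> subcategories = new ArrayList<Category>();
        List<Term> terms = new ArrayList<Term>();           // terms physically located here
        List<Term> referencedTerms = new ArrayList<Term>(); // terms referred from elsewhere
        Category(String name) { this.name = name; }
    }

    public class GlossaryModel {
        public static void main(String[] args) {
            Category customer = new Category("Customer");
            Category product = new Category("Product");

            Term client = new Term("Client");
            customer.terms.add(client);           // physically located in "Customer"
            product.referencedTerms.add(client);  // referred to, not moved

            Term buyer = new Term("Buyer");       // synonyms point at each other
            client.synonyms.add(buyer);
            buyer.synonyms.add(client);
        }
    }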

3. Designing the work processes around BG
    Typically, when users start to use BG, there will be requirements on how to let data stewards edit the terms, how to manage the life cycle of terms (in BG, a term can be "standard," "accepted," "candidate," or "deprecated"), how to make sure that all content is backed up, and so on. There can be many more such questions for the team maintaining the BG content; the term life cycle in particular is worth spelling out (see the sketch after this item).
    In addition, when BG is used and maintained in a strict enterprise environment where there can be TEST, FREEZE, and PRODUCTION instances, the work processes become even more important.
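A minimal sketch in Java of the term life cycle as a tiny state machine. The four statuses come straight from BG; the allowed transitions are my own assumption for illustration, not the product's actual workflow rules:

    // The four statuses are BG's; the transition rules below are assumed
    // for illustration only.
    public enum TermStatus {
        CANDIDATE, ACCEPTED, STANDARD, DEPRECATED;

        public boolean canMoveTo(TermStatus next) {
            switch (this) {
                case CANDIDATE: return next == ACCEPTED || next == DEPRECATED;
                case ACCEPTED:  return next == STANDARD || next == DEPRECATED;
                case STANDARD:  return next == DEPRECATED;
                default:        return false; // deprecated terms stay deprecated
            }
        }
    }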

4. Marketing and organizational implementation
    A tool like Business Glossary does not contain any fancy or complicated functionality that would make all users in the enterprise suddenly start enjoying it and cheering every time they find it useful. The team must go out and communicate with almost everyone in the enterprise in order to get the tool off to a good start. Here, the data steward team should be the first group to get familiar (and comfortable) with the tool and start (with their charm) to influence other enterprise users as "Business Glossary advocates," until everybody likes it.

5. Starting to populate the terms
    Normally an enterprise has several key business concepts that are used throughout almost all activities and documentation, such as customer, internal units, employee, etc. It is very unlikely that one cannot find any employee who cares about the terms and concepts in these core conceptual areas. In other words, it should be possible, and probably easy, to define and describe terms and assign data stewards in these core areas. Dedicating a period and resources to populating terms in such core business conceptual areas has a huge positive impact on the success of Business Glossary.

The idea of having a Business Glossary is to use "crowd-sourcing" power to improve and align the organization's use of data and information. It paves the way for successful business intelligence implementations in the long run. Hopefully, we can expect more social and collaborative features from Business Glossary in the future.

Friday, December 30, 2011

Resizing Calendar webpart in MOSS 2003 and MOSS 2007

I ran into this problem when trying to drag a calendar webpart onto a MOSS 2003 page (yeah, it's 2011 now, but some people still prefer to use MOSS 2003). To resize the Calendar webpart, I found an excellent post at http://blog.pathtosharepoint.com/2008/10/06/tiny-sharepoint-calendar-1/.

The key part of the solution is to use a Content Editor webpart to override the default style sheet of the calendar webpart.

Running the WordCount example in a Hadoop installation on cygwin

I couldn't find a spare PC to install Ubuntu on, so cygwin on my Windows PC seemed to be the only option for running Hadoop (I did try to install Ubuntu desktop on the same PC, but ran into a wireless adapter error). I followed most of the steps that one can find via Google. To me, the most helpful notes come from http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/

My setup: Windows 7 Professional (32-bit) on a Dell Studio, with a single-node Hadoop installation in the cygwin environment.

I think most of the steps described in http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/ are still quite correct. But when setting up $JAVA_HOME, I would suggest using the Environment Variables settings in Windows itself; the setup in conf\hadoop-env.sh did not work for me. I also set the $PATH variable to point to the Java/jdk1_..../bin folder in the Windows environment. The reason is that the javac and java commands running in cygwin are actually backed by the Java SDK installed in Windows. 
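For example, with the JDK in its usual Windows location (the exact path below is just an illustration; use whatever your JDK installer created), the Windows-level settings would look like this:

    JAVA_HOME = C:\Program Files\Java\jdk1.6.0_26
    PATH      = %PATH%;%JAVA_HOME%\bin

After changing these, open a fresh cygwin shell so that the new values are picked up.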


After the setup, my next issue was running the WordCount example. There were two parts. Running the WordCount jar was easy and successful. However, when I tried to use javac to compile the source code of WordCount.java, I got many errors, such as "package cannot be found...". 


After a few hours of struggle, I found the answer: the issue comes from the classpath, which must be given in Windows form (C:\ paths, semicolon separators) even when you run the command in cygwin. Here is the final working javac command in the cygwin bash: 


javac -classpath "C:\cygwin\hadoop-0.20.2\hadoop-0.20.2-core.jar;C:\cygwin\hadoop-0.20.2\lib\commons-cli-1.2.jar" -d playground/classes playground/src/WordCount.java

Another thing to remember is to make sure that you have Java 1.6.x installed in Windows (at the time of this note, the Hadoop version I used was 0.20.2). Java 1.7 will not work.
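For reference, the WordCount.java compiled above is essentially the canonical Hadoop example. A minimal version against the 0.20.x mapreduce API looks roughly like this (reproduced from memory as a sketch, so compare it with the example source shipped in your Hadoop distribution):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sum the counts for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Once the class compiles with the javac command above, packaging it into a jar and running it with the hadoop command works the same way as running the pre-built example jar.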


Thanks to the many discussions about this issue; even though they did not give direct help, they provided some ideas. Hopefully this note can be helpful for someone else on the same road :)

Monday, April 4, 2011

Does SCRUM fit data warehousing?

There has been broad discussion about how agile concepts fit into the landscape of data warehousing. Considering the very special nature of data warehousing (well, many things are special, I agree), concepts like SCRUM seem beneficial for running projects in the data warehousing context. However, can all data warehousing activities fit into a sprint?

Well, I do agree that there are ways to break the activities down into smaller steps so that they fit into sprints. However, the different life cycle stages of a data warehouse may call for different tactics when implementing the SCRUM concept.

When you work with a mature platform, where more than 80% of the major data subject areas, such as people, organization, employee, customer, and products & services, have been populated in the DW, and the need for adding new data sources has been tailing off over the last two years, it is time to consider using SCRUM to manage and control the development activities around the DW. Why? Because the data model, the ETL, and the various rules and guidelines have matured. People are used to the way things are supposed to be done, so it is easy to estimate which activities should fit into each sprint.

If you are in a situation where less than 20% of the data warehouse is populated from source systems, it seems very challenging to use SCRUM. At that stage, the DW team is still struggling with the rules and ways of working, and substantial re-work appears on a weekly or daily basis. In such a case, trying out any agile method can be risky unless the SCRUM team has all key technical developers enrolled and gets full management support (in case of re-work).

What if you are in between these two states (21%-79% of the data populated)? I would look very carefully at what has been populated in the DW. If the majority of the enterprise master data, such as Customer, Products, Organization, and Arrangements, is ready, it is safe to consider SCRUM, involving the key technical developers throughout the process. Otherwise, consider a more classical DW approach.

By the way, this book may be interesting to read.

Saturday, April 2, 2011

Is data valuable for an enterprise?

There was a recent post on Information Management where people questioned and debated the value of data itself to an organization.

As a general assumption, most enterprises regard data as a "valuable asset." Whatever is kept in their IT systems is useful for creating business benefits. However, people always have to put data into a certain context, such as a business process like "sales," "marketing," or "credit profiling," in order to create value from that data. In that sense, the data itself does not seem to be a valuable asset until the moment it is pulled into a context.

Well, this is an interesting observation, or argument. But I would definitely question what "data" is to an enterprise. When a customer record is written to a database table in an IT system, that piece of data has already been put into a context, i.e., the business definition of a customer to this organization, and the business rules applied to this customer. From that moment on, this piece of data is creating value for the business. How? There are a few reasons.

1. The customer record is in fact hard evidence that a business transaction happened. This transaction information gives the business organization a legal way to protect its business. When a customer comes and asks for a refund or a modification of a certain product or service, the data kept in the system is a legal record of what should or should not be considered. One could argue that I am using a certain context in this example. However, I would then ask: when and how can you find any piece of data that has nothing to do with the business in any context? And if such data exists, why would you put it into the IT system at all?

2. In certain business domains, such as banking and health care, keeping the data (I mean archiving) is a necessity from a regulatory and compliance perspective. If an organization in these domains cannot fulfill this requirement, its business has to stop.

So, data in an enterprise is definitely a valuable asset. Why? Because you will lose your business if you do not keep data.

Friday, April 1, 2011

What is "One Version of Truth” for a Data Warehouse

One version of truth, or single version of truth (SVOT), has been a popular vision among data warehouse developers for many years. Large organizations tend to set one version of truth as a major milestone in the implementation of a centralized data warehouse. So, what does "one version of truth" mean? Is it achievable?

First of all, one version of truth means that all enterprise data is consolidated into a single data warehouse. The data is kept in a consistent and non-redundant way, such that all data coming out of the data warehouse can be understood as the enterprise's common view of information. For example, if there are minor differences in the organization hierarchy data across business areas, the data from the data warehouse should be considered the correct, commonly accepted, enterprise-wide agreed organization hierarchy.

In many cases, the interpretation of single version of truth is stretched so that one can have a "federated" view of different versions of truth in the single data warehouse. This implies that the data warehouse tends to give away the governance of certain business logic in order to maintain the appearance of "centralization" at a technical level. I think Malcolm Chisholm's article "There Is No Single Version of the Truth" is in fact quite valid in the real data warehouse world. The only way to achieve a "single version of the truth" is to have the agreement and governance process at the business level.

However, when you manage to get all business units to agree on almost everything about the organization's data, are we losing the power of being different and of thinking outside the box? The basic nature of a successful business is that people try to work out innovative ways, new concepts, and new views of old things.

In my view, one version of truth is achievable in certain sectors, such as the military and the public sector. In an enterprise where it is important to make business innovations and improvements, one version of truth sounds more like an item on the road map...