Monday, April 9, 2012

The Need for an Overview of Enterprise Data Lineage

The concept of data lineage has proliferated in the Information Management world due to the need for more detailed, quality-focused metadata about data flows and deliveries across enterprise IT systems.

In a large enterprise where regulatory reporting, dashboards, scorecards, and analytics are widely used, different IT systems need to deliver their data to fulfill the various purposes of business intelligence applications. As a best practice, such an enterprise will already have established a data warehouse team to maintain a central repository of all kinds of data. Using a data warehouse keeps the architecture of enterprise data flows quite simple: the warehouse acts as a centralized hub that collects all kinds of input and delivers outputs to various parties. In such a case, managing data flows and data deliveries across the enterprise seems simple (although still very challenging), because all management focus can be put around the data warehouse. Data lineage in such an architecture concentrates on the lineage of data before and after the data warehouse.

A very good example is the data lineage tooling inside IBM InfoSphere Metadata Workbench. In the data lineage tool, a source delivery file can be connected to the ETL job that loads this file into a data warehouse table. That table can then be connected to the ETL job that loads it into the data mart tables, which are in turn extracted and loaded into BI reports. With naive (yes, naive, not native) support for defining external applications and extension mappings, the data lineage can be (manually) extended to external systems and tools so that it covers the whole data life cycle: starting from the front-end system where the data was initially created, through the deliveries, the data warehouse, the data marts, and the BI applications, not to mention the various ETL jobs, data quality checks, etc., along this data-to-information flow.
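
To make the chain above concrete, here is a minimal, hypothetical sketch of how such a lineage can be represented as a directed graph of assets and flows. This is plain illustrative Python, not the Metadata Workbench API, and all asset names are invented for the example.

```python
# A minimal, hypothetical lineage chain modelled as a directed graph.
# Asset names are invented for illustration; nothing here is taken from
# InfoSphere Metadata Workbench or any real system.

lineage_edges = [
    ("source_delivery_file", "etl_job_load_dwh"),
    ("etl_job_load_dwh",     "dwh_table_customer"),
    ("dwh_table_customer",   "etl_job_load_mart"),
    ("etl_job_load_mart",    "mart_table_customer"),
    ("mart_table_customer",  "bi_report_customer_overview"),
]

# "Extension mappings" to external applications are just extra, manually
# maintained edges, e.g. the front-end system where the data was created.
extension_edges = [
    ("frontend_data_entry_screen", "source_delivery_file"),
]

# Adjacency list: each asset maps to the assets that directly consume its data.
lineage_graph = {}
for source, target in lineage_edges + extension_edges:
    lineage_graph.setdefault(source, []).append(target)

print(lineage_graph)
```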

The data lineage functionality in InfoSphere Metadata Workbench gives us a good example of how enterprise users can benefit from having such a view across different architectural elements. Here is a short list:
  1. The traceability of technical elements brought by the data lineage functionality can reduce the cost of system maintenance and support. For example, a developer in the system management group can easily find all the tables and jobs related to a piece of "problematic" data.
  2. The cost of doing impact analysis can be dramatically reduced if the data lineage information is available. Whenever a change is to be applied to an architectural element, estimating its impact in terms of cost, or even feasibility, can save a great deal of budget (see the sketch after this list).
  3. The capability to align architecture designs and to consider optimal refactoring options is further strengthened if a detailed data lineage map is available. It is much easier for an IT architect to weigh design options if the relevant data flows, together with performance and throughput data, are presented on a diagram (which makes the IT architect's job more or less similar to a construction architect's work).
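
As a rough illustration of points 1 and 2, here is a small, hypothetical sketch (again plain Python, not a feature of Metadata Workbench or any other product): both traceability and impact analysis boil down to walking the lineage graph upstream or downstream from a given asset. All asset names are invented.

```python
from collections import deque

# Hypothetical lineage graph (same shape as the earlier sketch): each asset
# maps to the assets that directly consume its data.
downstream = {
    "source_delivery_file": ["etl_job_load_dwh"],
    "etl_job_load_dwh":     ["dwh_table_customer"],
    "dwh_table_customer":   ["etl_job_load_mart"],
    "etl_job_load_mart":    ["mart_table_customer"],
    "mart_table_customer":  ["bi_report_customer_overview"],
}

def reachable_assets(graph, start):
    """Breadth-first walk: every asset reachable from `start`, i.e. every
    job, table, and report that a change to `start` could affect."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Impact analysis (point 2): what is affected if the warehouse table changes?
# Yields the mart ETL job, the mart table, and the BI report.
print(reachable_assets(downstream, "dwh_table_customer"))

# Traceability (point 1) is the same walk on the reversed graph: starting
# from a BI report, it yields every upstream job, table, and source file.
upstream = {}
for source, targets in downstream.items():
    for target in targets:
        upstream.setdefault(target, []).append(source)
print(reachable_assets(upstream, "bi_report_customer_overview"))
```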
Having seen the benefits of data lineage, we need to come back and ask whether we should maintain an overview of the whole data lineage across the enterprise IT systems, instead of only the lineage around the data warehouse.

Typically, most data needs can be fulfilled by requesting a data delivery from the data warehouse to the business intelligence applications. In many enterprises, however, plenty of IT systems act as a "hub" for a certain group of enterprise data and generate outputs by adding "value" to that data. For example, a CRM system or an accounting system normally needs to take data such as customers, branches, and agreements from other systems, and then creates output data with more "corrected," "aligned," and "calculated" information. From an enterprise point of view, the output from such "intermediate" systems is also a main data source for the data warehouse, and the data flows to and from these systems are equally or even more "mission-critical" than those around the data warehouse.

Another example is the need for "operational reporting" or "operational business intelligence" in a real-time or "near real-time" manner. In such cases, an event-based architecture is used to send data directly to the business intelligence applications. No data warehouse is involved (except that an end-of-day summary may be delivered to the data warehouse through a batch window). Such data flows often require great architectural care due to their mission-critical purpose.

So it does make a lot of sense to maintain an overview of the whole enterprise data lineage on top of all IT systems, not just around the data warehouse. As of now, and to my knowledge, there is no software product that fulfills this need completely. We need to cross our fingers and see what the vendors (I expect IBM, ASG, and maybe some start-ups) can bring to the table in the future.
