SQL, BI, and Information Management

Wednesday, August 1, 2012

Purpose of having an Enterprise Architecture

Enterprise Architecture is about maintaining an overview of the whole. It does not matter whether an enterprise means a large corporation, a division, or a department: the purpose of Enterprise Architecture is to keep an eye on the overall interest of this “enterprise.”

So what’s the business case for having an Enterprise Architecture? One sure thing is that people tend to follow the path of “standardization” when an EA process is in the driving seat. This leads to better commoditization of software and systems within an organization, i.e., controllable operating cost and predictable development time. With an established EA process, people tend to communicate better and can make design decisions in a shorter time (this is extremely important for large organizations where escalations and politics are common). And, most important of all, an EA tells people, especially the leaders, which directions to go, which initiatives to take, and which strategies to rely on.

Perhaps a better way to look at this is to consider the situation when EA is not in place. Without EA, an organization typically experiences one or more of the following.

Many locally optimal, rather than globally optimal, solutions exist and cannot be removed
Very high operating cost because things are not globally shared and re-used
Too much dependency on a vendor or a legacy application (no one knows who's using it, no one knows anything about it, no one dares to close it down)
Most solutions are created for short-term purposes and limit the long-term benefit
IT standards are not respected

It is worth noting that even when lots of IT functions are outsourced to different vendors, EA is still vital, and even more critical in such situations. With more than one vendor pressing the CFO for different payments, the EA team must guard the enterprise against becoming a cash cow.

Basics for Understanding PowerPivot

I often get questions about PowerPivot these days. "What is it?" "Why should we have it?" "Can PowerPivot do this for me?"

Well, the best answer is to make a list of the most basic descriptions of PowerPivot and share it with those who need it. So here they are.

What is PowerPivot

PowerPivot is a new data analysis technology that makes previously "impossible" uses of Excel possible. Business analysts no longer always have to wait a few months to see the end result of an IT implementation. Instead, they can build many BI solutions on their own (self-service).

Self-service BI compared to corporate BI

PowerPivot (together with Excel and, most importantly, SharePoint) enables the self-service BI capabilities of a company. But this does not mean the era of Corporate BI is over. Instead, corporate BI will have an even more important status, since the business is now able to specify its requirements more clearly and can even provide a prototype, plus data and results, that can be used to validate the IT implementations.

PowerPivot is not a standalone tool

PowerPivot integrates with Excel 2010 (it needs the size extensions introduced in this version of Excel) and SharePoint 2010.

Data Analysis Expressions (DAX)

This is one of the key extensions of PowerPivot (in addition to the in-memory analysis engine and the compression algorithms). DAX is a collection of formula elements, i.e., functions, operators, and constants, that can be used to build expressions or do calculations. It is also available in Tabular Model projects.

Managed BI collaboration environment

Clearly, PowerPivot has increased the risk of spreadmarts and of too many standalone worksheets across an enterprise. That's why it is critical to establish a managed BI collaboration environment before promoting the self-service BI concept. This involves a lot of process, rule, training, and SharePoint work. With a SharePoint site, the business can do the self-service work while the IT department adds audit trails and analysis to find out the need for optimization, further maintenance, and even development of Corporate BI solutions.

Technical specifications

There is no doubt that PowerPivot includes a good in-memory database. Here is the technical specification that one should always remember. "....The maximum file size of a PowerPivot workbook is 2 GB. There are no restrictions on the amount of data users can import into a workbook, but workbooks exceeding the maximum file size can’t be saved. A 2-GB workbook typically corresponds to a 4-GB dataset, considering a 2:1 compression ratio.... Performance tests show that PowerPivot for Excel can load more than 100 million rows and maintain adequate processing performance with 2 GB of memory. However, test results vary depending on the compressibility of the data. For fastest processing performance, Microsoft recommends multi-core processors and more than 4 GB of RAM...." (this is from the document "Microsoft SQL Server PowerPivot Planning and Deployment").
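The arithmetic in the quote can be sketched in a few lines of Python. This is only a back-of-the-envelope helper: the 2 GB limit and the 2:1 ratio come from the quoted document, while the function names and the 6 GB example are my own assumptions. Real compression ratios depend on the data.

```python
# Figures taken from the quoted planning document.
MAX_WORKBOOK_GB = 2.0            # maximum size of a saved PowerPivot workbook
TYPICAL_COMPRESSION_RATIO = 2.0  # dataset size divided by workbook size (2:1)


def max_dataset_gb(max_workbook_gb=MAX_WORKBOOK_GB,
                   compression_ratio=TYPICAL_COMPRESSION_RATIO):
    """Largest uncompressed dataset that still fits in a workbook of this size."""
    return max_workbook_gb * compression_ratio


def fits_in_workbook(dataset_gb,
                     compression_ratio=TYPICAL_COMPRESSION_RATIO,
                     max_workbook_gb=MAX_WORKBOOK_GB):
    """True if the compressed dataset stays within the workbook size limit."""
    return dataset_gb / compression_ratio <= max_workbook_gb


print(max_dataset_gb())                              # 4.0
print(fits_in_workbook(6.0))                         # False at 2:1
print(fits_in_workbook(6.0, compression_ratio=4.0))  # True at a hypothetical 4:1
```

The last line shows why the quote adds that results vary with compressibility: the same 6 GB dataset fits or does not fit depending on the ratio.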

Key elements in PowerPivot for SharePoint

The key elements are the Timer service (which refreshes the data automatically), the PowerPivot web services, and the PowerPivot Management Dashboard.

The best picture that shows the roles in BI at enterprises

See the picture at the following URL: http://i.technet.microsoft.com/dynimg/IC396763.jpg

The real key to success

A focused effort to deliver training, technical documentation, and support to the "power users" and business users is the really critical factor in the success of self-service BI. A good practice is to build a knowledge base of common questions and training materials.

Understand the performance gain and loss

What PowerPivot provides is great performance as long as everything is in memory. When initially getting the data from somewhere, network bandwidth is a potential bottleneck. In that case, do not blame PowerPivot, but try to decentralize the data, just like reducing the heavy traffic around the central station of your city.

Security and access

PowerPivot definitely brings challenges to those who work with data access and security control. Because it is a "fat-client" layer in the enterprise landscape, it is very important to identify the security needs and use SharePoint as a mitigation layer.

Tuesday, July 31, 2012

6 Questions for Understanding Enterprise Architecture

Many companies are implementing enterprise architecture these days. IT employees or “senior” IT employees are upgraded to the “shining” title of “Enterprise Architect.” Is Enterprise Architect just another rank for deciding the seniority of employees, or does it really create benefit for a company?

The best way to find the answers is to understand what Enterprise Architecture is.

Wait a minute. I am not going to start writing massive content about the definitions, theories, and so on. Instead, I will list the key questions that one should ask about Enterprise Architecture. How about the answers? Well, they are in wiki pages, Google results, books, presentations, and, most important of all, your own analysis and judgement.

1. To understand Enterprise Architecture, the first question is “what is Enterprise Architecture?” Or, one could start even one step back. What is an “Enterprise?” Is it a multinational corporation, an IT department, or a whole business division? And what is “architecture,” or what is “IT architecture?”

2. Why do we need Enterprise Architecture? Of course we know that it is about fulfilling the business requirements. But there can be different ways of meeting business requirements. Why do so many of us choose the EA approach?

3. When talking about IT architecture or software architecture, many books and articles mention different levels, or levels of abstraction. For example, there can be business architecture, solution architecture, reference architecture, data or information architecture, operation architecture, technology architecture, etc. What are the right ways of understanding these levels? How are they connected?

4. Are there any existing concepts, frameworks, guidelines, and systems for Enterprise Architecture? (Oh yes, there are plenty, and more are coming.) What are their differences?

5. How about a short history of Enterprise Architecture? How did we arrive at an era where we all need EA? What was there before? How did enterprise IT evolve over the past many years? Besides the history, what progress and new topics are there in this discipline? Where do I find the latest trends, and what are the trends now?

6. What about the business case for implementing Enterprise Architecture? How and why did so many companies agree to implement EA?

OK. Enough about the questions. Here are two exercises for once all the above questions are answered.

E1. Treating IT systems and solutions as a commodity is a natural way to save cost in any enterprise. More and more companies see the need to use "out of the box" software rather than developing IT systems themselves as a "piece of art" that can only serve specific purposes. In parallel, IT software has evolved to enable simple and quick customizations, so that a standard product quickly fits well into an organization as a compromise between the end users' "never satisfied needs", the business sponsors' focus on "cost reduction and shorter time-to-market", and the IT technicians' awareness of "quality and ease of maintenance". Given this scenario (which is very typical in companies), what is EA’s role? How should EA solve, resolve, or manage such scenarios?

E2. Many companies focus on cost-saving, and EA is a great area for that. But is this enough? How about getting more profit? Can we use EA to gain profit for the company?

Tuesday, April 17, 2012

Initial Impression of Metadata Tools in Information Server 8.7

I just got my hands on a sandbox installation of Information Server 8.7. After looking at the InfoCenter and the Redbook about metadata management, I have formed an initial view of the new metadata offerings in Information Server 8.7.

Just to add a bit of background here: I have been using Information Server 8.1.x for the past 4 years.

In InfoSphere Business Glossary, I can see that the BG "Editor" and BG "Browser" interfaces are merged into a single user interface, which is a good improvement on the user experience side. The new UI looks more "modern" and gives an easy overview of terms, categories, and IT assets. In addition, Business Glossary has added a workflow feature that enables an "approval workflow" for managing BG terms. This workflow is disabled in the default installation.
The meta-model in the Metadata Server has been extended with a few new elements. Among others, the most exciting one to me is the Logical Data Model. In the 8.1.x version, only physical data models (database table schemas) could be imported into the Metadata Server. Now a logical data model created in InfoSphere Data Architect, including the documentation in IDA, can be imported into the Metadata Server. This is a very good step towards making BG and Metadata Workbench more widely accepted and used by both business users and IT developers in enterprises.
Metadata Workbench looks similar to the previous versions. The "Administration" tab in Metadata Workbench has an improved interface. And as I have heard (not tried myself yet), the data lineage and business lineage functionality has been improved a lot, as has the automated metadata service (which was a pain in the 8.1.x version).
A new tool, "Metadata Asset Management," seems to provide an excellent way of administering metadata in an enterprise setup. This tool allows users to import and compare different types of metadata, such as data models, data files, and BI reports. As described in the Redbook, one can use this tool to design an enterprise metadata environment for all kinds of metadata related to the data integration process.
istool seems to keep the same functionality as before and adds a "reporting asset" for import/export.

I also heard that Business Glossary provides more ways to integrate with other tools. And the "Blueprint Director" is a very interesting architecture tool. Hopefully I will post more findings when I begin to do more work with this sandbox in the coming days :)

Monday, April 9, 2012

The Need of Overviewing Enterprise Data Lineage

The concept of data lineage has become widespread in the Information Management world due to the need for more detailed, quality-focused metadata about the data flows and deliveries across enterprise IT systems.

In a large enterprise where regulatory reporting, dashboards, scorecards, and analytics are widely used, different IT systems need to deliver their data to fulfill the various purposes of business intelligence applications. As a best practice, such an enterprise would already have established a data warehouse team to maintain a central repository of all kinds of data. Using a data warehouse makes the architecture of enterprise data flows quite simple. The warehouse is just like a centralized hub that collects all kinds of input and delivers outputs to various parties. In such a case, managing data flows and data deliveries across the enterprise seems simple (although still very challenging) because all management focus can be put around the data warehouse. Data lineage in such an architecture is focused on the lineage of data before and after the data warehouse.

A very good example is the data lineage tool inside IBM InfoSphere Metadata Workbench. In the data lineage tool, a source delivery file can be connected to an ETL job that takes this file to the data warehouse table. Then the table can be connected to the ETL job that loads the data warehouse table into the data mart tables, which are then extracted and loaded into BI reports. With a naive (yes, not native but naive) support for defining external applications and extension mappings, the data lineage can be (manually) extended to external systems and tools to include the whole data life cycle, i.e., starting from the front-end system where the data was initially created, to the deliveries, the data warehouse, the data mart, and the BI applications, not to mention the various ETL jobs, data quality checks, etc., during the "life cycle" of this data-to-information flow.

The data lineage functionality in InfoSphere Metadata Workbench does give us a good example of how enterprise users can benefit from having such a view across different architectural elements. Here is a short list:

The traceability of technical elements brought by the data lineage functionality can reduce the cost of system maintenance and support. For example, a developer in the system management group can easily find all the tables and jobs that are related to a piece of "problematic" data.
The cost of doing impact analysis can be dramatically reduced if the data lineage information is available. Whenever a change is to be applied to an architectural element, being able to estimate the impact in terms of cost, or even feasibility, can save a lot of budget.
The capability of aligning architecture designs and considering optimal refactoring options can be further strengthened if a detailed data lineage map is available. It is much easier for an IT architect to choose between design options if the relevant data flows, together with the performance and throughput data, are presented on a diagram (which makes the job of an IT architect more or less similar to a construction architect's work).
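The traceability and impact-analysis benefits above can be illustrated by treating lineage as a directed graph. The following is a minimal Python sketch, not how Metadata Workbench works internally: the asset names are hypothetical and simply mirror the file, ETL job, warehouse table, data mart, and report chain described earlier.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it feeds.
LINEAGE = {
    "source_file": ["etl_load_dw"],
    "etl_load_dw": ["dw_table"],
    "dw_table": ["etl_load_mart"],
    "etl_load_mart": ["mart_table"],
    "mart_table": ["bi_report"],
    "bi_report": [],
}


def downstream(graph, start):
    """Breadth-first walk: everything affected by a change to `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(graph):
    """Reverse every edge so that the same walk can run upstream."""
    inverted = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(node)
    return inverted


def upstream(graph, start):
    """Everything that contributes to `start` (traceability)."""
    return downstream(invert(graph), start)


# Impact analysis: what is touched if the warehouse table changes?
print(downstream(LINEAGE, "dw_table"))
# Traceability: where does a "problematic" data mart table come from?
print(upstream(LINEAGE, "mart_table"))
```

The same traversal answers both questions; only the direction of the edges differs.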

Having seen the benefits of data lineage, we need to come back and ask whether we need an overview of the whole data lineage across the enterprise IT systems instead of only the lineage around the data warehouse.

Typically, most data needs can be fulfilled by requesting a data delivery from the data warehouse to a business intelligence application. In many enterprises, however, lots of IT systems act as a "hub" for a certain group of enterprise data and generate outputs by adding "value" to the data. For example, a CRM system or an accounting system normally needs to take data such as customers, branches, and agreements from other systems and then creates output data with more "corrected," "aligned," or "calculated" information. From an enterprise point of view, the output from such "intermediate" systems is also a main data source for the data warehouse. And the data flows to and from these systems are as "mission-critical" as those of the data warehouse, or even more so.

Another example is the case where there is a strong need for "operational reporting" or "operational business intelligence" in a real-time or "near real-time" manner. In such cases, an event-based architecture is used to send data directly to the business intelligence applications. No data warehouse is involved here (except that an end-of-day summary may be delivered to the data warehouse through a batch window). Such a data flow often requires great architectural care due to its mission-critical purpose.

So, it does make a lot of sense to maintain an overview of the whole enterprise data lineage on top of all IT systems, not just around the data warehouse. To my knowledge, there has so far not been any software product that can fulfill this need completely. We need to cross our fingers and see what the vendors (I expect IBM, ASG, and maybe some start-ups) can bring to the table in the future.

Friday, April 6, 2012

How to start the implementation of InfoSphere Business Glossary

InfoSphere Business Glossary (IBG or BG) is a very useful "Data Concept Management Tool" for connecting all users of data in an enterprise. BG is part of the IBM Information Server tool suite and is intended to be used as an enterprise glossary of business concepts and terms. With the metadata interchange support and the Information Server Metadata Server, a business term described in BG can be connected to a technical asset such as a column in a table, an ETL job that populates a table, a field in a Business Intelligence report, or a flat file contained in a delivery between enterprise applications. Another key feature of BG is the data steward. A data steward (a person, role, or group) can be assigned as the owner of a business term defined in BG. This provides an opportunity to create an enterprise data governance structure. The ownership concept provides an enterprise-wide awareness of data and data quality.

Implementation of BG normally requires a team of allocated resources (e.g., a project team or a task force) to start an initial "enabling" phase that defines the work process and actually implements a substantial number of key business terms and concepts. The following is a list of key issues or topics that such a team must work with in order to create a successful initial implementation of Business Glossary.

1. Creating a data owner/steward role in the enterprise
    Obviously, creating terms without knowing whether they are right or wrong does not make any sense at all. The enterprise must have employees who care about and use the data that a term describes. Designing a role like data steward will not cost the employee a huge amount of extra working hours. In the long term, the data owner can guard the data and its quality, which in return helps the enterprise make the right business decisions from the reports, dashboards, and mining results that are based on the data.

2. Designing a basic category structure in BG
   Business Glossary is a very simple tool. All business terms are located in categories. The categories form a tree structure. A term physically located in one category can be "referred to" by another category. A term can be connected to other terms as synonyms.
   Generally, a good structure in BG contains top categories that classify the business concepts. For example, there can be a category called "Customer" which includes all kinds of business terms and concepts regarding the customers of this enterprise. If all the top categories in BG are about such major business conceptual areas, for example, "Customer," "Branch," "Product," "Agreements," etc., any user in the enterprise will find it easy to browse around and understand all the key business areas and business concepts. Besides, the search functionality in BG is good enough for any newcomer to look around.
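   To make the structure concrete, here is a small Python sketch of such a category tree. It only illustrates the concepts described above (terms located in categories, references from other categories, synonyms); the class names and sample terms are made up, and this is not the BG data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    synonyms: list = field(default_factory=list)  # names of synonym terms


@dataclass
class Category:
    name: str
    terms: list = field(default_factory=list)       # terms physically located here
    referenced: list = field(default_factory=list)  # terms located elsewhere, referred to here
    children: list = field(default_factory=list)    # sub-categories

    def find(self, term_name, path=()):
        """Yield the path of every category that contains or refers to the term."""
        here = path + (self.name,)
        for term in self.terms + self.referenced:
            if term.name == term_name:
                yield " / ".join(here)
        for child in self.children:
            yield from child.find(term_name, here)


# Hypothetical content: one term located in "Customer", referred to by "Agreements".
customer_number = Term("Customer Number", synonyms=["Customer ID"])
customer = Category("Customer", terms=[customer_number])
agreements = Category("Agreements", referenced=[customer_number])
glossary = Category("Glossary", children=[customer, agreements])

print(list(glossary.find("Customer Number")))
# ['Glossary / Customer', 'Glossary / Agreements']
```

   The term exists once but is visible from two business areas, which is what keeps browsing by top category easy.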

3. Designing the work processes around BG
    Typically, when users start to use BG, there will be requirements on how to let data stewards edit the terms, how to manage the life cycle of terms (in BG, a term can be "standard," "accepted," "candidate," or "deprecated"), and how to make sure that all content is backed up, etc. The maintenance team of the BG content can expect many more such questions.
    In addition, when BG is used and maintained in a strict enterprise environment with TEST, FREEZE, and PRODUCTION instances, the work processes become even more important.
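    As an illustration of what a term life-cycle rule could look like, here is a short Python sketch. The four status names are the ones BG offers; the allowed transitions are purely my assumption of one possible work process, not something the tool enforces out of the box.

```python
from enum import Enum


class TermStatus(Enum):
    CANDIDATE = "candidate"
    ACCEPTED = "accepted"
    STANDARD = "standard"
    DEPRECATED = "deprecated"


# Hypothetical promotion rules; each team would agree on its own.
ALLOWED = {
    TermStatus.CANDIDATE: {TermStatus.ACCEPTED, TermStatus.DEPRECATED},
    TermStatus.ACCEPTED: {TermStatus.STANDARD, TermStatus.DEPRECATED},
    TermStatus.STANDARD: {TermStatus.DEPRECATED},
    TermStatus.DEPRECATED: set(),
}


def change_status(current, new):
    """Return the new status, or raise if the work process forbids the change."""
    if new not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {new.value} is not allowed")
    return new


status = change_status(TermStatus.CANDIDATE, TermStatus.ACCEPTED)
print(status.value)  # accepted
```

    Writing the rules down in this form, even on paper, forces the team to decide who may promote a term and when.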

4. Marketing and making organizational implementation
    A tool like Business Glossary does not contain any fancy or complicated functionality that will make all users of the enterprise suddenly start enjoying it and cheering every time they find it useful. The team must go out and communicate with almost everyone in the enterprise in order to give the tool a good start. Here, the data stewards should be the first group to get familiar (and comfortable) with the tool and to start (with their charm) influencing other enterprise users as "Business Glossary advocates" until everybody likes it.

5. Starting populating the terms
    Normally an enterprise has several key business concepts that are used throughout almost all activities and documentation, such as customer, internal unit, employee, etc. It is very unlikely that no employee cares about the terms and concepts in these core conceptual areas. In other words, it should be possible, and probably easy, to define and describe terms and assign data stewards for these core concepts. Dedicating time and resources to populating terms in such core business conceptual areas has a huge positive impact on the success of Business Glossary.

The idea behind a Business Glossary is to use the power of "crowdsourcing" to improve and align the organization's use of data and information. It paves the way for successful business intelligence implementation in the long run. Hopefully, we can expect more social and collaborative features from Business Glossary in the future.

Friday, December 30, 2011

Resizing Calendar webpart in MOSS 2003 and MOSS 2007

I ran into this problem when trying to drag a calendar webpart onto a MOSS 2003 page (yeah, it's MOSS2011 now, but some people still prefer to use MOSS2003). To resize the Calendar webpart, I found an excellent post at http://blog.pathtosharepoint.com/2008/10/06/tiny-sharepoint-calendar-1/.

The key part of the solution is to use the Content Editor webpart to override the default style sheet for the calendar webpart.

Running WordCount example in Hadoop installation at cygwin

I couldn't find a spare PC to install Ubuntu on. So cygwin on my Windows PC seemed to be the only way to run Hadoop (I did try to install Ubuntu desktop on the same PC, but then there was a wireless adapter error). I followed most of the steps that one can find on Google. To me, the most helpful notes come from http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/

My setup: Windows 7 Professional (32 bit), Dell Studio, with a single-node Hadoop installation in the cygwin environment.

I think most of the steps described in http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/ are still quite correct. But when setting up $JAVA_HOME, I would suggest using the Environment Variables settings in Windows. The setup in conf\hadoop-env.sh did not work for me. I also set the $PATH variable in Windows to point to the Java/jdk1_..../bin folder. The reason is that the javac and java commands running in cygwin are actually the ones from the Java SDK installed in Windows.

After the setup, my next issue was to run the WordCount example. There were two parts. Running the WordCount.jar was easy and successful. However, when I tried to use javac to compile the source code of WordCount.java, I got many errors, such as "package cannot be found...".

After a few hours of struggle, I found the answer. The errors come from the classpath, which must be written in Windows style even when you run the command in cygwin. Here is the final solution for running the javac command in cygwin bash.

javac -classpath "C:\cygwin\hadoop-0.20.2\hadoop-0.20.2-core.jar;C:\cygwin\hadoop-0.20.2\lib\commons-cli-1.2.jar" -d playground/classes playground/src/WordCount.java
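Since javac here is a Windows program, it expects Windows-style paths separated by semicolons rather than the colon-separated POSIX paths that cygwin bash uses. The following Python sketch shows how the command above is put together. The install location and jar names are the ones from my setup; the helper function names are made up for illustration.

```python
import ntpath  # applies Windows path rules regardless of where this script runs

HADOOP_HOME = r"C:\cygwin\hadoop-0.20.2"  # install location used in the command above


def windows_classpath(jars, base=HADOOP_HOME):
    """Join jar paths the way a Windows JDK expects: backslashes and ';' separators."""
    return ";".join(ntpath.join(base, jar) for jar in jars)


def javac_command(jars, out_dir, source):
    """Build the javac argument list; the classpath stays one single argument."""
    return ["javac", "-classpath", windows_classpath(jars), "-d", out_dir, source]


cmd = javac_command(
    ["hadoop-0.20.2-core.jar", r"lib\commons-cli-1.2.jar"],
    "playground/classes",
    "playground/src/WordCount.java",
)
print(" ".join(cmd))
```

When typing the command by hand in bash, the quotes around the classpath matter: without them the shell would treat each semicolon as a command separator. Cygwin also ships the cygpath utility, whose -w and -p options convert a POSIX path list to Windows form, which may be a convenient alternative.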

Another thing to remember is to make sure that you have Java 1.6.x installed in Windows (at the time of this note, the Hadoop version I used is 0.20.2). Java 1.7 will not work with this setup.

Thanks to the many discussions about this issue; even though they did not solve it directly, they gave me some ideas. Hopefully this note can be helpful for someone else on the same road :)