Sunday, November 3, 2013

Understanding Data Analysis: Part Three

Data analysis is a basic skill for any business person at a modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before reaching a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during data analysis. Learning the data analysis process and its concepts goes a long way toward understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the third part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis


Part III: Mathematics, Statistics, and Data Mining

One very basic professional foundation for every data analyst is education in mathematics, statistics, and data mining. There is a saying that a purely mathematical education is no longer useful in the real world. In fact, this "saying" is completely wrong: plenty of mathematical models, methods, and theories are applied directly to real-world scenarios. What may be true is that if one did not pay enough attention during the mathematics lectures, one tends to lose the key part of how to use mathematics in the real world. But everyone is better at blaming the theories than at regretting what was missed in the courses.

Theories cannot talk back (or blame); they only deliver value.

One good example is the histogram. The histogram has long been a powerful tool in the mathematical and statistical world, and it is just as useful in the data analyst's world: it is a powerful way to understand how data is distributed across different groupings. Most data analysis involves large groups of numbers, and a histogram is a good place to start the observations when the analysis begins from a blank piece of paper. Both Excel and R provide good support for histograms. The Data Analysis add-in in Excel can help draw histograms. Using R is also a good choice; R is in fact more effective for diagramming tasks like this and better at finding the groupings in the data set.
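
As a minimal sketch in R (the order-amount figures below are simulated purely for illustration), a histogram takes only a few lines, and the breaks argument controls the groupings:

    # A minimal sketch: histogram of simulated order amounts in base R.
    set.seed(42)
    order_amounts <- rlnorm(500, meanlog = 4, sdlog = 0.6)   # 500 invented order values
    # hist() both computes the groupings (bins) and draws the diagram.
    hist(order_amounts,
         breaks = 20,                       # requested number of bins; R may adjust slightly
         main   = "Distribution of order amounts",
         xlab   = "Order amount")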

Another data analysis scenario where mathematics is used very often is making predictions. Making predictions amounts to finding the best mathematical model for the problem scenario. There will never be a perfect model or equation for every scenario or every object; in the end, it is still a guess. An analyst can always start with the most basic mathematical approach, linear regression. If correlations can be found between certain variables, linear regression is a good way to make predictions. However, not many things in the real world actually follow a linear model. So an important step for the analyst is to draw scatter plots and see whether any mathematical model can efficiently represent the trend. It is also quite possible that some of the data points on the plot can be represented with a linear regression while the others cannot. Just as in the normal analysis steps, the analyst needs to break the problem into pieces: where linear regression works and where another model works. Of course, different non-linear regression models can also be tested in such cases. Back at university, a very good course called "numerical analysis" presents the basic mathematical models for this scenario.
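
As a hedged sketch (the ad_spend and sales columns are invented names and numbers), trying a linear regression with a scatter plot in R is quick:

    # Invented example data: advertising spend vs. sales.
    set.seed(1)
    ad_spend <- runif(60, 1000, 10000)                    # simulated predictor
    sales    <- 2.5 * ad_spend + rnorm(60, sd = 3000)     # simulated response with noise
    d <- data.frame(ad_spend, sales)

    plot(d$ad_spend, d$sales, xlab = "Ad spend", ylab = "Sales")   # scatter plot first
    fit <- lm(sales ~ ad_spend, data = d)                 # fit the linear regression
    abline(fit, col = "red")                              # overlay the fitted line
    summary(fit)                                          # slope, intercept, R-squared
    predict(fit, newdata = data.frame(ad_spend = 12000))  # use the model for a prediction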

Statistics is also a frequently used tool in a data analyst's daily life. For example, the practice of falsification, i.e., hypothesis testing, is a very good method for dealing with heterogeneous sources of information, such as rumors that you have heard from various places. Do not underestimate the power of analyzing rumors. In the business world, where more than 99% of what you hear is rumor and the rest is only surprises (yes, there is no plain truth in the business world), making the right "prediction" and "understanding" of the near-future business trend means that the analyst must sit down and work with rumors. Navigating the ocean of lies and rumors, the analyst must learn to draw things on paper and try to link them into different relationships. Different variables can be negatively or positively linked in such diagrams, and very often the analyst ends up with a network diagram after this session. This is absolutely right: things in the real world are linked through causal networks.

The core of hypothesis testing is falsification. The analyst should focus on eliminating disconfirmed hypotheses rather than picking the right one. What to do when more than one hypothesis is left after you have tried everything to eliminate them? A simple approach is to decide which one has the strongest support based on all the available evidence. Evidence is diagnostic if it helps you rank one hypothesis as stronger than another. The analyst needs to look at each hypothesis in comparison with each piece of evidence, and with the other hypotheses, and see which has the strongest support. Very often, what the analyst needs is a ranking matrix: each piece of evidence is given a score against each hypothesis, and some pieces of evidence can carry more weight than others. How much stronger or weaker is decided by the analyst. This is not rocket science; we are just giving a statistically strongest vote to one of the alternative hypotheses. One very important thing to remember is that new evidence can arrive (and it may be strong enough to change the overall ranking of all the candidate hypotheses). It is the analyst's role to make sure that new evidence is included in time and that the latest changes are communicated.
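
As a tiny illustration (the hypotheses H1 to H3 and the evidence scores are entirely made up), such a ranking matrix can be kept and summed in R:

    # Made-up ranking matrix: rows are pieces of evidence, columns are hypotheses.
    # Scores reflect the analyst's judgement, e.g. +2 strongly supports, -2 strongly disconfirms.
    scores <- matrix(c( 2, -1,  0,
                        1,  2, -2,
                       -1,  1,  1,
                        2,  0, -1),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste("evidence", 1:4), c("H1", "H2", "H3")))
    colSums(scores)                                     # total support per hypothesis
    # New (possibly strong) evidence is just another row; the ranking may change.
    scores <- rbind(scores, "evidence 5" = c(-2, 3, 0))
    colSums(scores)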

Besides hypothesis testing, I would recommend learning about Bayesian statistics and the practice of subjective probabilities. In fact, subjective probability is a very practical tool in the real world. When an analyst is not sure about a judgement or a conclusion (which happens very often), she or he normally tends to use words like "probably," "perhaps," and so on. That is the time to consider using subjective probability: ask for a percentage number instead of these words. For example, "90% probability" is a subjective probability.

With a set of subjective probabilities collected from a group of analysts, it is very easy to use all kinds of statistical methods to further analyze the scenario. One good tool is the standard deviation. The standard deviation measures how far typical points are from the average (or mean) of the data set. Given the set of probabilities collected from different analysts, the standard deviation tells how different the opinions are.
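
For example (with invented numbers), given the probabilities that five analysts assign to the same event, a couple of lines of R summarize how much they agree:

    # Invented subjective probabilities from five analysts for the same event.
    p <- c(0.90, 0.75, 0.85, 0.60, 0.80)
    mean(p)   # the group's average belief
    sd(p)     # spread of opinions; a large value signals strong disagreement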

There are many other powerful tools and theories in the statistical world for every data analyst.

Data mining builds on various mathematical and statistical theories and methods. Computer scientists tend to "computerize" the theories and turn them into computing models that are ready for real-world use. Details about data mining methods will be posted in future blogs (where I will concentrate on each method in turn). Here is a list of typical data mining tools and theories used in the data analysis world; a small clustering sketch in R follows the list.

  • Bayesian Network
  • Decision Tree
  • Neural Network
  • Support-vector Machines
  • K-nearest Neighbors
  • Clustering
  • Association Rules
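
As one concrete taste of the list above, here is a minimal clustering sketch in base R using the built-in iris data set (k-means, with the number of clusters simply chosen by hand):

    # Minimal clustering example: k-means on the built-in iris measurements.
    data(iris)
    features <- scale(iris[, 1:4])           # standardize the four numeric columns
    set.seed(7)
    km <- kmeans(features, centers = 3)      # ask for three clusters
    table(km$cluster, iris$Species)          # compare found clusters with the known species
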
Having a background in mathematics, statistics and data mining has always been a career foundation for data analysts. The techniques and methods mentioned in this blog entry are only a small portion of the whole ocean of the mathematical world. The important thing for a data analyst is to keep learning new tools and to try applying them to real-world scenarios. In the end, it is the business value that decides the success of all techniques, methods and theories.

Sunday, October 27, 2013

Understanding Data Analysis: Part Two

Data analysis is a basic skill for any business person at a modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before reaching a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during data analysis. Learning the data analysis process and its concepts goes a long way toward understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the second part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis

 

Part II: Test, Optimize and Visualize

While following the four steps described in my previous entry, every analyst faces the problem of defending her/his decision input against the hard, bloody reality of real-world results. All analytical models, decisions and recommendations have to survive examination once they are deployed and used in the real world. Time proves the truth.

But what about fine-tuning and adaptively adjusting a defined analytical (or mental) model? The world evolves (if not revolutionizes), and so do the mental models. How do we give analysts opportunities to adjust the models and improve them?

Test, test, and test. Yes, test your models. And by "test" I do not mean using only simulations and paper-based examples built from empirical data. You must test your model in the real world and run the experiments in a real scenario.

In many real-world scenarios, testing makes very good sense for solving the problem in a sustainable manner. A well-planned and well-executed experiment gives powerful proof of the analyst's judgements. How to carry out the experiment? The most basic element of experimenting is the survey. Running surveys before and after a planned test activity lets the analyst gain more insight into how effective the test activity has been.

One good example in Michael's book is to use a few local coffee shops to test analytical models that can contribute to a nationwide deployment to all the coffee shops. Before running the experiments, the analyst must define which factors to observe and compare, and baseline data must be prepared so the results can be compared. A set of control groups, in contrast to the experimental groups, must be established to make the comparison possible. Selecting the control groups requires great care: the analyst must make sure that other confounders do not blend into the experiment and that all the groups are "equally" the same on the confounders. After the groups are defined, the remaining steps are simple and direct. One just needs to execute the experiments and collect the results for comparison, and the final conclusion will be clear.
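
A minimal sketch of that final comparison in R, assuming per-shop sales figures collected from control and experimental coffee shops (all numbers invented):

    # Invented weekly sales figures for control shops and experimental shops.
    control    <- c(102, 98, 110, 95, 105, 101)
    experiment <- c(118, 112, 125, 109, 120, 116)

    mean(experiment) - mean(control)   # observed difference between the groups
    t.test(experiment, control)        # is the difference larger than chance alone would suggest?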

What is a confounder, precisely? A confounder is a difference among the subjects of the analyst's research, other than the factor being compared, that can affect the results and therefore distort the comparison. Confounders can be factors such as location, age group, gender, etc.

Doing data analysis with identified confounders is not very difficult: the analyst just needs to use the confounders to break the data into smaller chunks and test the results within each chunk. Identifying the confounders in a given data set, however, is not easy. Analysts typically use a set of mathematical or statistical tools to work them out.
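
A small sketch in R, using a hypothetical data frame with a location column as the confounder, shows the break-into-chunks idea:

    # Hypothetical data: sales for two groups, with location as a possible confounder.
    set.seed(3)
    d <- data.frame(group    = rep(c("control", "experiment"), each = 40),
                    location = sample(c("downtown", "suburb"), 80, replace = TRUE),
                    sales    = rnorm(80, mean = 100, sd = 15))

    aggregate(sales ~ group, data = d, FUN = mean)              # compare groups overall
    aggregate(sales ~ group + location, data = d, FUN = mean)   # compare within each location chunk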

A very typical method, when the analytical model or the analysis scenario gets complicated, is to define a mathematical or statistical model and optimize it. One thing to make clear is that no mathematical model can be made so perfect that every real-world factor is modelled, represented and "mathematicalized". One always has to drop the factors that are too complicated and focus on the "main thing."

The way an analyst approaches optimization is to define a mathematical model for the problem scenario and then solve it with classical mathematical methods. When the results turn out not to be optimal, the analyst must find out where the model rests on a wrong assumption, whether anything else should have been included in the model, or whether it is the right mathematical model to use at all. There is no perfect model for all scenarios; the analyst's role is to choose the right one for the specific scenario.

Optimization can be done when it is possible to represent the main factors in a mathematical model. There are many data analysis tools (software packages) that can help a data analyst find the optimal result. The Solver add-in in Excel is a good example, and I should also list R, SAS, SPSS, Mathematica, Matlab and so on.

To solve an optimization problem, the analyst must define an objective function, identify all constants and variables, and search for the constraints related to the identified variables. In many cases it brings extra benefit if the objective function, together with the constraints, can be visualized in two- or three-dimensional space. Visualization gives the best chance to verify the analysis and estimate the optimal result.
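
As a toy sketch (the profit function, the numbers and the capacity bounds are all invented), base R's optim() can handle a small problem of this kind with simple bound constraints:

    # Toy optimization: choose two quantities to maximize an invented profit function
    # with diminishing returns, subject to capacity bounds on each quantity.
    profit <- function(q) {
      x <- q[1]; y <- q[2]
      40 * x - 0.5 * x^2 + 30 * y - 0.4 * y^2     # invented objective function
    }
    res <- optim(par    = c(10, 10),              # starting guess
                 fn     = function(q) -profit(q), # optim() minimizes, so negate
                 method = "L-BFGS-B",
                 lower  = c(0, 0),                # cannot produce negative quantities
                 upper  = c(50, 35))              # capacity constraints
    res$par      # the quantities that maximize the invented profit
    -res$value   # the corresponding maximum profit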

In fact, visualization has been broadly used in the scientific (or academic) world to understand observations and facilitate key decisions on research directions. The business world (and the BI industry) has been following this trend in recent years. Good examples are the growing popularity of visual analytics tools such as SAS Visual Analytics, Tableau, and QlikView.

Data visualization means showing your data with diagrams instead of tables of numbers. Visualization is important and extremely useful in most scenarios. Equally, finding the right visualization is vital and decides how the clients receive and accept the observations and conclusions.

Using visualizations does not mean that the analyst needs to make fancy or beautiful diagrams. In fact, the visualization should be clear, simple and straightforward.

How to find the best visualization to pinpoint the analysis? A recommended way is to use scatterplots. Just start by choosing an X and a Y dimension and begin drawing the data points. Very often the analyst will find interesting patterns and be able to define new dimensions. A standard analysis uses an independent variable as the X axis and a dependent variable as the Y axis. Of course there will usually be more than one dependent and independent variable, so the analysis report normally includes diagrams with different combinations of variables, which helps to identify patterns in the data points.
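
A quick sketch in R, using the built-in mtcars data set, shows both a single X/Y scatterplot and the many-combinations-at-once view:

    # Scatter plots in base R on the built-in mtcars data.
    data(mtcars)
    plot(mtcars$wt, mtcars$mpg,                      # independent (weight) vs dependent (mpg)
         xlab = "Weight", ylab = "Miles per gallon")
    pairs(mtcars[, c("mpg", "wt", "hp", "disp")])    # all pairwise combinations at once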

There exist many excellent data visualization tools. Besides Excel, Tableau and other commercial tools, open source tools such as R can be a very good choice.

 

Sunday, July 28, 2013

Understanding Data Analysis: Part One

Data analysis is a basic skill for any business person at a modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before reaching a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during data analysis. Learning the data analysis process and its concepts goes a long way toward understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the first part of the series. This blog entry is motivated by, and largely based on, the following reference.

  1. Michael Milton: Head First Data Analysis

 

Part I: The basic steps

Making decisions in a management position can be either easy or difficult. Using intuition and always betting on one's luck has been a tradition for many "cowboy" style managers. To keep being successful and stay on the right career track, many managers choose a more decent approach and make decisions based on rational, serious analysis. This starts the era of data analysis.

No matter what kind of analysis it is, the very basic steps of any analysis are to define the problem, break it down, take observations and build partial solutions, then assemble the results and draw the final conclusions. Data analysis is about applying these steps to a large set of data at hand.

The information age brings massive amounts of data to every business entity. Data means value and opportunities if you know how to understand and use it. In many situations we find ourselves working as data analysts without knowing the basic techniques and practices of this role. A data analyst knows how to break down a large set of data and give it a structural understanding so that the data becomes intelligence and insight. With a powerful toolbox, a data analyst transforms data into knowledge, which often pushes forward the decisions to be made. Thus, the basic skill or instinct of a data analyst is the capability to understand and structure data.

Given a scenario or just a piece of data, the typical steps that every data analyst will go through can be summarized below.

1. Define the problem

This is the very first step. Knowing your problem is the basic instinct of a data analyst. All analysis is targeted at a goal, that is, the question that should be answered or the problem that should be solved. Without defining the question or the problem, the analysis will lead nowhere. How to define the problem? Just ask! You should of course ask the person who will make decisions based on your analysis; sometimes that person is you, if you are the one making the decision. Don't just ask for the final goal of the problem. Ask what the decision maker really means.

In many scenarios, the decision maker needs the help of a data analyst because she/he

- does not know the data

- knows something about the data but is not totally sure about it

- is not sure about the problem

- is not sure about the scenario around the problem

- is not good at making decisions

- uses more intuition than analytics

The data analyst should start by clearly defining the problem with her/his clients. The client may well bring more problems or requirements after this step, which is totally fine. Changes to the requirements may bring a little overhead in terms of administrative effort, but they also help to narrow down the problem in many cases. A lot of the time, helping the client identify her/his problem is the very first and most important task for the analyst. Many clients are not even aware of the problems they are facing.

There is, of course, a method or concept called "exploratory data analysis." It means the analyst must find valuable hypotheses in the data that are worth evaluating. This already goes beyond a single concrete problem.

2. Disassemble the scenario

To find a solution to a problem scenario, a very typical approach is to divide the problem and scenario into small pieces and analyze them. This step requires the data analyst to break the problem down into smaller pieces in order to start the detailed analysis. It is important to keep breaking it down to the level that best fits the analysis.

To break down the problem, one approach is to create a kind of tree structure, starting with the problem defined in the previous step and extending to different sub-nodes until you reach the leaf level.

This also applies to the data. The data analyst will need more detailed data if the first step only provides a general level summary of the data. More data brings more room for analysis and testing of conclusions.

What is the most important technique to use when looking at data? Making comparisons! In fact, making good comparisons is at the core of data analysis. A data analyst will always find it useful to break the data down by finding good and interesting comparisons.

As a product of this step, I can imagine the data analyst finishing with a document of two sections: the first is the break-down structure of the problem, and the second is a set of rules on how to divide and slice the data.

As the saying goes, you cannot move the whole mountain at once, but you can move it piece by piece.

3. Evaluate your conclusion

This step evaluates all the observations from the previous steps. The key to evaluating is to compare. With all the problems on the left hand side and all the observations on the right, compare, compare and compare!

A compulsory part of this step is for the analyst to insert her/himself into the process. This means the analyst brings her/his own responsibility on board and focuses on getting the best value for the client as well as for the analyst her/himself. While the analyst is indeed betting on her/his credibility, there is also a greater likelihood that the analyst will earn more trust and belief from the clients.

I'd like to quote this paragraph from the book Head First Data Analysis.

"Whether you’re building complex models or making simple decisions, data analysis is all about you: your beliefs, your judgement, your credibility"

4. Make the decision

At this step, the analyst summarizes the conclusions and gives the final decision or recommendation. All the analysis results must be put into a format that makes decision making easy. Otherwise, the analysis was useless.

The analyst must make it simple and straightforward for the decision makers. It is important to get the analyst's voice heard (that is, to make it understandable) so that people can make sensible decisions based on the analyst's input.

One more thing to add is about the analysis report. The analysis report can be as simple as three sections: "background," "understanding of the data," and "recommendation." And again, it must be simple, concise and direct.

 

With all the steps of data analysis in place, will we be ready for all kinds of challenges? Not exactly. There is quite a list of things to be prepared for. One first question will be, "What do I do when my analysis turns out wrong or incomplete?"

Well, not everything can be so easy and safe. In many cases, the initial data analysis report gets bad feedback and results. What could go wrong then?

The first thing to consider is the assumptions made in the previous analysis. Assumptions are usually based on the mental models we build of the scenario. A mental model is how we normally see the world around us; it helps us understand and interpret the information around us. It is like a model in mathematics, where one builds a set of basic elements and describes the world with them. Very often, the analyst's mental model is affected by the mental model of the clients. So an important lesson for the analyst is to make all the mental models, or at least their important parts, explicit in the analysis.

Every part of the analysis, such as the statistical model, the data model, etc., always depends on the analyst's mental model. When the mental model is wrong, all steps of the analysis must be run again to produce a new round of analysis results.

To build the right mental model, the analyst must include both things that are taken as assumption and things that are unknown or uncertain. The "anti-assumption" is very often more important and can lead to new discovery and insights. For example, consider making a list of "things I do not know," "things that I do not know how to do," "things that I have not done," etc. This must be applied to the analysis process and the analyst needs to challenge the client on many aspects of the clients' mental model at the "disassemble" step of the analysis.

And of course, the analyst must ask for new and more detailed data for the new round of analysis. Results from the previous analysis can shed light on where and how to ask for the new data. A very typical mistake many analysts make is to focus too much on the numerical values (also called "measures") in the data. In fact, the row and column header information is even more important to look at. In many cases, how you drill into and mash around the data decides how the data looks in the end.

Analysis never stops. As long as the analyst keeps finding new clues or observations, she/he can continue the four steps forever.

 

Saturday, August 11, 2012

A Simple Tutorial on PowerPivot (4)

5. Publishing and Sharing Reports

After doing the analysis on the data, it is time to share the reports and data with other business users. As mentioned earlier in this tutorial, it is very important to use SharePoint to create a BI collaboration environment in order to keep self-service BI from turning into a pure spreadsheet mart.

Before publishing a PowerPivot workbook to SharePoint, it is normal to consider "beautifying" the workbook so that other users find the report easy to read and use. For example, it is useful to remove or hide worksheets that are not part of the report, to improve the placement and appearance of the worksheets, or even to turn off the grid lines. Another important thing to take care of is the configuration of data refresh. The connections you have in your local PowerPivot for Excel require certain data connection providers, and these connections, after being uploaded to SharePoint, still require the same providers or at least a re-configuration of the data connections. So it is important to install the relevant data connection providers on the server (or ensure that they are installed).

When publishing a PowerPivot workbook to SharePoint, there is an option to turn on document versioning for the SharePoint 2010 site. However, you should be really careful when deciding on this option. After turning on the versioning settings, every change to the PowerPivot workbook is saved to the content database. That means many versions (at almost the same size) of the PowerPivot workbook are saved to the content database, which can leave the system heavily loaded after some time.

To load the workbook to SharePoint, choose "Save & Send" in the Excel ribbon or simply upload the workbook to the site. You should use the PowerPivot Gallery in SharePoint to keep the PowerPivot workbooks.

On the SharePoint server side, you must make sure that the relevant features and system services are installed in order to use PowerPivot. To be more specific, Excel Web Access, Excel Calculation Services, PowerPivot System Services, and SQL Server Analysis Services are involved when users interact with published PowerPivot workbooks in the PowerPivot Gallery. Here one gets the same "slicing-and-dicing" and "sorting and filtering" user experience as in PowerPivot for Excel. As illustrated in the example below, one can still work with the "slicing-and-dicing" feature inside a web page opened in Internet Explorer.


The PowerPivot Gallery is designed for end users of self-service BI. So any user of the workbook, given the appropriate access, can re-use the data and even do further analysis based on it. One can click on a specific sheet in the PowerPivot workbook to further analyze the data with the features provided by Excel Services.

The PowerPivot Gallery has three standard views, i.e., Gallery view, Theatre view, and Carousel view. The following picture shows the differences between these views.


After a PowerPivot workbook is published on a SharePoint site, users of the published workbook can create their own reports based on the PowerPivot workbook data (meaning that the workbook can be used like a data source). When you hover over the icon of a workbook, you can see an option to create a new report. Two options are then available: Open New Excel Workbook and Create Report Builder Report. Below is an example.


When you create an Excel workbook using the PowerPivot workbook data, the data is served from the PowerPivot workbook (in SharePoint) in the same way as it would be from an Analysis Services database.

When using Report Builder to create reports based on PowerPivot workbook data, the PowerPivot workbook is treated as a regular data connection.

Another important task when sharing your PowerPivot report in SharePoint is to consider how to refresh the data in a PowerPivot workbook stored in SharePoint. You can easily manage the data refresh in the PowerPivot Gallery. The PowerPivot data refresh feature currently does not support refreshing more than once a day (a shame, isn't it), but you can perform an "immediate" refresh by manually initiating a data refresh command. You can also configure SharePoint to send you an email notification after the data has been refreshed. One thing to remember: if you use a linked table in the PowerPivot workbook, it will not get any data refresh. Below is an example page of the data refresh configuration.


6. Conclusion
This tutorial describes the key features of PowerPivot in simple words. Apparently one can look at the Microsoft site or YouTube for more insights or "live action" on these features.

Hopefully this tutorial can be of help to someone looking to start using PowerPivot. Good luck and enjoy the world of self-service Business Intelligence!

Friday, August 10, 2012

A Simple Tutorial on PowerPivot (3)

 4. Analyze with PowerPivot

The key feature that makes Excel an ideal tool for business analysis is the PivotTable and PivotChart capabilities. Now with Excel 2010, the “slicer” capability (as illustrated below) is also added so that the interactivity is further improved.


PowerPivot provides a set of options for creating PivotTables and PivotCharts (as illustrated below). The details of each option are out of scope for this tutorial. It is very similar to plain Excel, with enhancements around slicers and "measures," which can be created with DAX.




One important option is to create a "Flattened PivotTable". One usage of this option is to print out the data. Another is to convert the resulting PivotTable to Excel formulas so that one can apply Excel functions to the result table. The option is illustrated below.



It is important to note that the “Field List” for PivotTable or PivotChart under PowerPivot workbooks is different from what it was for other data in Excel. The following picture is an example. The left one is a typical PivotTable Field List while the right one is what you can see in the PowerPivot Field List.



Another interesting feature in PowerPivot is automatic relationship detection. As illustrated below, PowerPivot actively detects potential relationships among tables and fields.



As the core part of data analysis, defining and using measures is always compulsory. PowerPivot provides full support for this through DAX expressions. Besides the implicit measures, which are created when dropping things into the Values area of the PivotChart or PivotTable design, one can also create explicit measures by explicitly specifying a DAX expression. To begin creating a measure, right-click on a table name in the PowerPivot field list and select Add New Measure (as illustrated below).


Defining DAX measures requires using DAX functions. As illustrated below, measures are defined in a way similar to Excel formulas.

What to be careful with here is that you should not think in "rows" when measure values are defined. A measure must be defined with an aggregation function (such as sum, count, average, etc.) that returns a single value. In that sense, this is a bit different from the definition of a calculated column. This is how it looks after this measure is defined on the PivotTable.


One more exciting feature to mention: Excel 2010 allows users to create "Named Sets" when using PivotTables (illustrated below). This link provides more detail about this feature: http://blogs.office.com/b/microsoft-excel/archive/2009/10/05/pivottable-named-sets-in-excel-2010.aspx
PowerPivot can also work with Named Sets in Excel.


Thursday, August 9, 2012

A Simple Tutorial on PowerPivot (2)

3. Enriching the Data

After importing data, the next natural thing to do is to integrate and enrich the data. There are several ways to do this, described below.

First, it is possible to change the actual data types for imported columns. As illustrated below, after choosing a column in a PowerPivot table, one can choose the “Data Type” option in the ribbon to convert the data types.


Compared to changing the data type of a column, a second approach is to just change the way the data looks in the table. That is done by re-"formatting" the data, which only changes the "view" of the data without converting the underlying data type. Formatting changes in PowerPivot are applied to an entire column. This is illustrated below.



One of the key features of PowerPivot is the ability to create relationships among tables. As illustrated below, it is possible to set up relationships after the PowerPivot tables have been imported. The "Create Relationship" option on the "Design" ribbon of the PowerPivot window lets one set up a relationship between two tables. Due to a limitation of the current version, relationships can only be defined on single columns (as shown below). This means that if you want to establish two distinct relationships between two tables, you must import another instance of the table. It is a bit similar to using SQL to define join statements that use multiple columns of the same table. So users cannot expect to define complicated relationships in PowerPivot. The "Manage Relationships" button shows all the defined relationships across the imported PowerPivot tables.



A very exciting feature in PowerPivot is that one can define "measures" (like in cubes) using DAX formulas. This includes calculated columns (which are materialized into the table) and explicitly defined measures (which stay virtual until called). Below is an example of a calculated column defined in a PowerPivot table. The DAX formula in the Fx textbox looks very similar to an Excel formula, but the main difference is that this definition covers the whole "TotalSales" column instead of a single cell as in Excel.




A little bit more about DAX. DAX in fact has a total of 135 functions to help with data analysis, about 80 of which are also supported by Excel. DAX also provides functions that operate on tables. The DAX functions can be categorized into eight categories: Date and Time, Information, Logical, Math and Trigonometric, Statistical, Text, Filter, and Time Intelligence. As with the math and trigonometric functions, only a subset of Excel's statistical functions is supported by DAX in the current release of PowerPivot. When typing DAX formulas, PowerPivot provides an "autocomplete" feature; press the Tab key to use it.


Like Excel, the PowerPivot window provides "Sorting" and "Filtering" options in the "Sort and Filter" section, and a "Freeze" option in the "View" section, so users can refine the data. This is illustrated below. The column width can also be adjusted in a similar way to Excel.


Another useful feature for improving the view is the "hide" and "unhide" options. As illustrated below, it is possible to hide columns from PowerPivot, from the PivotTable, or from both.






After data is loaded into PowerPivot, the "Refresh" button can be used to retrieve the latest source data. As illustrated below, it is possible to refresh the current table or refresh all tables in PowerPivot. When you click "Refresh", what happens behind the scenes is that PowerPivot sends two commands to the VertiPaq engine:
  • The first command is "Alter", i.e., apply any metadata changes you have made in the PowerPivot window.
  • The second command is "Process", i.e., refresh the data from the external data sources and recompute any calculations that depend on the refreshed data.






Wednesday, August 8, 2012

A Simple Tutorial on PowerPivot (1)

1. Introduction
Using PowerPivot is pretty easy. In general, four steps are involved. 

  • First, you need to get the data into PowerPivot. This is about importing and doing a little preparation on the data. Data can be brought in from relational databases, Analysis Services databases, PowerPivot workbooks that have been published to SharePoint sites, Excel and text files, data feeds (oops, something new from SharePoint), copy-paste actions, linked worksheets, etc.
  • Second, after getting everything into PowerPivot, you need to enrich the data in various ways, such as filtering rows, hiding/unhiding columns, formatting columns (or changing data types), creating calculated columns through DAX definitions, and most important of all, defining relationships among the different PowerPivot tables.
  • Third, with all the data ready to use, it is time to analyze it. Here is where DAX is used to create measures, and more Excel functionality is used to find interesting results.
  • Fourth, and most important of all, create and share your reports. This is about showing and sharing your reports through SharePoint.
This tutorial has four parts where each part introduces one of the above steps.

Let’s start with the first step.  

2. Data Importing and Preparation

To bring data to PowerPivot is similar to bringing data to SQL Server Express. You will have to be able to connect to different data sources. PowerPivot requires database providers and, in some cases, client libraries in order to be able to import data from these sources. In most cases, the database providers and client libraries are not provided by PowerPivot for Excel and must be acquired and installed separately.

To import data, the simplest way is to use the Table Import Wizard in the PowerPivot window (open the PowerPivot window and click "Table Import Wizard"). There are two options when importing data this way:
1. Select from a list of tables or views
2. Specify a SQL query


If you choose "Select from a list of tables or views", the following picture shows the result.



The "Select Related Tables" button is a big help if you know one of the tables (and have selected it) and want to make sure that all the other related tables are included.

For the selected tables, you can use the "Preview & Filter" button to decide whether to include or exclude certain columns or rows from the data. The following picture is an example.



After selecting all the relevant tables and completing the relevant filtering options, click the "Finish" button to start the actual import. The result of the import process looks like the following.



Now, let's take a look at the second way, i.e., "Specify a SQL Query", in the Table Import Wizard. When choosing this option, one can use the "Design" button to open the Query Designer and specify the SQL query. The picture below shows an example of such a query.


In addition, the Table Import Wizard lets you import data from MS Analysis Services (MSAS) and from other PowerPivot workbooks. It is important to note that the result of an MSAS query is typically a "cellset", which is a multi-dimensional object; when the data is imported into PowerPivot, it is flattened into a simple table. So when you import data from MSAS, you won't have the same slicing-and-dicing interactivity as you typically have with BI client tools, such as Excel connected to an Analysis Services Unified Dimensional Model (UDM) cube. Another very interesting feature is that PowerPivot can connect to data in published PowerPivot workbooks on SharePoint sites, which makes enterprise mash-ups a reality. It is worth noting that you cannot import data from PowerPivot workbooks if they are .xlsx files on a local disk drive.

Besides importing tables, PowerPivot now supports "data feeds". A data feed is a technology that provides tabular data that users can subscribe to, much like RSS feeds, in order to receive updated data as the source changes. Data feeds are based on the Open Data Protocol, which uses the Atom Publishing Protocol and its Data Services URI and payload extensions. Data feeds can come from SQL Server Reporting Services, SharePoint lists, and any generic data feed that supports the OData protocol.

Here is an example of choosing to import "From Data Feeds" in PowerPivot. PowerPivot understands the URL description and imports the initial data from the source.


Text files are another important source of data. Just like when importing data with DTS, PowerPivot works with both delimited text files and fixed-length-field text files. Working with fixed-length-field files is a little bit tricky, though. One of the key steps here is to define a schema.ini file (with exactly that name) in the same location as the file. The contents of the schema.ini file are described in the MSDN topic at this link

You can also import from Excel files. This is demonstrated below.



There are also other ways to bring data into PowerPivot. The first is to paste data from the clipboard (well, you need to copy the tabular data first), as demonstrated below.



The second is to create "linked Excel tables", which keep the source and target in sync automatically. The following is an example.


And the resulting linked table is shown in the PowerPivot window with a special icon (the one next to the "DimStore" name in the picture below).


After specifying the import criteria, PowerPivot starts the import and shows progress reports. It is important to note that although PowerPivot can import data from many diverse data sources, the data types supported in PowerPivot are very limited compared to the big variety of data types at the different sources. One thing to be alert to is that when you are importing data with different currency types, PowerPivot imports them as a general currency type ("general" with respect to the regional format defined on the machine where PowerPivot is running). This can definitely lead to unexpected results, but you can use the PowerPivot formatting tools to adjust the columns' data types and formats after the import is completed.


Monday, August 6, 2012

Basics About Zachman framework

Drawing inspiration from the traditional architecture and engineering disciplines, John Zachman started the Zachman framework with its first publication in 1987. The Zachman framework provides a simple (at least it looks simple) but well-structured approach that supports many kinds of architectural presentation using a common vocabulary and common viewpoints.

To be more specific, the Zachman framework describes a model of an enterprise system from six perspectives: Planner, Owner, Designer, Builder, Subcontractor, and the working system. The major feature of the Zachman framework is a matrix with 36 cells (example picture from Wikipedia) indicating all "possible" aspects of an enterprise system.

Here are the rows of the matrix:
Scope, Business Model, System Model, Technology Model, Components, Working system

Here are the columns of the matrix:
Who, when, why, what, how, where

As indicated in the publications about the Zachman framework, the major principles behind it are:

1. "A complete system can be modeled by depicting answers to the following questions: why,
who, what, how, where, and when."
2. "The six perspectives capture all the critical models required for system development."
3. "The constraints for each perspective are additive; those of a lower row are added to those of
the rows above to provide a growing number of restrictions."
4. "The columns represent different abstractions in an effort to reduce the complexity of any
single model that is built."
5. "The columns have no order."
6. "The model in each column must be unique."
7. "Each row represents a unique perspective."
8. "Each cell is unique."
9. "The inherent logic is recursive."

This framework of 36 cells can be used recursively to manage the overall enterprise system and each component inside it. One can build out the 36 cells at different levels in order to keep views of the enterprise system at those levels. Implementing the Zachman framework requires recognizing two critical challenges in current enterprises.

  1. One must begin making the descriptive representations of the enterprise more explicit. This means populating the various cells of the Zachman framework.
  2. One must formalize a system to keep populating the different cells in a disciplined way.

There can be two different approaches to get a company a quick-start on using the Zachman framework.
  1. Start developing the cells with the purpose of developing an enterprise architecture strategy
  2. Start developing row 1 of the matrix in order to establish an implementation road map.

As claimed by different parties, one drawback of this framework is that it does not have as much public, open documentation as TOGAF. However, from my own experience, this can also be a strength of a framework: certain spaces are left for the enterprise itself to work out and define the discipline. For many enterprise architects, having everything available ready-made makes life easier and simpler, while for others, having more room to dig out what is valuable for the enterprise is more fun and self-satisfying.