Data analysis is a basic skill for any business person in a modern enterprise. Most decision-making processes in such organizations rest on a solid analytical foundation before they reach a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during the data analysis process. Learning the data analysis process and its concepts is very helpful for understanding why and how BI technologies have evolved into their current state.
I have a series of blog entries on data analysis concepts, processes, and techniques. This is the second part of the series. This blog entry is motivated by, and largely based on, the following reference.
1. Michael Milton: Head First Data Analysis
Part II: Test, Optimize and Visualize
While following the four steps described in my previous entry, every analyst faces the problem of defending her/his decision input against the hard reality of real-world results. All analytical models, decisions, and recommendations have to survive examination once they are deployed and used in the real world. Time proves the truth.
But what about fine-tuning and adaptive adjustment of a defined analytical (or mental) model? The world evolves (if not revolutionizes), and so do the mental models. How can analysts be given the opportunity to adjust and improve their models?
Test, test, and test. Yes, test your models. And by "test" I do not mean just simulations and paper-based examples (built from empirical data). You must test your model in the real world and run the experiments in a real scenario.
In many real-world scenarios, testing is a sensible way to solve the problem in a sustainable manner. A well-planned and well-executed experiment provides powerful proof of the analyst's judgement. How should the experiment be carried out? The most basic element of experimenting is the survey. Running surveys before and after a planned test activity gives the analyst more insight into how effective the activity has been.
One good example in Michael's book is to use a few local coffee shops to test analytical models that could later contribute to a nationwide deployment across all the coffee shops. Before running the experiment, the analyst must define which factors to observe and compare, and baseline data must be prepared so the results can be measured against it. A set of control groups, in contrast to the experimental groups, must be established to make the comparison meaningful. Selecting the control groups requires great care: the analyst must make sure that other confounders do not blend into the experiment, and that the control groups are "equally" matched on those confounders. Once the groups are defined, the remaining steps are simple and direct: execute the experiment, collect the results for comparison, and the conclusion will be clear.
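To make the before/after comparison between a control group and an experimental group concrete, here is a minimal sketch in Python. The shop figures are invented purely for illustration; the idea is simply to compare the change in the experimental group against the change in the control group (a basic difference-in-differences).

```python
# Hypothetical average weekly sales before and after the test activity.
control = {"before": 1000, "after": 1020}        # shops that kept the old setup
experimental = {"before": 990, "after": 1150}    # shops that ran the new model

# Raw change within each group.
control_change = control["after"] - control["before"]
experimental_change = experimental["after"] - experimental["before"]

# The effect attributable to the test activity is the difference between
# the two changes, since the control group absorbs background trends.
effect = experimental_change - control_change

print(f"Control change:       {control_change}")
print(f"Experimental change:  {experimental_change}")
print(f"Estimated effect:     {effect}")
```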
What is a confounder, precisely? A confounder is a difference among the groups under study, other than the factor the analyst is actually testing, that could itself explain the observed result. If the groups are not matched on such a factor, the comparison is no longer meaningful. Confounders can be factors such as location, age group, gender, and so on.
Doing data analysis with identified confounders is not very difficult: the analyst just needs to use the confounders to break the data into smaller chunks and compare the results within each chunk. Identifying confounders from a given dataset, however, is not easy, and analysts typically rely on mathematical or statistical tools to work them out.
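A small sketch of what "breaking the data into smaller chunks" can look like in practice, here with location as the suspected confounder. The data frame and its numbers are made up for illustration; the point is that each comparison happens only within its own stratum.

```python
import pandas as pd

# Hypothetical per-shop results, tagged with group and a suspected confounder.
df = pd.DataFrame({
    "location": ["urban", "urban", "rural", "rural", "urban", "rural"],
    "group":    ["experimental", "control", "experimental", "control",
                 "experimental", "control"],
    "sales":    [120, 100, 80, 78, 130, 75],
})

# Average sales per group, computed separately within each location stratum.
by_stratum = df.groupby(["location", "group"])["sales"].mean().unstack()

# The lift is now compared only between like-for-like locations.
by_stratum["lift"] = by_stratum["experimental"] - by_stratum["control"]
print(by_stratum)
```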
A very typical method, when the analytical model or the scenario for analysis gets complicated, is to define a mathematical or statistical model and optimize it. One thing to be clear about: no mathematical model can be made so perfect that every real-world factor is modelled, represented and "mathematicalized". One always has to drop the factors that are too complicated and focus on the "main thing."
The way an analyst approaches such an optimization is to define a mathematical model of the problem scenario and then solve it with classical mathematical methods. When the results turn out not to be optimal, the analyst must find out where the model relied on a wrong assumption, whether something should have been included in the model, or whether the right mathematical model was used at all. There is no perfect model for all scenarios; the analyst's role is to choose the right one for the specific scenario.
Optimization can be done when it is possible to represent the main factors in a mathematical model. There are many data analysis tools (software packages) that can help a data analyst find the optimal result. The Solver add-in in Excel is a good example, and I should also list R, SAS, SPSS, Mathematica, Matlab and so on.
To solve an optimization problem, the analyst must define an objective function, identify all constants and variables, and search for further constraints on the identified variables. In many cases it brings extra benefit if the objective function, together with the constraints, can be visualized in two- or three-dimensional space. Visualization gives the best chance to verify the analysis and estimate the optimal result.
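As a sketch of the kind of problem Excel's Solver is typically used for, here is a small linear program in Python with SciPy. The objective (profit per product) and the two resource constraints are hypothetical numbers chosen only to show the mechanics of objective function, variables, and constraints.

```python
from scipy.optimize import linprog

# Objective: maximize 3*x1 + 5*x2 (linprog minimizes, so negate the coefficients).
c = [-3, -5]

# Constraints (hypothetical resource limits):
#   2*x1 + 1*x2 <= 100   (labor hours)
#   1*x1 + 3*x2 <= 90    (material units)
A_ub = [[2, 1],
        [1, 3]]
b_ub = [100, 90]

# Both quantities must be non-negative.
result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, None), (0, None)], method="highs")

print("Optimal quantities:", result.x)
print("Maximum profit:", -result.fun)
```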
In fact, visualization has long been broadly used in the scientific (or academic) world to understand observations and facilitate key decisions on research directions. The business world (and the BI industry) has been following this trend in recent years. Good examples are the popularity of visual analytics tools such as SAS Visual Analytics, Tableau, and QlikView.
Data visualization means showing your data with diagrams instead of tables of numbers. Visualization is important and extremely useful in most scenarios. It is equally fair to say that finding the right visualization is vital, as it decides how the clients receive and accept the observations and conclusions.
Using visualizations does not mean that the analyst needs to make fancy or beautiful diagrams. In fact, the visualization should be clear, simple and straightforward.
How do you find the best visualization to pinpoint the analysis? A recommended way is to use scatterplots. Start by choosing X and Y dimensions and plotting the data points. Very often the analyst will find interesting patterns and be able to define new dimensions. A standard analysis uses an independent variable as the X axis and a dependent variable as the Y axis. Of course, there will often be more than one dependent and independent variable, so the analysis report normally includes diagrams with different combinations of variables, which help identify patterns in the data points.
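A minimal scatterplot sketch along those lines, with an independent variable on X and a dependent variable on Y. The small dataset (ad spend versus weekly sales) is invented just to show the mechanics.

```python
import matplotlib.pyplot as plt

# Hypothetical data: independent variable (X) and dependent variable (Y).
ad_spend = [5, 10, 15, 20, 25, 30]        # thousands
weekly_sales = [52, 60, 64, 71, 75, 83]   # thousands

plt.scatter(ad_spend, weekly_sales)
plt.xlabel("Ad spend (thousands)")
plt.ylabel("Weekly sales (thousands)")
plt.title("Weekly sales vs. ad spend")
plt.show()
```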
There exist many excellent data visualization tools. Besides Excel, Tableau and other commercial tools, open source tools such as R can be a very good choice.