Sunday, November 3, 2013

Understanding Data Analysis: Part Three

Data analysis is a basic skill for any business professional at any modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before they reach a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during the data analysis steps. Learning the data analysis process and concepts is very helpful for understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the third part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis


Part III: Mathematics, Statistics, and Data Mining

One very basic professional foundation for every data analyst is knowledge and education in mathematics, statistics, and data mining. There is a saying that a pure mathematical education is no longer useful in the real world. In fact, this "saying" is completely wrong. Lots of mathematical models, methods, and theories are applied directly to real-world scenarios. What may be true is that if one did not pay enough attention to what was taught in the mathematics lectures, one tends to lose the key parts of how to use that mathematics in the real world. But everyone is better at blaming the theories than at regretting what was missed in the courses.

Theories cannot talk (or blame); they only give value.

One good example is the histogram. The histogram has long been a powerful tool in the mathematical and statistical world, and it is just as useful in the data analyst's world. A histogram is a powerful way to understand how data is distributed across different groupings. Most data analysis involves large groups of numbers, and a histogram is a good place to start the observations when the analysis begins from a blank piece of paper. Both Excel and R provide good support for histograms. Excel's Data Analysis add-in can help to draw histograms. Using R is also a good choice, and R is in fact more effective for diagramming tasks like this; it is also better at finding sensible groupings in the data set.
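
As a quick illustration, here is a minimal sketch in R using invented order values (the numbers and the variable name are made up for this example); base R's hist() does the grouping and drawing in one call:

  # Minimal sketch: histogram of invented order values
  set.seed(42)
  order_values <- round(rlnorm(500, meanlog = 4, sdlog = 0.6))  # skewed, like many business measures

  # 'breaks' controls how finely the data is grouped into bins
  hist(order_values,
       breaks = 20,
       main = "Distribution of order values",
       xlab = "Order value",
       col = "lightgray")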

Another data analysis scenario where mathematics is often used is making predictions. Making predictions is essentially about finding the best mathematical model for the problem scenario. There will never be a perfect model or equation for every scenario or every object; in the end, it is still a guess. An analyst can always start with the most basic mathematical approach, linear regression. If correlations can be found between certain variables, linear regression is a good way to make predictions. However, not many things in the real world actually follow a linear model. So an important step for the analyst is to draw scatter plots and see whether any mathematical model can efficiently represent the trend. It is also quite possible that some of the data points on the plots can be represented with a linear regression while the others cannot. Just as in the normal analysis steps, the analyst needs to break the problem into the pieces where linear regression works and the pieces where other models work. Of course, different non-linear regression models can also be tested in such cases. Back at university, a very good course called "numerical analysis" presents the basic mathematical models for this scenario.
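
A minimal sketch in R, again with invented data (the variables ad_spend and sales are hypothetical), shows the usual flow: plot first, fit a linear model with lm(), then use it to predict:

  # Minimal sketch: scatter plot, linear regression, and a prediction on invented data
  set.seed(1)
  ad_spend <- runif(60, min = 10, max = 100)            # hypothetical advertising spend
  sales    <- 40 + 3.5 * ad_spend + rnorm(60, sd = 25)  # roughly linear relation plus noise

  plot(ad_spend, sales, xlab = "Ad spend", ylab = "Sales")  # look at the trend first
  model <- lm(sales ~ ad_spend)                             # fit the linear model
  abline(model, col = "red")                                # overlay the fitted line
  summary(model)                                            # slope, intercept, R-squared
  predict(model, newdata = data.frame(ad_spend = 120))      # predict for a new value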

Statistics is also a frequently used tool in a data analyst's daily life. For example, the practice of falsification, i.e. hypothesis testing, is a very good method for dealing with heterogeneous sources of information, such as rumors that you have heard from various places. Do not underestimate the power of analyzing rumors. In the business world, where more than 99% of what you hear is rumor and the rest is only surprises (yes, there is no plain truth in the business world), making the right "prediction" and "understanding" of the near-future business trend means that the analyst must sit down and work with rumors. Navigating the ocean of lies and rumors, the analyst must learn to draw things on paper and try to link them into different relationships. Different variables can be negatively or positively linked in such diagrams. Very often the analyst ends up with a network diagram after this session. This is absolutely right: things in the real world are linked through causal networks.

The core part of hypothesis testing is falsification. The analyst should focus on eliminating the disconfirmed hypotheses rather than picking the right one. What should you do when more than one hypothesis is left after you have tried everything to eliminate them? A simple approach is to decide which one has the strongest support based on all the available evidence. Evidence is diagnostic if it helps you rank one hypothesis as stronger than another. The analyst needs to look at each hypothesis in comparison to each piece of evidence and to the other hypotheses, and see which has the strongest support. Very often, what the analyst needs is a ranking matrix: each piece of evidence is given a score for each hypothesis, and some pieces of evidence can carry stronger scores than others. How much stronger or weaker is decided by the analyst. This is not rocket science; we are just giving a statistically strongest vote to one of the alternative hypotheses. One very important thing to remember is that new evidence can appear (and it may be strong enough to change the overall ranking of all the candidate hypotheses). It is the analyst's role to make sure that new evidence is included in time and that the latest changes are communicated.
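
Such a ranking matrix is easy to keep in a spreadsheet or in R. Below is a minimal sketch with invented scores (the evidence, hypotheses and weights are made up purely for illustration):

  # Minimal sketch: ranking matrix, rows = evidence, columns = hypotheses
  # +2 strongly supports, 0 is neutral, -2 strongly disconfirms; the weights are the analyst's call
  scores <- matrix(c( 2, -1,  0,
                      1,  2, -2,
                      2, -1,  1),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(paste("Evidence", 1:3),
                                   c("Hypothesis A", "Hypothesis B", "Hypothesis C")))
  scores
  colSums(scores)                    # total support per hypothesis
  names(which.max(colSums(scores)))  # the hypothesis with the strongest overall support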

Besides hypothesis testing, I would recommend learning about Bayesian statistics and the practice of subjective probabilities. In fact, subjective probability is a very practical tool in the real world. When an analyst is not sure about a judgement or a conclusion (which is very often), she or he normally tends to use words like "probably," "perhaps" and so on. This is the time to consider using a subjective probability, that is, to ask for a percentage number instead of these words. For example, "90% probability" is a subjective probability.

With a set of subjective probabilities collected from a group of analysts, it is very easy to use all kinds of statistical methods to further analyze the scenario. One good tool is the standard deviation. The standard deviation measures how far typical points are from the average (or mean) of the data set. Given the set of probabilities collected from different analysts, the standard deviation tells how different the opinions are.
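
A minimal sketch in R, with five invented probability estimates, is enough to show the idea:

  # Minimal sketch: subjective probabilities (in %) collected from five analysts, invented numbers
  probs <- c(90, 75, 80, 60, 85)

  mean(probs)  # the group's average belief
  sd(probs)    # how far apart the opinions are (sample standard deviation)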

There are many other powerful tools and theories in the statistical world for every data analyst.

Data mining is based on various mathematical and statistical theories and methods. Computer scientists tend to "computerize" the theories and turn them into computing models that are ready for real-world usage. Details about data mining will be posted in future blogs (where I will concentrate on each data mining method). Here is a list of typical data mining tools and theories used in the data analysis world, followed by a small clustering sketch as a taste of what these methods look like in practice.

  • Bayesian Network
  • Decision Tree
  • Neural Network
  • Support-vector Machines
  • K-nearest Neighbors
  • Clustering
  • Association Rules
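
As a small taste of the list above (details will follow in future posts), here is a minimal clustering sketch in R on invented customer data; the column names and numbers are made up for illustration, and k-means is just one of many clustering algorithms:

  # Minimal sketch: k-means clustering on invented customer data (hypothetical columns)
  set.seed(7)
  customers <- data.frame(
    annual_spend = c(rnorm(30, mean = 200, sd = 40), rnorm(30, mean = 900, sd = 120)),
    visits       = c(rnorm(30, mean = 4,   sd = 1),  rnorm(30, mean = 15,  sd = 3))
  )

  fit <- kmeans(scale(customers), centers = 2)  # scale first so both variables count equally
  table(fit$cluster)                            # size of each discovered segment
  plot(customers, col = fit$cluster, pch = 19,  # visualize the segments
       main = "Customer segments found by k-means")
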
Having a background in mathematics, statistics and data mining has always been a career foundation for all data analysts. The techniques and methods mentioned in this blog entry are only a small portion of the whole ocean of the mathematical world. The important thing for a data analyst is to keep learning new tools and to try to apply them to real-world scenarios. In the end, it is the business value that decides the success of all techniques, methods and theories.

Sunday, October 27, 2013

Understanding Data Analysis: Part Two

Data analysis is a basic skill for any business professional at any modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before they reach a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during the data analysis steps. Learning the data analysis process and concepts is very helpful for understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the second part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis

 

Part II: Test, Optimize and Visualize

While following the four steps described in my previous entry, every analyst faces the problem of defending her/his decision input against the hard, bloody reality of real-world results. All analytical models, decisions and recommendations have to survive examination once they are deployed and used in the real world. Time proves the truth.

But what about fine-tuning and adaptively adjusting a defined analytical (or mental) model? The world evolves (if not revolutionizes itself), and so do the mental models. How do we give analysts opportunities to adjust the models and improve them?

Test, test, and test. Yes, test your models. And by "test" I do not mean just simulations and paper-based data and examples (built from empirical data). You must test your model in the real world and run the experiments in a real scenario.

In many real-world scenarios, testing makes very good sense for solving the problem in a sustainable manner. A well-planned and well-executed experiment gives powerful proof of the analyst's judgements. How do you carry out the experiment? The most basic element of experimenting is the survey. Running surveys before and after a planned test activity lets the analyst gain more insight into how effective the test activity has been.

One good example in Michael's book is to use a few local coffee shops to experiment with analytical models that can contribute to a nationwide deployment across all the coffee shops. Before running the experiments, the analyst must define which factors to observe and compare, and baseline data must be prepared so that the results can be compared. A set of control groups, in contrast to the experimental groups, must be established to make the comparison possible. Selecting the control groups requires the analyst to be very careful: the analyst must make sure that other confounders are not blending into the experiment and that all the control groups are "equally" the same on the confounders. After the groups are defined, the rest of the steps are simple and direct. One just needs to execute the experiments and collect results for comparison, and the final conclusion will be clear.

What is a confounder, precisely? A confounder is a difference within the target group of the analyst's research, other than the factor being studied, that affects the results and can therefore distort the comparison if it is not accounted for. Confounders can be factors such as location, age group, gender, and so on.

Doing data analysis with identified confounders is not very difficult: the analyst just needs to use the confounders to break the data out into smaller chunks and test the result within each chunk. Identifying confounders from a given set of data, however, is not easy, and analysts typically use a set of mathematical or statistical tools to work it out.
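
A minimal sketch in R with invented coffee-shop-style data (the variable names and numbers are hypothetical) shows the break-out: group by the confounder and compare control and experimental groups within each chunk:

  # Minimal sketch: break results out by a confounder (location) and compare groups within each chunk
  set.seed(3)
  group    <- rep(c("control", "experimental"), times = 20)
  location <- rep(c("downtown", "suburb"), each = 20)        # the confounder
  revenue  <- rnorm(40, mean = 500, sd = 50) + ifelse(group == "experimental", 60, 0)
  sales    <- data.frame(location, group, revenue)

  # Average revenue per group within each location chunk
  aggregate(revenue ~ group + location, data = sales, FUN = mean)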

A very typical method, when the analytical model or the scenario for analysis gets complicated, is to define a mathematical or statistical model and to optimize it. One thing to make clear is that no mathematical model can be made so perfect that all real-world factors are modelled, represented and "mathematized". One always has to drop the factors that are too complicated and focus on the "main thing."

The way an analyst approaches optimization is to define a mathematical model for the problem scenario and then solve it with classical mathematical methods. When the results turn out not to be optimal, the analyst must find out where the model rests on a wrong assumption, whether anything should have been included in the model, or whether it was the right mathematical model to use at all. There is no perfect model for all scenarios; the analyst's role is to choose the right one for the specific scenario.

Optimization can be done when it is possible to represent the main factors in a mathematical model. There are many data analysis tools (software packages) that can help a data analyst find the optimal result. The Solver add-in in Excel is a good example, and I should also list R, SAS, SPSS, Mathematica, MATLAB and so on.

To solve an optimization problem, the analyst must define an objective function, identify all constants and variables, and search for the constraints related to the identified variables. In many cases, it brings extra benefit if the objective function, together with the constraints, can be visualized in a two- or three-dimensional space. Visualization gives the best chance to verify the analysis and estimate the optimal result.
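
As a toy illustration, here is a minimal sketch in R that assumes an invented linear demand curve and finds the price that maximizes profit with base R's optimize(); Excel's Solver plays a similar role for more complex models:

  # Minimal sketch: maximize profit over price, assuming an invented demand model
  profit <- function(price) {
    units_sold <- 1000 - 8 * price   # assumed demand curve: fewer units sold as price rises
    (price - 20) * units_sold        # margin per unit (unit cost 20) times units sold
  }

  # Constraint on the variable: price must stay between cost (20) and where demand hits zero (125)
  optimize(profit, interval = c(20, 125), maximum = TRUE)
  # $maximum is the best price, $objective the profit it yields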

In fact, visualization has been broadly used in the scientific (or academic) world to understand observations and facilitate key decisions on research directions. The business world (and the BI industry) has been following this trend in recent years. Good examples are the popularity of visual analytics tools such as SAS Visual Analytics, Tableau, and QlikView.

Data visualization means showing your data with diagrams instead of tables of numbers. Visualization is important and extremely useful in most scenarios. It is equally true that finding the right visualization is vital and decides how the clients receive and accept the observations and conclusions.

Using visualizations does not mean that the analyst needs to make fancy or beautiful diagrams. In fact, the visualization should be clear, simple and straightforward.

How do you find the best visualization to pinpoint the analysis? A recommended way is to use scatterplots. Just start by choosing an X and a Y dimension and begin drawing the data points. Very often the analyst will find interesting patterns and be able to define new dimensions. A standard analysis would put independent variables on the X axis and dependent variables on the Y axis. Of course there will often be more than one dependent and independent variable, so the analysis report normally involves diagrams with different combinations of variables, which can help to identify certain patterns in the data points.
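
A minimal sketch in R, using the built-in mtcars data set, shows both a single X-versus-Y scatterplot and a matrix of pairwise combinations to scan for patterns:

  # Minimal sketch: scatterplots on the built-in mtcars data set
  # One independent (X) vs. one dependent (Y) variable
  plot(mtcars$wt, mtcars$mpg,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
       main = "Heavier cars tend to use more fuel")

  # All pairwise combinations of a few variables, to scan for patterns worth a closer look
  pairs(mtcars[, c("mpg", "wt", "hp", "disp")])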

There are many excellent data visualization tools. Besides Excel, Tableau and other commercial tools, open source tools such as R can be a very good choice.

 

Sunday, July 28, 2013

Understanding Data Analysis: Part One

Data analysis is a basic skill for any business professional at any modern enterprise. Most decision-making processes at such organizations rest on a solid analytical background before they reach a final resolution. Almost all modern Business Intelligence technologies originate from requirements that arise during the data analysis steps. Learning the data analysis process and concepts is very helpful for understanding why and how BI technologies have evolved into their current state.

I have a series of blog entries on data analysis concepts, processes, and techniques. This is the first part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis

 

Part I: The Basic Steps

Making decisions in a management position can be either easy or difficult. Using intuition and always betting on one's luck has been a tradition for many "cowboy" style managers. To stay successful and keep to the right career path, many managers choose more disciplined approaches and make decisions based on rational, serious analysis. This is where the era of data analysis begins.

No matter what analysis it is, the basic steps of any analysis are to define the problem, break it down, make observations and partial solutions, assemble the results and draw the final conclusions. Data analysis is about applying these steps to a large set of data at hand.

The information age brings massive amounts of data to every business entity. Data means value and opportunities if you know how to understand and use it. In many situations we find ourselves working as data analysts without knowing the basic techniques and practices of the role. A data analyst knows how to break down a large set of data and give it a structural understanding so that the data becomes intelligence and insights. With a powerful toolbox, a data analyst transforms data into knowledge, which often pushes forward the decisions to be made. Thus, the basic skill, or instinct, of a data analyst is the capability to understand and structure data.

Given a scenario or just a piece of data, the typical steps that every data analyst will go through can be summarized below.

1. Define the problem

This is the very first step. Knowing your problem is the basic instinct of a data analyst. All analysis is targeted at a goal, that is, the question that should be answered or the problem that should be solved. Without defining the question or the problem, the analysis will lead nowhere. How do you define the problem? Just ask! You should of course ask the person who will make decisions based on your analysis. Sometimes it can just be yourself, if you are the one making the decision. Don't just ask for the final goal of the problem; ask what the decision maker actually means.

In many scenarios, the decision maker needs help from a data analyst because she/he

- Does not know the data

- Knows something about the data but is not totally sure about it

- Is not sure about the problem

- Is not sure about the scenario of the problem

- Is not good at making decisions

- Uses more intuition than analytics

The data analyst should start by clearly defining the problem with her/his clients. It may be that the client brings more problems or requirements after this step, which is totally fine. Changes to requirements may bring a little administrative overhead, but they also help to narrow down the problem in many cases. A lot of the time, helping the client identify her/his problem is the very first and most important task for the analyst. Many clients are not even aware of the problems that they are facing.

There is, of course, a method or concept called "exploratory data analysis." It means that the analyst must find valuable hypotheses in the data that are worth evaluating, which goes beyond starting from a concrete problem.

2. Disassemble the scenario

To find a solution to a problem scenario, a very typical approach is to divide the problem and scenario into small pieces and analyze them. This step requires the data analyst to break the problem down into smaller pieces in order to start the detailed analysis. It is important to keep breaking it down to the level that best fits the analysis.

To break down the problem, one approach is to create a kind of tree structure, starting with the problem defined in the previous step and extending to different sub-nodes until you reach the leaf level.

This also applies to the data. The data analyst will need more detailed data if the first step only provides a general level summary of the data. More data brings more room for analysis and testing of conclusions.

What is the most important technique to use when looking at data? Making comparisons! In fact, making good comparisons is at the core of data analysis. A data analyst will always find it useful to break down the data by finding good and interesting comparisons.

As the product of this step, I can imagine the data analyst ending up with a document of two sections. The first is the break-down structure of the problem. The second is a set of rules on how to divide and slice the data.

As the saying goes, you cannot move the whole mountain at once, but you can move it piece by piece.

3. Evaluate your conclusion

This step evaluates all the observations from the previous steps. The key to evaluation is comparison. With all the problems on the left-hand side and all the observations on the right-hand side: compare, compare and compare!

A compulsory part of this step is for the analyst to insert her/him-self into the process. This means the analyst brings her/his responsibility on board and focuses on getting the best value for the client as well as for the analyst her/him-self. While the analyst is indeed betting on her/his credibility, it also becomes more likely that the analyst will earn more trust and belief from the clients.

I'd like to quote this paragraph from the book Head First Data Analysis.

"Whether you’re building complex models or making simple decisions, data analysis is all about you: your beliefs, your judgement, your credibility"

4. Make the decision

At this step, the analyst summarizes the conclusions and gives the final decision or recommendation. All the analysis results must be put into a format that makes decision-making easy. Otherwise, the analysis was useless.

The analyst must make it simple and straightforward for the decision-makers. It is important to get the voice heard (that is, to make it understandable) so that people can make sensible decisions based on the input from the analyst.

One more thing to add is about the analysis report. The analysis report can be as simple as three sections: "background," "understanding of the data," and "recommendation." And again, it must be simple, concise and direct.

 

With all the steps of data analysis in place, will we be ready for all kinds of challenges? Not exactly. There is quite a list of things to get ready for. One of the first questions will be "What do I do when my analysis turns out wrong or incomplete?"

Well, not everything can be so easy and safe. In many cases, the initial data analysis report gets bad feedback and results. What could go wrong then?

The first thing to consider is the assumptions made in the previous analysis. Assumptions are usually based on the mental models we have built of the scenario. A mental model is how we normally see the world around us; it helps us to understand and interpret the information around us. It is like a model in mathematics, where one builds on a set of basic elements and describes the world with them. Very often, the analyst's mental model is affected by the mental model of the clients. So an important lesson for the analyst is to make all the mental models, or at least their important parts, explicit in the analysis.

Every part of the analysis, such as the statistical model, the data model, etc., always depends on the analyst's mental model. When the mental model is wrong, all steps of the analysis must be run again to produce a new round of results.

To build the right mental model, the analyst must include both the things that are taken as assumptions and the things that are unknown or uncertain. The "anti-assumptions" are very often more important and can lead to new discoveries and insights. For example, consider making a list of "things I do not know," "things that I do not know how to do," "things that I have not done," etc. This must be applied to the analysis process, and the analyst needs to challenge the client on many aspects of the client's mental model at the "disassemble" step of the analysis.

And of course, the analyst must ask for new and more detailed data for the new round of analysis. Results from the previous analysis can shed light on where and how to ask for the new data. A very typical mistake made by many analysts is to focus too much on the numerical values (also called "measures") in the data. In fact, the row or column header information is even more important to look at. In many cases, how you drill into and mesh around the data decides what the data looks like in the end.

Analysis never stops. As long as the analyst keeps finding new clues or observations, she/he can continue the four steps forever.