Sunday, November 3, 2013

Understanding Data Analysis: Part Three

Data analysis is a basic skill for business professionals at any modern enterprise. Most decision-making processes at such organizations rest on a solid analytical foundation before reaching a final resolution. Almost all modern Business Intelligence technologies originate from requirements arising at the data analysis stage. Learning the data analysis process and its concepts goes a long way toward understanding why and how BI technologies have evolved into their current state.

I am writing a series of blog entries on data analysis concepts, processes, and techniques. This is the third part of the series. This blog entry is motivated by, and largely based on, the following reference.

1. Michael Milton: Head First Data Analysis


Part III: Mathematics, Statistics, and Data Mining

A very basic part of every data analyst's professional background is education in mathematics, statistics, and data mining. There is a saying that a pure mathematical education is no longer useful in the real world. In fact, this saying is completely wrong: plenty of mathematical models, methods, and theories are applied directly to real-world scenarios. What may be true is that if one does not pay enough attention to what is taught in the mathematics lectures, one tends to miss the key part of how to use the mathematics in the real world. But everyone is better at blaming the theories than at regretting what they missed in the courses.

Theories cannot talk (or blame); they can only provide value.

One good example is the histogram. The histogram has long been a powerful tool in mathematics and statistics, and it is just as useful in the data analyst's world: it shows how data is distributed across different groupings. Most data analysis involves large groups of numbers, and a histogram is a good way to start the observations when the analysis begins from a blank piece of paper. Both Excel and R provide good support for histograms. Excel's Data Analysis tool can draw them, and R is an equally good choice; R is in fact more effective at diagramming tasks like this, and it is better at finding suitable groupings in the data set.
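As a rough illustration, here is a minimal R sketch of this first step; the variable name order_value and the simulated data are my own assumptions, not anything from the book.

```r
# Minimal sketch: a first-look histogram in R.
# The data is simulated for illustration; in practice the analyst
# would load a real data set, for example with read.csv().
set.seed(42)
order_value <- rlnorm(500, meanlog = 4, sdlog = 0.6)  # hypothetical order values

# hist() chooses reasonable groupings (breaks) on its own, which is part
# of why R is convenient for this kind of first look at a distribution.
hist(order_value,
     breaks = 20,
     main = "Distribution of order values",
     xlab = "Order value",
     col = "lightblue")
```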

Another data analysis scenario where mathematics is used very often is making predictions. Making a prediction is essentially finding the mathematical model that best fits the problem scenario. There will never be a perfect model or equation for every scenario or every object; in the end, it is still a guess. An analyst can always start with the most basic mathematical approach, linear regression. If correlations can be found between certain variables, linear regression is a good way to make predictions. However, not many things in the real world actually follow a linear model, so an important step for the analyst is to draw scatter plots and see whether any mathematical model can represent the trend well. It is also quite possible that some of the data points on the plot can be represented with a linear regression while the others cannot. Just as in any other step of the analysis, the analyst then breaks the problem into pieces: where linear regression works and where another model works. Of course, different non-linear regression models can also be tested in such cases. Back at university, a course called "numerical analysis" covers the basic mathematical models for this scenario.
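A minimal R sketch of this workflow is given below; the two variables, ad_spend and sales, and their simulated linear relationship are assumptions for illustration only.

```r
# Minimal sketch: scatter plot plus simple linear regression in R.
# ad_spend and sales are hypothetical variables with simulated values.
set.seed(1)
ad_spend <- runif(100, min = 0, max = 50)
sales    <- 200 + 12 * ad_spend + rnorm(100, sd = 40)

plot(ad_spend, sales, main = "Sales vs. ad spend")  # eyeball the trend first
fit <- lm(sales ~ ad_spend)                         # fit the linear model
abline(fit, col = "red")                            # overlay the fitted line
summary(fit)                                        # slope, intercept, R-squared

# Use the fitted model to predict sales for a new ad spend value.
predict(fit, newdata = data.frame(ad_spend = 30))
```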

Statistics is also a frequently used tool in a data analyst's daily life. For example, the practice of falsification through hypothesis testing is a very good method for dealing with heterogeneous sources of information, such as rumors heard from various places. Do not underestimate the power of analyzing rumors. In a business world where more than 99% of what you hear is rumor and the rest is surprises (yes, there is no plain truth in the business world), making the right "prediction" and "understanding" of the near-term business trend means the analyst must sit down and work with rumors. Navigating the ocean of lies and rumors, the analyst must learn to draw things on paper and try to link them into different relationships. Variables can be negatively or positively linked in such diagrams, and very often the analyst ends up with a network diagram. This is exactly right: things in the real world are linked through causal networks.
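Such a causal network can also be drawn programmatically. The sketch below uses the igraph package (my own tooling choice, not something from the book), and the variables and signs are made up.

```r
# Minimal sketch: a small signed causal network drawn with the igraph package.
# The variables and the +/- signs are entirely hypothetical.
library(igraph)

edges <- data.frame(
  from = c("rumored merger", "rumored merger", "competitor hiring"),
  to   = c("stock price",    "staff turnover", "our market share"),
  sign = c("+",              "+",              "-")
)

g <- graph_from_data_frame(edges, directed = TRUE)
plot(g, edge.label = edges$sign, vertex.label.cex = 0.8)
```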

The core of hypothesis testing is falsification: the analyst should focus on eliminating disconfirmed hypotheses rather than picking the right one. What do you do when more than one hypothesis is left after you have tried everything to eliminate them? A simple approach is to decide which one has the strongest support given all the available evidence. Evidence is diagnostic if it helps you rank one hypothesis as stronger than another. The analyst needs to look at each hypothesis against each piece of evidence, and against the other hypotheses, to see which has the strongest support. Very often what the analyst needs is a ranking matrix: each piece of evidence is given a score for each hypothesis, and sometimes one piece of evidence carries more weight than the others. How much stronger or weaker is decided by the analyst. This is not rocket science; we are simply giving the statistically strongest vote to one of the alternative hypotheses. One very important thing to remember is that new evidence can appear (and it may be strong enough to change the overall ranking of all the candidate hypotheses). It is the analyst's role to make sure that new evidence is incorporated in time and the latest changes are communicated.
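A minimal sketch of such a ranking matrix in R could look like this; the hypotheses, evidence, and scores are entirely made up for illustration.

```r
# Minimal sketch: an evidence-versus-hypothesis ranking matrix.
# Positive scores support a hypothesis, negative scores count against it;
# the weights themselves are the analyst's judgment.
scores <- matrix(
  c( 2,  1, -1,   # evidence A
     1, -2,  0,   # evidence B
     0,  1,  2),  # evidence C
  nrow = 3, byrow = TRUE,
  dimnames = list(c("evidence A", "evidence B", "evidence C"),
                  c("H1", "H2", "H3"))
)

colSums(scores)                    # total support per hypothesis
names(which.max(colSums(scores)))  # hypothesis with the strongest support so far

# New evidence is just an extra row added with rbind(), after which the
# totals (and possibly the ranking) should be recomputed and communicated.
```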

Besides hypothesis testing, I would recommend learning about Bayesian statistics and the practice of subjective probabilities. Subjective probability is a very practical tool in the real world. When an analyst is not sure about a judgement or a conclusion (which is very often), she or he tends to use words like "probably," "perhaps," and so on. That is the moment to consider using subjective probability: ask for a percentage instead of these words. For example, "90% probability" is a subjective probability.

With a set of subjective probabilities collected from a group of analysts, it becomes easy to apply all kinds of statistical methods to further analyze the scenario. One good tool is the standard deviation, which measures how far typical points are from the average (or mean) of the data set. Given the probabilities collected from different analysts, the standard deviation tells how much the opinions differ.
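A minimal R sketch, with made-up probabilities, could look like this:

```r
# Minimal sketch: summarizing subjective probabilities from several analysts.
# The values are made-up examples, e.g. answers to
# "how likely is it that the deal closes this quarter?"
probs <- c(0.90, 0.75, 0.80, 0.95, 0.60)

mean(probs)  # the group's average belief
sd(probs)    # spread of opinions: larger means the analysts disagree more
```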

There are many other powerful tools and theories in the statistical world for every data analyst.

Data mining builds on these mathematical and statistical theories and methods. Computer scientists tend to "computerize" the theories, turning them into computing models that are ready for real-world use. Details about data mining will be covered in future blog entries (where I will concentrate on each data mining method). Below is a list of typical data mining tools and theories used in the data analysis world; a small clustering example follows the list.

  • Bayesian Network
  • Decision Tree
  • Neural Network
  • Support-vector Machines
  • K-nearest Neighbors
  • Clustering
  • Association Rules
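
As promised above, here is a minimal R sketch of one item from the list, clustering with k-means. It uses R's built-in iris data purely as a stand-in for a real business data set, and the choice of three clusters is an assumption.

```r
# Minimal sketch: k-means clustering on R's built-in iris data.
# In a real analysis the features would come from a business data set.
set.seed(7)
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

km <- kmeans(features, centers = 3, nstart = 20)

table(km$cluster)  # how many observations landed in each cluster
plot(features$Petal.Length, features$Petal.Width,
     col  = km$cluster,
     main = "k-means clusters on the iris data")
```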
Having a background in mathematics, statistics, and data mining has always been a career foundation for data analysts. The techniques and methods mentioned in this blog entry are only a small portion of the whole ocean of the mathematical world. The important thing for a data analyst is to keep learning new tools and to try to apply them to real-world scenarios. In the end, it is the business value that decides the success of all techniques, methods, and theories.
