Before we can perform statistical analysis on our data, we have to check the data. Do we assume that the data is normally distributed? Do we assume that each feature (attribute) is independent? Do we assume that data comes from the same population, under the same conditions? No, we don’t. We must check (explore) the data before we build any models. How do we do that? We look at three key attributes of the data:
- Independence
- Normality
- Homogeneity of Variance
Independence
In both statistics and relational database design, independence means that each feature or attribute does not depend on any of the other features or attributes. As a simple example, suppose we had a database with the attributes Subtotal, Tax, and Grand Total. Tax and Grand Total both depend on Subtotal: Tax is determined by the tax rate and the value in Subtotal, and Grand Total is just Subtotal and Tax added together. Storing only Subtotal and Tax Rate would be enough, and those two do not depend on each other.
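To make the dependency concrete, here is a minimal sketch in base R. The data frame `orders` and its values are made up for illustration. The only point is that Tax and Grand Total can be recomputed from Subtotal and Tax Rate, so storing them adds no new information.

```r
# Hypothetical order data: only the two independent attributes are stored.
orders <- data.frame(
  subtotal = c(20, 50, 100),
  tax_rate = c(0.08, 0.10, 0.05)
)

# Tax and Grand Total are derived on demand, so they can never disagree
# with the values they depend on.
orders$tax <- orders$subtotal * orders$tax_rate
orders$grand_total <- orders$subtotal + orders$tax

print(orders)
```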
Normality
This one is more difficult to understand. Basically, in a random sample, the distribution of values should look like a bell (or hill), with more values in the middle (at the top) and fewer on the two ends. If it looks this way, we can run parametric statistical tests. If it is skewed (has more values on one end or the other) or has kurtosis (too many or too few values in the middle of the bell), then parametric tests cannot be used unless the data is corrected.
The first thing to do is look at the shape of the data: does it have that familiar bell shape? We can use a histogram to see this (with ggplot2 in R):
    myhistogram <- ggplot(mydata, aes(mycol)) + geom_histogram(aes(y = ..density..))
This will give us the shape of the data (there is loads more you can add here, like colors, labels, etc.; newer versions of ggplot2 prefer after_stat(density) over ..density..). If the data is normally distributed, the histogram should be roughly symmetric and bell-shaped. If it is positively skewed, most of the values pile up on the left and a long tail stretches out to the right.
We can also use a Q-Q plot (quantile-quantile plot) to view the data. This function will rank and sort the data and plot it against the values expected from a normal distribution:
    myQQplot <- qplot(sample = mydata$myattribute, stat = "qq")
If the data is normal, the points fall close to a straight diagonal line; if it is not, they curve away from that line.
These are nice visual ways of checking for normality of data. However, they can be subjective, because in some cases it is hard to tell whether the shape is a little too skewed or has a little too much kurtosis.
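Since the plots are easier to judge with something to compare against, here is a small sketch that draws both kinds of plot for a simulated normal sample and a simulated positively skewed sample, side by side. The data frame `samples`, its column names, and the seed are assumptions made up for this illustration, and the code assumes ggplot2 3.3.0 or later (for `after_stat()`).

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the simulated samples are reproducible

# Two made-up samples: one drawn from a normal distribution and one
# positively skewed sample drawn from an exponential distribution.
samples <- data.frame(
  value = c(rnorm(500, mean = 50, sd = 10), rexp(500, rate = 1 / 50)),
  shape = rep(c("normal", "positively skewed"), each = 500)
)

# Histogram on the density scale, one panel per sample.
hist_plot <- ggplot(samples, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  facet_wrap(~ shape, scales = "free")

# Q-Q plot with a reference line; points that hug the line suggest normality.
qq_plot <- ggplot(samples, aes(sample = value)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ shape, scales = "free")

print(hist_plot)
print(qq_plot)
```

In the skewed panel the histogram leans to the left with a long right tail, and the Q-Q points bend away from the reference line at the upper end.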
So we move on to a function that outputs values that let you check whether the data has a normal shape (this function is in the pastecs package in R):
    mystat <- stat.desc(mydata$myattribute, basic = FALSE, norm = TRUE)
stat.desc describes the data in statistical terms. We do not need the basic counts and ranges here, so basic = FALSE, and we want to know about the distribution of values, so norm = TRUE. When this is run, it will output the median, mean, standard error of the mean, confidence interval of the mean, variance, standard deviation, and seven other values. However, the ones we want to look at are the ones labeled skew.2SE and kurt.2SE. These are the skewness and the kurtosis, each divided by twice its standard error. If either one is over 1, ignoring the +/- signs (that is, >1 or <-1), the skewness or kurtosis is significantly different from zero and the data is probably not normal.
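To see where the two ratios come from, here is a rough base R sketch that computes them by hand. The function name `two_se_ratios` is made up, and the formulas are my understanding of how the `.2SE` values are defined, so treat them as an assumption and use stat.desc itself for real work.

```r
# Returns skewness and excess kurtosis divided by twice their standard errors.
# The formulas are an assumption about how the .2SE values are defined.
two_se_ratios <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  dev <- x - mean(x)

  skew <- sum(dev^3) / (n * sd(x)^3)
  kurt <- sum(dev^4) / (n * var(x)^2) - 3  # excess kurtosis (normal = 0)

  se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
  se_kurt <- sqrt(24 * n * (n - 1)^2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

  c(skew.2SE = skew / (2 * se_skew), kurt.2SE = kurt / (2 * se_kurt))
}

set.seed(1)                  # arbitrary seed for reproducibility
skewed_sample <- rexp(500)   # strongly right-skewed by construction
ratios <- two_se_ratios(skewed_sample)
print(ratios)

# Flag the sample when either ratio is beyond +/- 1.
looks_non_normal <- any(abs(ratios) > 1)
print(looks_non_normal)
```

Because the standard errors shrink as the sample grows, large samples will cross the +/- 1 threshold even for mild departures from normality, which is worth keeping in mind when reading these values.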
Homogeneity of Variance
If we are comparing interval data values by categorical data values (like ecommerce basket size by city), then the variance should be similar for each category. For example, we shouldn’t have a large basket size variance for Los Angeles and a small one for New York. If we do, then the variance is heterogeneous and there could be real issues with attempting to compare cities (in this example). We can run Levene’s test in R to check the homogeneity of variance. This function is found in the car package:
    myvarTest <- leveneTest(mydata$basketsize, mydata$city)
This will output the degrees of freedom, the F value, and the Pr(>F) value. That last one is the most important. If it is over 0.05, the test found no significant difference between the variances, so we can treat them as similar. However, if it is under 0.05, the variances are not similar and this will need to be addressed.
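For a feel of what the test is doing, here is a hand-rolled sketch in base R of the median-centered version of Levene's test, which I understand to be the default in car. It measures how far each basket is from its own city's median, then runs a one-way ANOVA on those distances. The simulated `baskets` data, the seed, and the variable names are all made up for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Made-up basket sizes: Los Angeles is far more spread out than New York.
baskets <- data.frame(
  basketsize = c(rnorm(100, mean = 80, sd = 30), rnorm(100, mean = 80, sd = 5)),
  city = factor(rep(c("Los Angeles", "New York"), each = 100))
)

# Absolute deviation of each basket from its own city's median.
city_median <- ave(baskets$basketsize, baskets$city, FUN = median)
baskets$abs_dev <- abs(baskets$basketsize - city_median)

# One-way ANOVA on those deviations gives the Levene-style F test.
levene_table <- anova(lm(abs_dev ~ city, data = baskets))
print(levene_table)

# A Pr(>F) value under 0.05 means the variances differ significantly.
p_value <- levene_table[["Pr(>F)"]][1]
variances_differ <- p_value < 0.05
print(variances_differ)
```

With car installed, the equivalent call on this data would be leveneTest(basketsize ~ city, data = baskets).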
We want to explore our data and run these types of exploratory tests before running any other tests or building any models based on the data. This is because the data may need correcting or may be flawed in ways we did not anticipate. This post just scratches the surface of exploratory data analysis, but I hope it is a useful step in understanding how to explore data.