In this post I am going to briefly walk through some things about data you may not have thought about. I think this might give you a different and useful perspective on the nature of data. I’ll start with the various types of data and data elements. Then I’ll get into some data issues and finish up with data standardization and classification.
Types of Data
Generally, data falls into one of three types:
Quantitative data describes what happened. For example, how many visitors came to your website or how many subscribers opened your email.
Qualitative data describes why it happened, such as why visitors came to your website (product information, shopping, etc.) or why a subscriber opened your email (a good sale, a reminder that they needed a new pair of shoes, etc.).
Descriptive data describes the individual (a visitor or an email subscriber, for example).
Digging a little deeper, data elements typically fall into one of four types: continuous, or one of three kinds of categorical (nominal, ordinal, or binary).
Continuous data elements are those that are most often aggregated (measures, in data-warehousing speak), such as income or sales. Categorical data elements take on a limited set of values and are most often used as dimensions in graphs, tables, and charts. A categorical data element is nominal if its values have no particular order, like marital status or profession. It is ordinal if its values do have an order, such as a credit rating or an engagement score expressed as high, medium, or low. Lastly, it is binary if it can only take on one of two values, such as gender or employment status (if you are only accepting employed or unemployed; otherwise it would be a nominal data element).
Data is rarely perfect, and issues with certain attributes of data can skew any meaningful analysis. Here are three attributes that can cause issues with your data:
Outlier data can be as simple as a unidimensional outlier, such as income, where all values are below \$200K except for one value of \$10 million. Or it can be a multivariate outlier, such as income and age together, where both are far outside the norm, thereby skewing the data in both dimensions.
There are a number of ways of dealing with outliers. A simple histogram or box plot can point out unidimensional outliers visually. You can also use a little statistics: calculate z-scores and pick a cutoff score beyond which data will not be included in the analysis. Regression lines can be used to identify multivariate outliers. I have rarely had to concern myself with multivariate outliers in digital analysis, but I often have to deal with unidimensional outliers, so it is important to know how to identify and deal with them.
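To make the z-score idea concrete, here is a minimal sketch in Python. The function name, the sample incomes, and the cutoff of 3 are my own assumptions for illustration; the right cutoff depends on your data.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold.

    The threshold of 3.0 is an assumed, commonly used cutoff, not a rule.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical incomes: eleven ordinary values and one extreme one.
incomes = [42_000, 55_000, 61_000, 48_000, 73_000, 150_000,
           95_000, 38_000, 67_000, 82_000, 120_000, 10_000_000]

outliers = zscore_outliers(incomes)
kept = [v for v in incomes if v not in outliers]
```

One caveat: in very small samples no value can reach a z-score of 3, so the cutoff may need to be lowered or a box plot used instead.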
Incomplete data is data that is missing some data elements, for example, a personal profile without an email address. This is described as a data gap, and a plan needs to be formulated to deal with this kind of issue.
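Before you can plan for data gaps, it helps to know how big they are. The following is a small sketch, assuming profiles are stored as dictionaries; the field names and the helper `gap_report` are hypothetical.

```python
def gap_report(profiles, required_fields):
    """Count how many profiles are missing each required field.

    A field counts as missing if it is absent, None, or blank.
    """
    gaps = {field: 0 for field in required_fields}
    for profile in profiles:
        for field in required_fields:
            value = profile.get(field)
            if value is None or str(value).strip() == "":
                gaps[field] += 1
    return gaps

# Hypothetical profiles: one complete, one with a blank email, one with none.
profiles = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "email": ""},
    {"name": "Cy"},
]

report = gap_report(profiles, ["name", "email"])
```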
And then there is the one attribute of data we all try to avoid: dirty. Dirty data is data that is inconsistent or duplicated. This usually occurs because the data store was not well defined. For example, the city of Los Angeles is stored as ‘Los Angeles’, ‘LA’ and ‘L.A.’, or there are two personal profiles with the same email address. As with incomplete data, a plan needs to be formulated to deal with this kind of issue.
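Both kinds of dirty data from the example above can be sketched in a few lines of Python. The alias table and the function names are assumptions for illustration; a real cleanup would need a much fuller mapping and a rule for which duplicate to keep.

```python
from collections import defaultdict

# Assumed alias table: every known spelling maps to one canonical value.
CITY_ALIASES = {
    "los angeles": "Los Angeles",
    "la": "Los Angeles",
    "l.a.": "Los Angeles",
}

def clean_city(raw):
    """Map a known spelling to its canonical city name; else just trim it."""
    trimmed = raw.strip()
    return CITY_ALIASES.get(trimmed.lower(), trimmed)

def duplicate_emails(profiles):
    """Group profiles by normalized email and keep groups with more than one."""
    by_email = defaultdict(list)
    for profile in profiles:
        by_email[profile["email"].strip().lower()].append(profile)
    return {email: group for email, group in by_email.items() if len(group) > 1}

# Hypothetical profiles with inconsistent cities and a repeated email.
profiles = [
    {"email": "pat@example.com", "city": "LA"},
    {"email": "Pat@Example.com ", "city": "L.A."},
    {"email": "sam@example.com", "city": "Chicago"},
]

cities = [clean_city(p["city"]) for p in profiles]
dupes = duplicate_emails(profiles)
```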
Data Standardization & Classification
One of the things you may run into is comparing two or more sets of data. To compare apples to apples, the data needs to be standardized, because values that result from division (a post's number of comments divided by fans, for example) may be so small that they cannot be meaningfully compared with other values when looking at trending (are the values rising or falling). A technique such as decimal scaling (dividing each value by a power of 10) can bring the values to a similar scale without compromising the trending.
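Here is a minimal sketch of decimal scaling in Python. It assumes the common convention of choosing the power of 10 so that the largest absolute value ends up below 1; the function name and the two sample series are made up for illustration.

```python
import math

def decimal_scale(values):
    """Divide every value by 10**j, with j chosen so max |value| falls below 1.

    Returns the scaled values and the exponent j (negative for tiny values).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values), 0  # nothing to scale
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values], j

# Hypothetical series on very different scales, both trending upward.
engagement = [0.00012, 0.00019, 0.00031, 0.00042]  # comments / fans
page_views = [120, 340, 610, 950]

scaled_engagement, j_engagement = decimal_scale(engagement)
scaled_views, j_views = decimal_scale(page_views)
```

Because every value in a series is divided by the same number, the order and relative size of the values are unchanged, so rising or falling trends are preserved.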
Lastly, it is sometimes necessary to categorize (or classify) data to reduce the number of distinct values. This essentially transforms a continuous data element into a nominal or ordinal data element, or reduces a categorical data element to fewer categories/classes. For example, age is often categorized into age groups. This makes analysis easier when you want to use age as a dimension instead of a measure.
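As a quick sketch of the age example, the following Python turns a continuous age into an ordinal age group. The group boundaries and labels are assumptions; use whatever breakpoints fit your analysis.

```python
from bisect import bisect_right
from collections import Counter

# Assumed breakpoints: each edge is the lowest age of the next group.
AGE_EDGES = [18, 25, 35, 45, 55, 65]
AGE_LABELS = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

def age_group(age):
    """Return the label of the age group that contains the given age."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

# Hypothetical visitor ages, now usable as a dimension.
ages = [17, 18, 24, 25, 41, 65, 70]
groups = [age_group(a) for a in ages]
counts = Counter(groups)
```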
I hope this gets you thinking not only about your data, but the types of data you have and some ideas of how to deal with them. Data is increasingly becoming very important in making intelligent business decisions (data-driven decision making) and that’s not going to change any time soon. I think that’s a good thing.