In this post, I am going to briefly walk through some things about data you may not have thought about. I think this might give you a different and useful perspective. I’ll start with the various types of data. Then I’ll get into some issues you may encounter and finish up with standardization and classification.
Types of Data
Generally, data falls into one of three types:
Quantitative describes what happened. For example, how many visitors came to your website or how many subscribers opened your email.
Qualitative describes why it happened, such as why visitors came to your website (product information, shopping, etc.) or why the subscriber opened your email (a good sale, a reminder that they needed a new pair of shoes, etc.).
Descriptive describes the individual (a visitor or an email subscriber, for example).
Digging a little deeper, attributes typically fall into one of four types:
Continuous types are those that are most often aggregated (measures, in data-warehousing speak), such as income or sales.
Categorical types take on a limited set of values and are most often used as dimensions in graphs, tables, and charts. The remaining three types are all categorical:
Nominal types take values with no particular order, like marital status or profession.
Ordinal types are like nominal types, except that the values have an order to them, such as a credit rating or an engagement score expressed as high, medium, or low.
Binary types can take on only one of two values, such as gender or employment status (if you accept only employed or unemployed; otherwise it would be a nominal type).
Data is rarely perfect, and problems with certain attributes can skew an otherwise meaningful analysis. Three kinds of problems come up most often: outliers, incomplete data, and dirty data.
Outliers can be simple (unidimensional), such as an income attribute where all values are below \$200K except for one value of \$10 million. They can also be multivariate, such as income and age taken together, where both are far outside the norm, thereby skewing the data in both dimensions.
There are a number of ways of dealing with outliers. A simple histogram or box plot is enough to spot unidimensional outliers. You can also use a little statistics: calculate z-scores and pick a cutoff score beyond which values will not be included in the analysis. Regression lines can be used to identify multivariate outliers. Whichever method you use, it is important to know how to identify outliers and decide what to do with them.
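The z-score approach can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the function names, the sample incomes, and the cutoff of 3 are all assumptions made for the example, and the right cutoff depends on your data.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def split_outliers(values, threshold=3.0):
    """Split values into (kept, flagged) using an absolute z-score cutoff.

    The default threshold of 3 is a common rule of thumb, assumed here.
    """
    scores = z_scores(values)
    kept = [v for v, z in zip(values, scores) if abs(z) <= threshold]
    flagged = [v for v, z in zip(values, scores) if abs(z) > threshold]
    return kept, flagged

# Hypothetical incomes: twenty values below 200K and one value of 10 million.
incomes = [
    38_000, 42_000, 45_000, 51_000, 55_000, 58_000, 62_000, 66_000, 71_000,
    75_000, 79_000, 84_000, 90_000, 97_000, 105_000, 118_000, 132_000,
    150_000, 171_000, 195_000, 10_000_000,
]

kept, flagged = split_outliers(incomes)
print("flagged as outliers:", flagged)
print("values kept for analysis:", len(kept))
```

Keep in mind that an extreme value inflates the mean and standard deviation it is measured against, so on very small samples a z-score cutoff can miss an outlier that a box plot would show plainly.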
Incomplete data is data that is missing some fields, for example a personal profile with no email address. This is described as a data gap, and a plan needs to be formulated to deal with it.
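Before you can plan around a data gap, it helps to measure it. Here is a small sketch in Python that lists which required fields are missing from each profile. The field names, the sample profiles, and the `missing_fields` helper are hypothetical and only serve the illustration.

```python
def missing_fields(profile, required=("name", "email")):
    """Return the required fields that are absent, None, or blank."""
    return [
        field for field in required
        if not str(profile.get(field) or "").strip()
    ]

# Hypothetical personal profiles, two of them with a gap.
profiles = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": None},
    {"name": "Cat"},
]

gaps = {p["name"]: missing_fields(p) for p in profiles}
incomplete = [name for name, fields in gaps.items() if fields]
print(gaps)
print("incomplete profiles:", incomplete)
```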
One attribute of data we all try to avoid is dirtiness. Dirty data is data that is inconsistent or duplicated. This usually occurs because the data was not well defined. For example, the city of Los Angeles is stored as ‘Los Angeles’, ‘LA’, and ‘L.A.’, or there are two personal profiles with the same email address. As with incomplete data, a plan needs to be formulated to deal with this kind of issue.
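Both kinds of dirty data described above can be caught with simple checks. The Python sketch below maps the spelling variants of a city to one canonical value and finds email addresses shared by more than one profile. The alias table, the helper names, and the sample profiles are assumptions for illustration; a real alias table would have to be built from your own data.

```python
from collections import Counter

# Hypothetical alias table: normalized spelling -> canonical city name.
CITY_ALIASES = {
    "la": "Los Angeles",
    "los angeles": "Los Angeles",
}

def canonical_city(raw):
    """Map known spelling variants of a city to one canonical value."""
    key = raw.strip().lower().replace(".", "")
    # Unknown cities are returned unchanged apart from trimmed whitespace.
    return CITY_ALIASES.get(key, raw.strip())

def duplicate_emails(profiles):
    """Return email addresses that appear on more than one profile."""
    counts = Counter(p["email"].strip().lower() for p in profiles)
    return sorted(email for email, n in counts.items() if n > 1)

# Hypothetical profiles with inconsistent cities and one duplicated email.
profiles = [
    {"email": "ann@example.com", "city": "Los Angeles"},
    {"email": "bob@example.com", "city": "LA"},
    {"email": "Ann@Example.com", "city": "L.A."},
]

cities = {canonical_city(p["city"]) for p in profiles}
dupes = duplicate_emails(profiles)
print(cities)
print(dupes)
```

Comparing email addresses in lowercase is itself a design choice: it treats addresses that differ only by case as the same person, which is usually what you want when looking for duplicate profiles.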
Data Standardization & Classification
To compare apples to apples, you need to standardize. Ratios (a post’s number of comments divided by its number of fans, for example) can be so small that they cannot be meaningfully compared when looking at trends (are the values rising or falling?). A technique such as decimal scaling (dividing each value by a power of 10) brings the values to a similar scale without compromising the trend.
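Here is a minimal sketch of decimal scaling in Python. It assumes the common textbook rule for picking the power of 10: use the smallest exponent that brings every absolute value below 1. For very small ratios that exponent is negative, so dividing by the power of 10 scales the values up. The function name and sample numbers are hypothetical.

```python
import math

def decimal_scale(values):
    """Scale values by a power of 10 so that every absolute value is below 1.

    Returns (scaled_values, exponent), where each scaled value is
    value / 10 ** exponent. The exponent is negative for tiny inputs.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values), 0
    exponent = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** exponent for v in values], exponent

# Hypothetical series on very different scales.
comments_per_fan = [0.0012, 0.0031, 0.0042]  # tiny ratios
page_views = [120, 450, 986]                 # large counts

scaled_ratio, ratio_exp = decimal_scale(comments_per_fan)
scaled_views, views_exp = decimal_scale(page_views)
print(ratio_exp, scaled_ratio)
print(views_exp, scaled_views)
```

Because every value in a series is divided by the same constant, the order and relative change between values are preserved, which is why the trend survives the scaling.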
Lastly, sometimes it is necessary to categorize (or classify) to reduce the number of distinct values you have to work with. This transforms a continuous attribute into a nominal or ordinal attribute, or reduces a categorical attribute to fewer categories/classes. For example, age is often categorized into age groups. This makes analysis easier when you want to use age as a dimension instead of a measure.
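The age example can be sketched in Python like this. The break points and group labels are assumptions chosen for illustration; use whatever groups make sense for your analysis.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical break points: each one is the lowest age of the next group.
AGE_BREAKS = [18, 25, 35, 45, 55, 65]
AGE_LABELS = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

def age_group(age):
    """Turn a continuous age into an ordinal age-group label."""
    return AGE_LABELS[bisect_right(AGE_BREAKS, age)]

# Hypothetical ages of website visitors.
ages = [17, 18, 24, 25, 40, 64, 65, 80]

groups = [age_group(a) for a in ages]
visitors_per_group = Counter(groups)
print(groups)
print(visitors_per_group)
```

Once age is a label instead of a number, it can be used as a dimension, for example to count visitors per age group as the `Counter` does here.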
Data is becoming increasingly important in making intelligent business decisions (data-driven decision making), and that’s not going to change any time soon. I think that’s a good thing.
You can also check out my GitHub repository – https://github.com/daranjjohnson for free code in R & SQL.