A one-way ANOVA is used to test the null hypothesis that three or more groups from a population share the same mean (a t-test is generally used when comparing two groups). To use this method, we take a random sample from each group. The samples do not have to be the same size, and in the data below they are not. Then we compare the variation between the group means with the variation within each group.
For this post, we are going to test whether the city an email recipient resides in (the independent variable) affects how many emails from our email campaigns they open (the dependent variable). We will do this by comparing the City column to the Number of Marketing Emails Opened column.
In a broader, more business-relevant sense, we are asking whether some cities are more receptive to our email campaigns than others. If so, perhaps we should segment email campaigns by the city the email recipient lives in.
Here is our fictitious data of email recipients:
| Name | Number of Marketing Emails Opened | City | Education |
|---|---|---|---|
| Ellie Fickle | 10 | Boston | Some College |
| Tina James | 12 | Boston | No College |
| Heidi Bloom | 12 | Boston | Some College |
| Wendy Hines | 12 | Boston | Some College |
| Terry Meirs | 5 | Boston | No College |
| Lee Newbery | 8 | Boston | Some College |
| Wess Leas | 9 | Boston | No College |
| Omar Parsons | 10 | Boston | No College |
| Don Rains | 11 | Boston | Some College |
| Morris Heims | 11 | Boston | Some College |
| Tom Jones | 1 | Los Angeles | No College |
| Joe Durham | 2 | Los Angeles | Grad |
| Fred Bailey | 3 | Los Angeles | No College |
| Timmy Tom | 2 | Los Angeles | No College |
| Ron Howard | 1 | Los Angeles | Some College |
| Evan Goram | 2 | Los Angeles | PostGrad |
| Weisley Fir | 2 | Los Angeles | No College |
| Henriett Kenatte | 5 | Los Angeles | Some College |
| Wendy Porsens | 8 | Los Angeles | No College |
| Mindy Charms | 7 | Los Angeles | Some College |
| Randi Lamb | 2 | Los Angeles | Some College |
| Irene Kittens | 1 | Los Angeles | Some College |
| Jamie Rider | 1 | Los Angeles | Some College |
| Laura Burns | 1 | Los Angeles | PostGrad |
| Yolly Role | 12 | Seattle | No College |
| Parker Jones | 12 | Seattle | Some College |
| Icabod Craine | 10 | Seattle | No College |
| Rainer Forest | 12 | Seattle | Some College |
| Stick Mudd | 11 | Seattle | No College |
Calculate Sum of Squares Total
The first thing we have to do is find the difference between each email recipient’s number of emails opened and the average emails opened for all email recipients. Because we want to know how far each email recipient is from the overall average, we are not interested in whether a given recipient’s number is less than or greater than the average, just in the distance from it. To capture this, we square each difference, which eliminates negative numbers. Then we add the squares together. This gives us the Sum of Squares Total, which in this case is 713.62.
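As a rough check, the calculation can be sketched in plain Python. The names `opens_by_city`, `grand_mean` and `ss_total` are illustrative, and the lists are transcribed from the table above. On the 29 rows as listed, this sketch gives about 557.79 rather than the figure quoted above, so the quoted figure may have been computed from a different version of the data.

```python
# Emails opened per recipient, transcribed from the table and grouped by city.
opens_by_city = {
    "Boston": [10, 12, 12, 12, 5, 8, 9, 10, 11, 11],
    "Los Angeles": [1, 2, 3, 2, 1, 2, 2, 5, 8, 7, 2, 1, 1, 1],
    "Seattle": [12, 12, 10, 12, 11],
}

# Pool every recipient, ignoring city, to get the overall (grand) mean.
all_opens = [x for group in opens_by_city.values() for x in group]
grand_mean = sum(all_opens) / len(all_opens)

# Sum of Squares Total: squared distance of each recipient from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_opens)

print(f"recipients={len(all_opens)}, grand mean={grand_mean:.2f}, SS total={ss_total:.2f}")
```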
Calculate Sum of Squares Between
Next we do something similar for each city. Calculate the mean (average) Email Opens for each city, then subtract the overall mean from each city’s mean. Square each result (to eliminate negative numbers), then multiply it by the number of email recipients in that city. Adding these products together gives us the Sum of Squares Between. If I’ve done this correctly, the Sum of Squares Between should be 544.19.
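A minimal Python sketch of this step follows, using values transcribed from the table; the variable names are illustrative. Note that on the rows as listed it comes out at about 441.74, not 544.19, so the quoted figure may come from a different version of the data.

```python
# Emails opened per recipient, transcribed from the table and grouped by city.
opens_by_city = {
    "Boston": [10, 12, 12, 12, 5, 8, 9, 10, 11, 11],
    "Los Angeles": [1, 2, 3, 2, 1, 2, 2, 5, 8, 7, 2, 1, 1, 1],
    "Seattle": [12, 12, 10, 12, 11],
}

all_opens = [x for group in opens_by_city.values() for x in group]
grand_mean = sum(all_opens) / len(all_opens)

# Sum of Squares Between: for each city, the squared gap between the city
# mean and the grand mean, weighted by the number of recipients in that city.
ss_between = 0.0
for city, opens in opens_by_city.items():
    city_mean = sum(opens) / len(opens)
    ss_between += len(opens) * (city_mean - grand_mean) ** 2
    print(f"{city}: n={len(opens)}, mean={city_mean:.2f}")

print(f"SS between={ss_between:.2f}")
```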
Calculate Sum of Squares Within or Sum of Squares Error
Now we need to calculate how far each value within each group is from its group mean. We do the same thing we did to get the Sum of Squares Total, but this time we use the mean for the recipient’s own city rather than the overall mean. This can also be calculated by subtracting the Sum of Squares Between from the Sum of Squares Total. I calculated this both ways and came up with 169.43, which gives me more confidence in both of the calculations above.
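The "both ways" check can be sketched in Python as below, again with illustrative names and values transcribed from the table. The identity (total = between + within) holds for any data; on the rows as listed the within figure is about 116.06 rather than 169.43, which may indicate the quoted numbers were computed from a different version of the data.

```python
import math

# Emails opened per recipient, transcribed from the table and grouped by city.
opens_by_city = {
    "Boston": [10, 12, 12, 12, 5, 8, 9, 10, 11, 11],
    "Los Angeles": [1, 2, 3, 2, 1, 2, 2, 5, 8, 7, 2, 1, 1, 1],
    "Seattle": [12, 12, 10, 12, 11],
}

all_opens = [x for group in opens_by_city.values() for x in group]
grand_mean = sum(all_opens) / len(all_opens)
city_means = {city: sum(g) / len(g) for city, g in opens_by_city.items()}

ss_total = sum((x - grand_mean) ** 2 for x in all_opens)
ss_between = sum(
    len(g) * (city_means[city] - grand_mean) ** 2
    for city, g in opens_by_city.items()
)

# Way 1: squared distance of each recipient from their own city's mean.
ss_within_direct = sum(
    (x - city_means[city]) ** 2
    for city, g in opens_by_city.items()
    for x in g
)

# Way 2: whatever is left of the total after removing the between part.
ss_within_by_subtraction = ss_total - ss_between

print(f"direct={ss_within_direct:.2f}, by subtraction={ss_within_by_subtraction:.2f}")
print("methods agree:", math.isclose(ss_within_direct, ss_within_by_subtraction))
```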
Variance Estimates of Sum of Squares Between and Sum of Squares Within
To calculate the variance estimate of the Sum of Squares Between (often called the mean square between), we divide the Sum of Squares Between by the number of groups minus one. With three cities, that is 544.19 / 2, which gives us 272.10.
To calculate the variance estimate of the Sum of Squares Within, we divide the Sum of Squares Within by the number of email recipients minus the number of cities. With the 29 recipients in the table above and 3 cities, that is 169.43 / (29 - 3) = 169.43 / 26, which comes out to about 6.52.
Calculating the F-ratio
The final step is to calculate the F-ratio. We divide the variance estimate of the Sum of Squares Between by the variance estimate of the Sum of Squares Within. This number is used to decide whether the average number of emails opened differs between cities by more than chance alone would explain. Our F-ratio is 272.10 / 6.52, or roughly 42.
Use a table of the critical values for the F distribution to find the relevant F-value, using the two divisors above as the degrees of freedom (2 and 26 here). If the F-ratio from the last step is greater than the value from the table, the difference between the cities is significant at the table's level of significance (e.g. 0.05). Otherwise, it is not significant at that level.
If you look up the critical F-value at a significance level of 0.05 with 2 and 26 degrees of freedom, it is about 3.37. That is far lower than our F-ratio of roughly 42. So, in this case, we conclude that there is probably a relationship between the city an email recipient resides in and how many of our emails they open.
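The arithmetic from the sums of squares to the decision can be sketched as follows. The function `f_ratio` and the constant `F_CRITICAL` are hypothetical names for illustration. The sketch assumes the 29 recipients and 3 cities shown in the table, and it takes the critical value from a standard F table for 2 and 26 degrees of freedom rather than computing it.

```python
def f_ratio(ss_between, ss_within, n_recipients, n_groups):
    """Return (mean square between, mean square within, F-ratio)."""
    df_between = n_groups - 1            # number of groups minus one
    df_within = n_recipients - n_groups  # recipients minus number of groups
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Sums of squares as quoted in the post; counts taken from the table (assumption).
ms_b, ms_w, f = f_ratio(544.19, 169.43, n_recipients=29, n_groups=3)

# Assumed critical value for a 0.05 significance level with 2 and 26 degrees
# of freedom, read from a standard F table.
F_CRITICAL = 3.37

print(f"MS between={ms_b:.2f}, MS within={ms_w:.2f}, F={f:.2f}")
print("significant at 0.05" if f > F_CRITICAL else "not significant at 0.05")
```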
To summarize the process:
1. Calculate the sum of squares across all email recipients in the whole sample (Sum of Squares Total).
2. Calculate the sum of squares between the cities (Sum of Squares Between).
3. Calculate the sum of squares for the email recipients within each city (Sum of Squares Within).
4. Divide #2 above by the number of groups minus one.
5. Divide #3 above by the number of email recipients minus the number of groups.
6. Divide #4 above by #5 above. This is the F-ratio.
7. Look up the critical value in an F-values table. If the F-ratio is greater than the value from the table, we can conclude that the city an email recipient resides in has an effect on how many emails they open.
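The steps above can be sketched as a single Python function. `one_way_anova` is a hypothetical helper written for this post, not a library function, and the data is transcribed from the table. The last step (the table lookup) is left to the reader. On the rows as listed, the sums of squares it prints (about 557.79, 441.74 and 116.06) differ from the figures quoted earlier, which may have been computed from a different version of the data.

```python
def one_way_anova(groups):
    """Follow the summarized steps for a dict of {group name: list of values}."""
    values = [x for group in groups.values() for x in group]
    grand_mean = sum(values) / len(values)
    means = {name: sum(g) / len(g) for name, g in groups.items()}

    # Steps 1-3: total, between-group and within-group sums of squares.
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(g) * (means[name] - grand_mean) ** 2 for name, g in groups.items()
    )
    ss_within = sum(
        (x - means[name]) ** 2 for name, g in groups.items() for x in g
    )

    # Steps 4-5: divide each sum of squares by its degrees of freedom.
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    # Step 6: the F-ratio, to be compared against an F table.
    return {
        "ss_total": ss_total,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df": (df_between, df_within),
        "f_ratio": ms_between / ms_within,
    }


opens_by_city = {
    "Boston": [10, 12, 12, 12, 5, 8, 9, 10, 11, 11],
    "Los Angeles": [1, 2, 3, 2, 1, 2, 2, 5, 8, 7, 2, 1, 1, 1],
    "Seattle": [12, 12, 10, 12, 11],
}

result = one_way_anova(opens_by_city)
for key, value in result.items():
    print(key, value)
```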
If you want to give it a try, repeat the calculation with Education as the independent variable and see whether it affects how many of our emails a recipient opens (our dependent variable).
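One way to set this up, sketched in Python: regroup the same rows by Education and run the same arithmetic. The `rows` list is transcribed from the table as (emails opened, education) pairs, and all names are illustrative. Note that the Grad group has a single recipient in the table, so it adds nothing to the within-group sum of squares. The F-ratio still needs to be compared against an F table using the printed degrees of freedom.

```python
# (emails opened, education) for each of the 29 recipients in the table.
rows = [
    (10, "Some College"), (12, "No College"), (12, "Some College"),
    (12, "Some College"), (5, "No College"), (8, "Some College"),
    (9, "No College"), (10, "No College"), (11, "Some College"),
    (11, "Some College"), (1, "No College"), (2, "Grad"),
    (3, "No College"), (2, "No College"), (1, "Some College"),
    (2, "PostGrad"), (2, "No College"), (5, "Some College"),
    (8, "No College"), (7, "Some College"), (2, "Some College"),
    (1, "Some College"), (1, "Some College"), (1, "PostGrad"),
    (12, "No College"), (12, "Some College"), (10, "No College"),
    (12, "Some College"), (11, "No College"),
]

# Regroup the emails-opened counts by education level instead of by city.
opens_by_education = {}
for opens, education in rows:
    opens_by_education.setdefault(education, []).append(opens)

values = [opens for opens, _ in rows]
grand_mean = sum(values) / len(values)

ss_between = 0.0
ss_within = 0.0
for group in opens_by_education.values():
    group_mean = sum(group) / len(group)
    ss_between += len(group) * (group_mean - grand_mean) ** 2
    ss_within += sum((x - group_mean) ** 2 for x in group)

df_between = len(opens_by_education) - 1
df_within = len(values) - len(opens_by_education)
f = (ss_between / df_between) / (ss_within / df_within)

print(f"degrees of freedom: {df_between} and {df_within}")
print(f"F-ratio for Education: {f:.2f}")
```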
Later, you can look at City and Education together: whether the two combined affect how many emails a recipient opens, whether they interact with each other, and to what extent that interaction affects the outcome.
Of course, this feels very laborious and there are many tools that will do all these calculations for you. But, I think it’s important to understand what is happening under the hood, since this is just one model to do this type of calculation.