A/B testing is one of the best ways to optimise your website. However, it is not a magical cure for websites that are:

• Poorly Conceived – Website is off-brand or there is a major disconnect with users.
• Poorly Designed – Website is designed in a way that is hard to maintain/confusing.
• Poorly Developed – Website has a lot of code bloat or inefficient code.
• Using Sub-Par/Inflexible Technology – Website is created with technology that allows for very little optimisation or growth.

For major issues such as these, it is best to start with…a redesign. That is not something I would recommend just because your website is a few years old. But if your website has real issues, there is no way to A/B test your way out of that.

A/B tests are designed for specific functions, such as:

• Comparing a percentage off versus a dollar value off for conversion.
• Comparing different percentages off for conversion.
• Comparing a sales-focused landing page versus a content-focused landing page.
• Removing and/or adding a step in a funnel.

Notice that these are tests that can be utilised beyond just the test itself. For example, knowing that a percentage off drives more conversions is useful for all your marketing. Knowing that 25% off does just as well as 30% off would be very valuable in saving the company from unnecessary discounts.

Also, notice that some of these tests are not just a webpage versus another webpage. Instead, think of these as one experience versus another experience. For example, if we are testing percentage off versus a dollar value off for conversion, that will need to be reinforced throughout the website, from the landing page to the thank you page.

The bottom line is that for A/B tests to be successful, they need to be planned and the programming has to be in place to execute them properly. Once you set up a process for A/B testing and execute it successfully, you will find it incredibly useful. For this post, I am going to address the analytics side of A/B testing, which is just as important as the technical side.

Before Starting An A/B Test

Before beginning an A/B test, define the purpose and scope of the test. A/B tests can get complex (believe it or not, you may even forget exactly what you were testing), so you want to carefully document the test. What is a complex A/B test, you ask? How about testing which cross-sell products to recommend to a returning visitor, but only if they have been to the site within the last six months, have only bought products from a given category, have come from three given campaigns and are on desktop? There’s a solid business reason behind this test, but without good documentation, the exact reason can be forgotten.

The A/B test should not morph into something else, where other metrics are used instead of the original, agreed-upon metrics. For example, the test above should not morph into a test of the overall conversion of each of the three campaigns, since the test is focused on the cross-sell products and not overall conversion. You can imagine the test above becoming confusing quickly – are we testing which cross-sell products get selected most often, which cross-sell product categories lead to better conversion, or which cross-sell product combinations have the highest number of cross-sell products selected per user?

The next step in preparation for an A/B test is to define success for the B (or B/C/D – there can be more than one alternative). In other words, what would the difference between the control and the alternative have to be for you to switch to the alternative? Something simple would be an increase in conversion rate of more than 0.50 percentage points. If you skip this step, you could end up running the test indefinitely and only stopping when you get your desired outcome (nullifying the result).

If we use a frequentist approach to the test (it would be similar with a Bayesian approach), let’s say we want a 0.05 percentage point lift in conversion and think the alternative experience will deliver on that goal. Our current conversion rate is 3.00% and we would like to increase that to 3.05%. We want to use a 95% confidence level, which is a significance level of 0.05 (that’s the standard, but you can use a lower confidence level – it’s a website, not a clinical trial ;-)) and power of 80%. To find out the sample size required, you can use the pwr.t.test function from the pwr package. (The pwr package also has pwr.2p.test, which is built for comparing two proportions; at sample sizes this large it gives nearly the same n.)

library(pwr)

# Enter the control conversion rate.
# Here we'll say it's 3%.
control_conv_rate <- 0.03
# For the alternative, we add 0.05% to the control.
alternate_conv_rate <- control_conv_rate + 0.0005

# Next we need to calculate the effect size by using the ES.h function.
effect_size = ES.h(p1 = alternate_conv_rate, p2 = control_conv_rate)

# We feed in the effect size, with the sig.level (1 - confidence level),
# power and type, which is a two.sample test.
pwr.t.test(d = effect_size, sig.level = .05, power = .80, type = 'two.sample')

##
##      Two-sample t test power calculation
##
##               n = 1841937
##               d = 0.002919316
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group


In the results, you can see n = 1,841,937. That is the sample size needed for each version tested to detect a 0.05 percentage point increase in conversion rate at a 95% confidence level and 80% power. Unless you have a lot of users coming to your site, this test is probably not worth doing. My point is that for small differences, it takes a lot of data. Also, just stating that you want to see which one performs better is really not a good way of conducting an A/B test.

Let’s say instead of a 0.05% increase in conversion, you want to test for a 0.50% increase in conversion, from 3.00% to 3.50%:

# Enter the control conversion rate.
# Here we'll again say it's 3%.
control_conv_rate <- 0.03
# For the alternative, we add 0.50% to the control.
alternate_conv_rate <- control_conv_rate + 0.005

# Next we calculate the effect size by using the ES.h function.
effect_size = ES.h(p1 = alternate_conv_rate, p2 = control_conv_rate)

# We feed in the effect size, with the sig.level (1 - confidence level), power and type.
pwr.t.test(d = effect_size, sig.level = .05, power = .80, type = 'two.sample')

##
##      Two-sample t test power calculation
##
##               n = 19716.14
##               d = 0.02821746
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group


That’s a much more reasonable sample size – rounding up, you will need 19,717 users for each version. So if you are testing a control against a single alternative, you will need a total of 39,434 users. When you reach that number, the test is over. You don’t want to stop it before then (but you will think you do, because you looked at the results less than a quarter of the way through and it was clear that the alternative was winning, with a 10% increase in conversion). It often happens that one version outperforms the other, and then the lead swings back again. Only by waiting until the planned number of users is reached can you have the most accurate data to analyse the test.
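To see how quickly the required sample size grows as the target lift shrinks, here is a minimal sketch that uses power.prop.test from base R's stats package instead of pwr. The list of lifts is my own illustrative choice, and power.prop.test uses a slightly different approximation than the arcsine effect size above, so expect the numbers to be close to the pwr output rather than identical.

```r
# Required users per version for several target lifts.
# The lifts are illustrative assumptions, not values from a real test.
control_conv_rate <- 0.03
lifts <- c(0.0005, 0.001, 0.0025, 0.005)  # absolute lifts in conversion rate

users_per_version <- sapply(lifts, function(lift) {
  result <- power.prop.test(p1 = control_conv_rate,
                            p2 = control_conv_rate + lift,
                            sig.level = 0.05,
                            power = 0.80)
  # Round up, since you cannot sample a fraction of a user.
  ceiling(result$n)
})

sample_size_tbl <- data.frame(lift = lifts,
                              alternate_conv_rate = control_conv_rate + lifts,
                              users_per_version = users_per_version)

# Display the results.
sample_size_tbl
```

Halving the lift you want to detect roughly quadruples the users you need, which is why it pays to agree on the smallest lift worth acting on before the test starts.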

Now you have all the information to start the test. You have documented what you are testing, what you expect the outcome to be and the number of samples needed for a credible test. The rest is the technical details of ensuring that you are capturing the data correctly.

During the Test

You do nothing during the test. No peeking!
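If you want to convince yourself (or a stakeholder) why peeking is a problem, a small simulation helps. This is a rough sketch under my own assumptions: both versions have the same true 3% conversion rate, so every "winner" is a false positive, we look at the results ten times at evenly spaced points, and significance is checked with a simple pooled two-proportion z-test written out by hand.

```r
set.seed(123)  # fixed seed so the simulation is repeatable

n_sims        <- 2000   # number of simulated A/A tests (illustrative)
users_per_arm <- 20000  # planned sample size per version
n_looks       <- 10     # number of looks, the last one at the planned end
conv_rate     <- 0.03   # same true rate for both versions

batch_size <- users_per_arm / n_looks
look_sizes <- seq(batch_size, users_per_arm, by = batch_size)

# Two-sided p-value from a pooled two-proportion z-test (equal group sizes).
z_test_p_value <- function(x1, x2, n) {
  pooled <- (x1 + x2) / (2 * n)
  se <- sqrt(pooled * (1 - pooled) * 2 / n)
  2 * pnorm(-abs((x1 - x2) / n) / se)
}

sim_one <- function() {
  # Cumulative orders for each version at every look.
  a <- cumsum(rbinom(n_looks, batch_size, conv_rate))
  b <- cumsum(rbinom(n_looks, batch_size, conv_rate))
  p <- z_test_p_value(a, b, look_sizes)
  # "peeking" stops at the first significant look;
  # "final_only" waits for the planned sample size.
  c(peeking = any(p < 0.05), final_only = p[n_looks] < 0.05)
}

results <- replicate(n_sims, sim_one())
false_positive_rate <- rowMeans(results)
false_positive_rate
```

The final_only rate should sit near the 5% you signed up for, while the peeking rate comes out well above it, even though nothing differs between the two versions.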

After the Test

You have all the data from the test. You might have a couple of different metrics (that you stated before the test) to analyse. I like to analyse an A/B test in RMarkdown: pull the data into R, then use functions, graphs and a written summary of the results. One issue I often find is that people treat the sample numbers as if they were the population numbers. So, I use ranges instead. Below is how you can assign all the variables. I have hard-coded some of the values. You can do it that way, or you can pull the data in from a data source.

# Hard coding the users and orders for control.
control_users <- 20000
control_orders <- 600
# Simple calculation for the control conversion rate.
control_conv_rate <- control_orders/control_users

# The main calculation is with the prop.test function.
# We're going to use a two sided test to get the
# CI for the upper and lower values.
control_results <- prop.test(x = control_orders,
                             n = control_users,
                             alternative = 'two.sided')

# We use the brackets to retrieve the upper and lower
# CI values from the prop.test model.
control_lower_ci <- control_results$conf.int[1]
control_upper_ci <- control_results$conf.int[2]

# We do the exact same thing for the alternative
# that we did for the control.
alternate_users <- 20000
alternate_orders <- 700
alternate_conv_rate <- alternate_orders/alternate_users

alternate_results <- prop.test(x = alternate_orders,
                               n = alternate_users,
                               alternative = 'two.sided')

alternate_lower_ci <- alternate_results$conf.int[1]
alternate_upper_ci <- alternate_results$conf.int[2]


Once I have all the variables assigned, I can create a graph showing the difference in performance between the control and alternative. First, I am going to create a table with all the values.

# Create a vector for the conversion rates.
conv_rate <- c(control_conv_rate,
alternate_conv_rate)

# And another one for the lower CI value.
# I'm going to round these to 3 decimal places.
lower_ci  <- c(round(control_lower_ci, 3),
round(alternate_lower_ci, 3))

# And one for the upper CI value.
# I'm going to round these to 3 decimal places.
upper_ci <- c(round(control_upper_ci, 3),
round(alternate_upper_ci, 3))

# Then add a name for each.
version <- c('A. Control',
'B. Alternative')

# Finally, create the dataframe from the four vectors.
conv_tbl <- data.frame(version, conv_rate, lower_ci, upper_ci)

# Display the results.
conv_tbl

##          version conv_rate lower_ci upper_ci
## 1     A. Control     0.030    0.028    0.032
## 2 B. Alternative     0.035    0.033    0.038


Now we can create a graph comparing the two. I’m going to use an error bar and not highlight the conversion rates. Instead, the focus will be on the error bars and nothing else. Also, don’t forget to add the source of the data.

# Add the tidyverse for ggplot
# and the pipe (%>%).
library(tidyverse)
# Add scales for the percent
# on the Y axis.
library(scales)

conv_tbl %>%
  ggplot(aes(x=version, y=conv_rate, group=1)) +
  geom_errorbar(width=.1, aes(ymin=lower_ci, ymax=upper_ci), colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = 'none') +
  ggtitle('Conversion Rate - Confidence Intervals') +
  ylab('Conversion Rate') + xlab('Version') +
  labs(caption = 'Data Source: Your Data Source Here') +
  # Note: from ggplot2 3.4.0 onwards, element_line() expects
  # linewidth instead of size.
  theme(axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        legend.position = 'none',
        panel.grid.major = element_line(colour = "grey", size=0.75)) +
  scale_y_continuous(labels = scales::percent)


As you can see, you have two error bars to focus on and no chart junk to distract you. The results here show that the alternative is indeed better than the control, since the error bar for the alternative is higher than the error bar for the control and the two do not overlap on the Y axis.
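Non-overlapping error bars are a conservative check, since intervals that overlap slightly can still go with a significant difference. As a sketch of a more direct check, you can pass both versions to prop.test in one call and look at the p-value and the interval for the difference. The values repeat the hard-coded numbers above; putting the alternative first is my own choice, so that the estimated difference reads as alternative minus control.

```r
# Same hard-coded numbers as the example above.
control_users    <- 20000
control_orders   <- 600
alternate_users  <- 20000
alternate_orders <- 700

# Two-sample test of equal proportions.
# Alternative first, so the difference is alternative minus control.
comparison <- prop.test(x = c(alternate_orders, control_orders),
                        n = c(alternate_users, control_users),
                        alternative = 'two.sided')

# p-value for the hypothesis that the two conversion rates are equal.
comparison$p.value

# 95% CI for the difference in conversion rates.
comparison$conf.int
```

A p-value below 0.05, together with an interval for the difference that stays above zero, points the same way as the chart.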

What if our sample size was smaller? Let’s say our sample size was only 1,000. Let’s see what that would look like. We’ll keep the conversion rates the same.

# Again, hard coding the users and orders for control.
control_users <- 1000
control_orders <- 30
# Simple calculation for the control conversion rate.
control_conv_rate <- control_orders/control_users

# The main calculation is with the prop.test function.
# We're going to use a two sided test to get the
# CI for the upper and lower values.
control_results <- prop.test(x = control_orders,
                             n = control_users,
                             alternative = 'two.sided')

# We use the brackets to retrieve the upper and lower
# CI values from the prop.test model.
control_lower_ci <- control_results$conf.int[1]
control_upper_ci <- control_results$conf.int[2]

# We do the exact same thing for the alternative
# that we did for the control.
alternate_users <- 1000
alternate_orders <- 35
alternate_conv_rate <- alternate_orders/alternate_users

alternate_results <- prop.test(x = alternate_orders,
                               n = alternate_users,
                               alternative = 'two.sided')

alternate_lower_ci <- alternate_results$conf.int[1]
alternate_upper_ci <- alternate_results$conf.int[2]

# Create a vector for the conversion rates.
conv_rate <- c(control_conv_rate,
alternate_conv_rate)

# And another one for the lower CI value.
# I'm going to round these to 3 decimal places.
lower_ci  <- c(round(control_lower_ci, 3),
round(alternate_lower_ci, 3))

# And one for the upper CI value.
# I'm going to round these to 3 decimal places.
upper_ci <- c(round(control_upper_ci, 3),
round(alternate_upper_ci, 3))

# Then add a name for each.
version <- c('A. Control',
'B. Alternative')

# Finally, create the dataframe from the four vectors.
conv_tbl <- data.frame(version, conv_rate, lower_ci, upper_ci)

# Display the results.
conv_tbl

##          version conv_rate lower_ci upper_ci
## 1     A. Control     0.030    0.021    0.043
## 2 B. Alternative     0.035    0.025    0.049

conv_tbl %>%
  ggplot(aes(x=version, y=conv_rate, group=1)) +
  geom_errorbar(width=.1, aes(ymin=lower_ci, ymax=upper_ci), colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = 'none') +
  ggtitle('Conversion Rate - Confidence Intervals') +
  ylab('Conversion Rate') + xlab('Version') +
  labs(caption = 'Data Source: Your Data Source Here') +
  # Note: from ggplot2 3.4.0 onwards, element_line() expects
  # linewidth instead of size.
  theme(axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        legend.position = 'none',
        panel.grid.major = element_line(colour = "grey", size=0.75)) +
  scale_y_continuous(labels = scales::percent)


The alternative error bar is higher than the control, but the two overlap on the Y axis. The sample size was not large enough to distinguish the 3.0% conversion rate of the control from the 3.5% conversion rate of the alternative. So, even if you calculated the sample size incorrectly, prop.test is a separate step and will calculate the results based on the data actually captured.
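You can also put a number on "not large enough". As a rough sketch, power.prop.test from base R's stats package will solve for the power when you give it the sample size instead. The 3.0% and 3.5% rates and the 1,000 users per version come from the example above; treating them as the true rates is an assumption for the purpose of the calculation.

```r
# Power to detect 3.0% versus 3.5% with only 1,000 users per version.
# Leaving power unset tells power.prop.test to solve for it.
achieved <- power.prop.test(n = 1000,
                            p1 = 0.03,
                            p2 = 0.035,
                            sig.level = 0.05)

# Probability of detecting the difference if it really exists.
achieved$power
```

The power here comes out far below the 80% used for planning, so a test this size would miss a real half-point lift most of the time.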

A CI of 95% does not mean that there is a 95% chance that the population conversion rate is between these particular upper and lower values (if you used a Bayesian calculation, then that is what the interval would represent). It means that if we repeated the test many times and calculated a CI from each sample, about 95% of those intervals would contain the population conversion rate.
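The repeated-sampling reading of a CI is easier to believe once you have simulated it. Here is a minimal sketch: we pretend the population conversion rate is known to be 3% (something you can only do in a simulation), run many simulated tests of 20,000 users, build a 95% CI for each with prop.test, and count how often the interval contains the true rate. The number of simulated tests is an arbitrary choice of mine.

```r
set.seed(123)  # fixed seed so the simulation is repeatable

true_conv_rate <- 0.03   # the "population" rate, known only because we simulate
users          <- 20000  # users in each simulated test
n_experiments  <- 2000   # number of simulated tests (illustrative)

# Orders observed in each simulated test.
orders <- rbinom(n_experiments, users, true_conv_rate)

# For each simulated test, does the 95% CI contain the true rate?
covers_truth <- sapply(orders, function(x) {
  ci <- prop.test(x = x, n = users)$conf.int
  ci[1] <= true_conv_rate && true_conv_rate <= ci[2]
})

# Share of intervals that contain the true rate.
mean(covers_truth)
```

The share should land close to 95%. Any single interval either contains the true rate or it does not; the 95% describes the procedure over many repetitions.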

Conclusion

I would recommend using/learning RMarkdown and including the charts as well as embedding the variables in a written summary (everyone involved will ask for the actual numbers, regardless). There should also be conclusions drawn from the test and actions to be taken based on the results.

–Bonus–

Just to be confusing, there is another way to calculate the results. Instead of using a frequentist approach, you can use a Bayesian one. I’m not going to get too much into it, but here is how to calculate the results (using 3% conversion rate as a prior):

# Load the Bayes package.
library(bayesAB)

# Enter the number of alternatives:
alternates <- 1

# hard code or pull and assign variable values.
control_users <- 20000
control_orders <- 600

alternate_users <- 20000
alternate_orders <- 700

# Create a dataframe the same way as the
# other two examples.
version <- c('A. Control',
'B. Alternate')

smpl_sze <- c(control_users,
alternate_users)

conv_rte <- c((control_orders/control_users),
(alternate_orders/alternate_users))

version_data <- data.frame(version, smpl_sze, conv_rte)

# New columns.
version_data$better_than_cntl <- NA
version_data$better_than_b <- NA

# Note -
# if you don't want the
# probability to keep changing use
# set.seed(123)

i <- 1

# Loop through all of the variants.
while(i <= nrow(version_data)){

  # Work through each variation
  c <- 4 # Column
  r <- 1 # Row

  # Loop through version_data,
  # since we have 1. version, 2. sample size,
  # 3. conv. rate. Ex: if there is only one alternate,
  # then there will be columns 4 and 5 to work through.
  while(c <= (4 + alternates)){

    if(i == r){

      # In this case you would be comparing a version to itself.
      version_data[i, c] <- 'N/A'

    } else{

      # This is the Bayes calculation of successes of one
      # variant versus another variant.
      # All variants will be compared to each other.
      null_binom <- rbinom(version_data[i, 'smpl_sze'],
                           1,
                           version_data[i, 'conv_rte'])
      alt_binom <- rbinom(version_data[r, 'smpl_sze'],
                          1,
                          version_data[r, 'conv_rte'])

      # The bayesTest will use priors of successes and failures
      # (or just use 1 for each if the priors are unknown).
      # The n_samples is the # of posterior samples to draw.
      # 'bernoulli' is for binomial tests.
      AB1 <- bayesTest(null_binom,
                       alt_binom,
                       priors = c('alpha' = 600, 'beta' = (20000-600)),
                       n_samples = 1e6,
                       distribution = 'bernoulli')

      # This will retrieve the probability of A better than B.
      version_data[i, c] <- as.character(percent(round(as.double(summary(AB1)$probability), 2)))

    }

    c <- c + 1
    r <- r + 1

  }

  i <- i + 1

}

# Column names with spaces need backticks.
version_data <- version_data %>%
  dplyr::rename(Version = version,
                `Better Than Control` = better_than_cntl,
                `Better Than Alternate` = better_than_b) %>%
  dplyr::select(Version,
                `Better Than Control`,
                `Better Than Alternate`)

# In this case, we are just using a table with the results.
# This is given as a table of probabilities.
knitr::kable(version_data,
caption = 'Bayesian Version Comparison - Conversion Rates')

Version        Better Than Control   Better Than Alternate
A. Control     N/A                   0%
B. Alternate   95.0%                 N/A

And for comparison, here is what the table looks like with a sample size of 1000, keeping the conversion rate the same:

Version        Better Than Control   Better Than Alternate
A. Control     N/A                   36.0%
B. Alternate   69.0%                 N/A
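If you would rather not depend on a package, the same kind of probability can be sketched with base R alone. For conversion data, a Beta prior combined with the observed orders gives a Beta posterior, so you can draw from each version's posterior with rbeta and count how often the alternative comes out ahead. This is a simplified sketch, not a reproduction of the bayesAB run above: it uses the observed counts directly and a flat Beta(1, 1) prior, both of which are my assumptions, so the percentages will not match the tables exactly.

```r
set.seed(123)  # fixed seed so the probability does not keep changing

# Same hard-coded numbers as the larger example.
control_users    <- 20000
control_orders   <- 600
alternate_users  <- 20000
alternate_orders <- 700

# Flat Beta(1, 1) prior: an assumption for this sketch,
# not the prior used in the bayesAB example.
prior_alpha <- 1
prior_beta  <- 1
n_draws     <- 1e5  # number of posterior draws (illustrative)

# Posterior draws of each version's conversion rate.
control_post <- rbeta(n_draws,
                      prior_alpha + control_orders,
                      prior_beta + control_users - control_orders)
alternate_post <- rbeta(n_draws,
                        prior_alpha + alternate_orders,
                        prior_beta + alternate_users - alternate_orders)

# Probability that the alternative converts better than the control.
prob_alternate_better <- mean(alternate_post > control_post)
prob_alternate_better

# 95% credible interval for the difference in conversion rates.
quantile(alternate_post - control_post, c(0.025, 0.975))
```

Unlike the frequentist CI, the credible interval here can be read as a range that contains the difference with 95% probability, given the prior and the data.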

I really like getting into the data, programming and analysing the outcomes of A/B tests. As you can see, there are different ways of analysing A/B tests and displaying the results. There are different approaches to modeling the data, but they have similar outcomes.