
A/B Testing: An introduction

6.2.2024 | 27 minutes of reading time

This blog series aims to aid teams who are contemplating adding A/B testing to their toolkit but are unsure of which tool to use. In addition to helping with tool selection, the series also provides the entire team with a consistent initial understanding of A/B testing and offers a fundamental introduction to the topic itself. We’ll also go through the most common statistical methods used in A/B tests and show how each tool uses them for evaluating the results.

This is how the series of blog articles will be structured:

  • In this blog post we will begin with a general overview of A/B testing. We'll go through the main aspects and notions in A/B testing and analyse pros and cons. We'll also have a detailed overview of the most common statistical methods used to evaluate the results.
  • In the blog post(s) that will follow we will examine multiple A/B test tools: specifically, we will assess the tools by employing them to execute various test scenarios involving simulated user interactions. Unlike real A/B tests, where the actual underlying values are unknown, these simulated user interactions enable us to feed the tools with input metric values that we control.

What is an A/B test?

An A/B test is a method of comparing two variants (like two different versions of a webpage or email campaign) to determine which one performs better. Typically, a single metric is used to measure the performance in A/B testing. Most often, user-centric metrics such as engagement, satisfaction, or UI/UX usability are measured. Since the metric is crucial in the outcome of the whole method, it will be further discussed below.

An A/B test is mostly used inside a hypothesis-driven development process. Such a development process applies the scientific approach of:

  1. Idea generation
  2. Hypothesis creation
  3. Experimental design
  4. Experimentation
  5. Inference
  6. Continue to iterate with 1. or abort

In such a hypothesis-driven development process, A/B tests are just one form of experiment to validate a hypothesis. There are other ways to test a hypothesis that may even be faster to implement and bring better results. For example:

  • Examining existing data: when testing a hypothesis such as "shopping increases with a faster payment process", you can analyze existing data to establish a correlation between shopping volume and payment process speed. Once the hypothesis is confirmed, in subsequent iterations, you can generate ideas to further enhance the payment process speed.
  • Usability prototypes (for example paper prototypes or wireframes): these prototypes provide the capability to lead a customer or a group of customers through a simulated journey, presenting a modified sequence of screens and dialogues. This enables you to observe and discern the elements they appreciate and those they find unfavorable in the dialogue.

A/B tests can be further supplemented with qualitative research (e.g. interviews) to better understand the motives behind user behavior.

Pros and cons of A/B testing

A/B tests are more expensive compared to alternatives since they require an implementation of the variants which should scale to a large part of the production user base. A full implementation of all aspects of the variants is also needed: for instance, it is not enough to show the UI of the new search system, but the search itself has to be implemented too.

Moreover, a significant number of interactions or users is necessary to achieve statistical significance in outcomes. The required number depends on the current metric value and the expected difference between the variants. In a later section we will see how the required number of user interactions changes in different scenarios.

Some companies also see the risk that an A/B test might negatively influence their business because the change upsets or confuses the users or because the variation that is being tested is really worse than the current version.

On the contrary, if executed correctly, A/B testing enables us to impartially evaluate and compare the effects of a modification without additional limitations (like “a non-representative group of 10 people found variant B better”).

A/B testing may also be the only alternative when no other experiment type is possible or feasible. In practice A/B tests are often applied in various areas from application frontend variations, marketing email variations to ML (Machine Learning) models.

  • Frontend variations (layout changes, different orders of fields or choices, changes of font / color / size, …) have the issue that there is no strong science to estimate the effect of a modification. A/B tests are especially appealing here if the software changes require minimal effort.
  • The performance of ML models (for example for product recommendation or fraud detection) can change in a complex way when the model generation is changed (for example: new features in input, different tuning parameters, a different learning algorithm used). Here A/B tests may be the only way to estimate the end result from the users' perspective. Also here the effort to train and deploy a second model may be small compared to traditional software modifications.

Roles

In hypothesis-driven development processes, there are distinct roles that contribute to the success of the experiment. The experiment designer is responsible for selecting the appropriate type of experiment and designing it to effectively test the hypothesis. The evaluator analyzes the raw data from the experiment and translates it into insights that can be used to inform decision-making. The experimenter combines the responsibilities of the experiment designer and evaluator, and can be fulfilled by a variety of professionals such as data scientists, data analysts, or UX/UI designers. Finally, developers are responsible for implementing the necessary changes to the software, particularly in the case of A/B testing.

Aspects of an A/B test

In an A/B test the objects (e.g. the users) are split into disjoint buckets. This splitting is called assignment. Usually there are two buckets (labeled “A” and “B”) but there can also be more. To achieve statistical independence, the objects are randomly assigned to a bucket. Associated with each bucket is a variant: this is a specific version of something we want to test, like a button on a webpage, a different ML model or, in medical studies, a specific treatment.

In clinical studies the buckets (of patients) have special names: the group of patients who receive medication or a treatment is commonly referred to as the treatment group, while the group of patients who do not receive a treatment is called the control group. In business A/B tests, such as those conducted in e-commerce, the terms baseline or control are typically employed to refer to the currently implemented variant, while challenger denotes the proposed new version. Confusingly, in addition to “challenger” you may also come across the terms “variant” and “variation” (as in “measuring the baseline and the variant”). We will not use these terms in this way since we defined “variant” differently.

In most cases, the probability of assignment to a bucket (also referred to as weight) in an A/B test is equal (even split), typically resulting in a 50% allocation to each bucket in a two-bucket scenario. As a result, the final sizes of the buckets may either be exactly equal (which is uncommon) or closely resemble one another. However, it is also possible to deviate from the 50/50 split approach, for example if you are unsure about the new variant and want to minimise negative effects: an unequal allocation of traffic (e.g. 30/70 or 20/80) may help to reduce this risk, as only a minority of the traffic (30% or 20%) is exposed to a new, unknown variation. But note that the precision/power of an A/B test depends on the size of the smallest bucket. So in general using an even split leads to faster statistical significance in your experiment.

What has not been discussed so far is which objects are assigned to the buckets. The most common objects assigned are users, user interactions (like page views, shop visits, etc.), sessions or API requests. Very important here is that the bucket assignment of users and sessions has to be consistent. So, for example, a user should see the same frontend variant (corresponding to her bucket) even when she refreshes the browser or logs out and logs in again. Achieving consistency can be accomplished by retaining the initial random bucket selection in a data store or by utilizing consistent hashing, an ingenious technique that eliminates the need for data storage and which will be covered later in this blog post. We will often use the term users to represent the objects assigned since this is the most common use case.

As part of the experiment design, it is essential to determine the user base that will participate in the test. The test can include all users of a platform, but it is also possible to select a subset based on specific criteria. So an experiment setup may be: show 50% of the users from France a new red button, while the other 50% should see a new blue button and users from all the other countries should see the default black button.

Test types

In addition to the common A/B test there are two other test types:

A so-called A/B/n test is used to test a wider range of options when more than two buckets are needed. This is usually chosen when you don't know which variant is the most desired and the production of variants is cheap. The buckets may or may not have equal assignment probability.

In an A/A test there are two buckets which are assigned to the same variant (resulting in equal behavior of all users). This is commonly used to identify issues in the data pipeline during the introduction of an A/B test tool or after major changes in the data pipeline. Such an A/A test (with 50/50 split) will:

  • verify that the assignment produces two buckets of similar size. If this is not the case, a Sample Ratio Mismatch (SRM) has occurred (a simple check for this is sketched after this list).
  • verify that the assignment is random and independent by looking at the metric which should also be similar.
  • verify that the statistical part of the A/B test tool works: the tool should not detect any statistically significant differences between the two buckets
  • give you an idea of what the baseline metric value is before you introduce a new variant
  • give you information about how many interactions happen in a given time to help later with the estimation of the experiment run time
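A common way to check for a Sample Ratio Mismatch is a chi-square goodness-of-fit test on the observed bucket sizes. This is a minimal sketch with illustrative counts (for a 50/50 split the expected counts are simply half of the total per bucket); the alert threshold of 0.01 is a common but arbitrary choice:

from scipy.stats import chisquare

# observed users per bucket in an A/A test (illustrative numbers)
observed = [10123, 9877]
expected = [sum(observed) / 2] * 2  # expected counts for a 50/50 split

_, p_value = chisquare(f_obs=observed, f_exp=expected)
# a strict threshold is often used for SRM alerts; treat the exact value as a choice
if p_value < 0.01:
    print("Possible Sample Ratio Mismatch (p = {:.4f})".format(p_value))
else:
    print("No evidence of a Sample Ratio Mismatch (p = {:.4f})".format(p_value))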

There are advanced experiment setups like the Multi-Armed-Bandit, but this will be covered in a later blog post.

More about metrics

Metrics are the way to judge the variants. So in this regard they have to be aligned with (other) business metrics; otherwise there is a risk of spending a lot of time and effort on the wrong goal. Some companies set a so-called North Star Metric, which captures the product's essential value to the customer, defining the relationship between customer problems and revenue. Growth-optimization activities then center around this metric, providing direction for long-term, sustainable growth throughout the customer lifecycle.

Management business metrics like sales and profit are usually not suitable as an A/B test metric because they lag by days or weeks and also have multiple other influences that cannot be controlled. So usually a more technical metric is chosen which can be calculated quickly (like once per day), is only influenced by the application under test and is a proxy for the management business metrics in some way. This is commonly a conversion rate (obtained by dividing the number of users who take a positive action in a specific part of the software by the number of all users visiting it). Other metrics are, for example, basket size, average item price in the basket or average sale per customer. We will frequently use conversion rate as the metric in the examples that follow, but it's important to note that other metrics may also be applicable.

It is usually effortless to identify several metrics that require optimization. However, it is generally preferable to concentrate on a single metric while other metrics can be included as guardrail metrics to ensure that they are not negatively impacted during A/B testing. Consider an example: if your North Star Metric is the number of purchases, you might attempt to boost this figure by either lowering the price (even though profits may decrease) or by reducing both the price and the quality of products (while this maintains the profit, it will likely increase the return rates and decrease customer satisfaction). In this scenario, it becomes crucial to incorporate profit, return rate, and customer satisfaction as guardrail metrics to ensure a comprehensive evaluation of the overall impact.

Another factor that must be taken into account when selecting a metric is the reliance on consistent object assignment into buckets. For instance, if the bucket assignment is determined based on the session, then a user who logs out and logs back in would be treated as a separate session and potentially placed in a different bucket, resulting in varied behavior. It is evident that a metric such as monthly customer profit, which is computed per user and not per session, would be inappropriate, since the metric should provide insight into the value of one variant and would not be useful if the user was exposed to multiple variants during the month. Of course if the bucket assignment is done with the most strict object type (user) you are free in the selection of metrics. But for technical reasons this may require large efforts and a session or request based approach might be preferable.

Statistics

Think about the following scenario: after running a test for one week you look at the metrics for the two variants and see 5% in the baseline and 5.5% in the challenger variant. You see that the challenger variant is clearly better (by 10%, no less): you end the test and communicate the outcome.

Is there something wrong with this scenario? Yes, the main problem is that there is a risk that the difference in the metric values is caused purely by chance and not by a difference in user behavior (which is what we wanted to measure). If the difference is caused by chance, we may select the worse performing variant and therefore cause a negative long-term impact on our metric and business. In more statistical terms, you have to make sure that the difference between the variants is statistically significant. Most of the A/B testing tools on the market can help with this evaluation by delivering an overview of the results with some statistical information. The statistical methods used might differ, though. In the following sections we will go through the two main “schools of thought” among statisticians: the Frequentist and the Bayesian approach. For each approach there are many tutorials and blog posts online that you can follow (see, for example, ab-testing-with-python, bayesian-ab-testing-in-python, bayesian-in-pymc3). To clarify the concepts we will also use an example of an A/B test and explore the evaluation methods using Python.

Frequentist approach

This is the most “classical” approach to evaluate the statistical significance of your A/B test results and it uses only output data from your experiments in addition to the metric value of the baseline (for sample size estimation).

There are different types of tests that can be used in the frequentist case. Subsequently, we will walk through the primary steps of a two-tailed, two-sample t-test, which essentially examines whether there is a significant difference in a given metric between two variants (two-sampled) in two directions (two-tailed), either positive or negative. There are tools (like Analytics-Toolkit) that prefer to use a one-tailed test: as the name suggests this measures the difference only in one direction (e.g. is the challenger variant better than the baseline?) and this is suitable in many A/B-test scenarios as one would usually act (e.g. by implementing the new variant) only if a difference is found in a specific direction. However, there might be situations in which knowing the direction of your outcome is important (e.g.: a new version of your site is increasing or decreasing users' interactions) and the tools we will analyze use a two-sided approach in the frequentist case.
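As a small, self-contained illustration of the difference between the two alternatives (using hypothetical click counts, not data from any of the tools discussed later), a z-test for two proportions from statsmodels can be run with either a two-sided or a one-sided alternative:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

clicks = np.array([470, 410])          # hypothetical clicks: challenger, baseline
impressions = np.array([8240, 8200])   # hypothetical impressions per bucket

_, p_two_sided = proportions_ztest(clicks, impressions, alternative='two-sided')
_, p_one_sided = proportions_ztest(clicks, impressions, alternative='larger')

print("two-sided p-value: {:.3f}".format(p_two_sided))
print("one-sided p-value: {:.3f}".format(p_one_sided))  # roughly half the two-sided value here

When the observed effect points in the tested direction, the one-sided p-value is about half of the two-sided one, which is why one-tailed tests reach significance with fewer users.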

Assume you want to run an A/B test on your website to see if a new version (the challenger variant) of the “buy-now” button on the product page is better than the current version (the baseline variant). You might decide to split your users randomly into two roughly equal sized buckets and to show the baseline variant to all users from the first bucket and the challenger variant to all other users. You use the click-rate (the number of clicks on the button divided by the number of impressions, meaning the number of times the button is shown) as the metric for the A/B test.

In the Frequentist approach the null hypothesis (i.e. there is no significant difference between the click-rates of the two variants) is tested against an alternative hypothesis (i.e. the two variants have different click-rates). The main steps for this test are:

STEP 1: Set a maximal threshold (alpha) for the p-value for the significance of your test.

As we mentioned above, we want to be confident that if we see a difference between the click-rates of the variants, it is not due to chance. In more statistical words, we want the p-value of the test to be below a certain threshold: the p-value is the probability of observing a difference (measured via a test statistic, see step 4 for a definition) at least as high as the one observed, assuming the two variants are statistically equal (i.e. assuming the null hypothesis is true).

So we fix a maximum threshold (this is usually set to 0.05) and we will make sure the p-value of our test is below this threshold before we reject the null-hypothesis. This threshold is commonly called the alpha level.

alpha = 0.05  # threshold for the p-value

STEP 2: Calculate the sample size

Obviously you want to see results as soon as possible but to be able to see a significant difference between the variants you have to make sure you have enough objects in your test. The estimation of the number of objects required is called power analysis and it depends on:

  • the so-called power of the statistical test, which indicates the probability of finding a difference between the variants assuming that there is an actual difference
  • the threshold alpha set above
  • the minimum detectable effect (MDE): how big the difference between the variants should be, expressed as a relative (percentage) change

In the example above, let’s assume that we have measured a conversion rate of 5% on the baseline variant in a given period of time. With the new variant we would like to reach at least 6% (which means an MDE of 20%). How many users do we need in each bucket to reach this in a confident manner?

In the following code example we calculate this using Python and the statsmodels library. Alternatively you can also use different online tools to get this estimate easily.

from statsmodels.stats import proportion, power

# conversion rate observed on the current variant
cr = 0.05
# minimum detectable effect (relative lift we want to be able to detect)
mde = 0.2
# conversion rate we would like to reach with the new variant
expected_cr = cr * (1 + mde)

print("MDE: {:.2f}".format(mde))
print("Expected click rate: {:.2f}".format(expected_cr))

# calculate the effect size between the two proportions (Cohen's h)
effect_size = proportion.proportion_effectsize(cr, expected_cr)

nr_users = power.TTestIndPower().solve_power(
    effect_size=effect_size,
    nobs1=None,
    alpha=alpha,
    power=0.8,  # standard value for the power of the test
    ratio=1.0
)
print(f"Nr of users needed in each bucket: {round(nr_users)}")

Which will output:

MDE: 0.20
Expected click rate: 0.06
Nr of users needed in each bucket: 8144

Notice that power.TTestIndPower().solve_power needs some input parameters that need to be estimated: we set the power to 0.8 as this is a common value for such statistical tests. To estimate the effect size between the conversion rates (i.e. the magnitude of the difference) we use proportion.proportion_effectsize, which implements Cohen's h formula.
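For reference, Cohen's h is simply the difference of the arcsine-transformed square roots of the two proportions. A minimal sketch (assuming only numpy and statsmodels) to double-check the value returned by proportion_effectsize:

import numpy as np
from statsmodels.stats import proportion

def cohens_h(p1, p2):
    # Cohen's h: difference of the arcsine-transformed proportions
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

print(cohens_h(0.05, 0.06))                          # approx. -0.044
print(proportion.proportion_effectsize(0.05, 0.06))  # should match the value above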

In a later section you will see how the sample size required changes based on different baseline click rates and MDE. You will notice that the smaller the expected difference, the larger the sample size needed: this means that to detect smaller MDE your test has to run longer.

Notice that the results given by TTestIndPower with effect_size from statsmodels differ from the sample sizes you can get using Evan Miller's online calculator (the difference becomes smaller as the needed sample size increases). This is due to a different assumption about the standard deviation to use under the null hypothesis (statsmodels uses a pooled estimate while Evan Miller's calculator uses the standard deviation of the baseline). For more details have a look at the discussion on Stack Overflow. In general we observed differences between different online calculators (check, for example, the Optimizely section “why is your calculator different from other sample size calculators?”).
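To see where this assumption enters, here is a minimal sketch of the classical closed-form sample-size formula for comparing two proportions. We use the unpooled variance p1(1-p1) + p2(1-p2) here purely for illustration; swapping in a pooled or baseline-only variance at that line is exactly the kind of choice that makes the different calculators disagree slightly:

from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    # z-quantiles for the two-sided alpha level and the desired power
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    # variance assumption: unpooled here; other tools pool or use only p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

print(round(sample_size_two_proportions(0.05, 0.06)))  # in the same ballpark as 8144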

In the following sections we will use the results from the Python code from above since it is easier to inspect and understand how the numbers are calculated compared to the online tools mentioned.

Based on the sample size and the frequency of interactions, you can make a first estimate of the running time of your A/B test: for example, if roughly 1,000 eligible users per bucket arrive each day, the ~8,144 users required per bucket translate into a run time of about eight to nine days.

STEP 3: Run the test (activate the splitting into buckets, serve the two different variants,...) and wait till the number of interactions has reached the calculated sample size.

Be aware that waiting till the sample size is reached is essential for the frequentist evaluation to work properly! The underlying issue is called the peeking problem and it occurs when the A/B test is stopped as soon as a satisfying “significance” (i.e. a low p-value) is reached, before the sample size calculated above is obtained. Indeed the p-value can vary during the duration of the test and might reach low values even if there is no difference between the variants.
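To get a feeling for this, here is a minimal simulation sketch (our own illustration, assuming two identical variants with a true click-rate of 5%, so any “significant” result would be a false positive) that repeatedly peeks at the running p-value while data accumulates:

import numpy as np
from statsmodels.stats import weightstats

rng = np.random.default_rng(7)
# two buckets with the SAME true click-rate: there is nothing to detect
a = rng.binomial(1, 0.05, size=20000)
b = rng.binomial(1, 0.05, size=20000)

# "peek" at the running p-value every 1000 users per bucket
p_values = []
for n in range(1000, 20001, 1000):
    _, p, _ = weightstats.ttest_ind(a[:n], b[:n], alternative='two-sided', usevar='pooled')
    p_values.append(p)

print("Smallest p-value seen while peeking: {:.3f}".format(min(p_values)))
# repeated peeking inflates the chance that this minimum drops below 0.05
# even though the two variants are identical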

STEP 4: Accept/Reject the null-hypothesis depending on the p-value as calculated by a t-test and the threshold alpha from the first step.

Finally, once you run the test and collect the results you are ready to see if there is an actual difference between the variants. In an A/B testing scenario where two variants are compared this is usually achieved via an (independent) two-sample t-test. This test delivers two values:

  • test statistic: this is the difference between the means of the two groups divided by some standard error. In our case, the means would be the two click-rates.
  • p-value (see explanation before): based on this and on the alpha level set at the beginning we can accept/reject the null hypothesis

Let’s continue with the scenario of the A/B test for the new “buy-now” button introduced above. Assume we stop the test and we have the minimum sample size required from the calculation above: 8200 users in the bucket of the baseline variant and 8240 in the bucket of the challenger. We will make a simulation using numpy arrays of 1 (user click) and 0 (user did not click) and the actual click rates of 5% and 5.7%:

import numpy as np

seed = 42
np.random.seed(seed)

def create_random_rawdata(users, clicks):
    """Generate an array of zeros and ones having:
       length = users and clicks random ones"""
    rawdata = np.array([1] * clicks + [0] * (users - clicks))

    # Shuffle the data
    np.random.shuffle(rawdata)
    return rawdata

from statsmodels.stats import weightstats

# number of users for each variant
users_baseline = 8200
users_challenger = 8240

# get the number of clicks from the conversion rate
clicks_baseline = round(users_baseline * 0.05)
clicks_challenger = round(users_challenger * 0.057)

# create fake data of user interactions using randomly generated arrays of 1s and 0s
data_baseline = create_random_rawdata(users_baseline, clicks_baseline)
data_challenger = create_random_rawdata(users_challenger, clicks_challenger)

# calculate the test statistic and p-value
tstat, p, _ = weightstats.ttest_ind(
    x1=data_challenger,
    x2=data_baseline,
    alternative='two-sided',
    usevar='pooled',
    weights=(None, None),
    value=0
)
print("click-rate baseline: {:.2f}%".format(np.mean(data_baseline) * 100))
print("click-rate challenger: {:.2f}%\n".format(np.mean(data_challenger) * 100))

print("Alpha-level: {}\n".format(alpha))
print("T-test statistics: ")
print("p-value: {:.2f}".format(p))
print("tstat: {:.2f}".format(tstat))

Which will output:

click-rate baseline: 5.00%
click-rate challenger: 5.70%

Alpha-level: 0.05

T-test statistics: 
p-value: 0.04
tstat: 2.00

The p-value of 0.04 is slightly below our threshold alpha (0.05) set at the beginning. Therefore we can reject the null hypothesis and conclude that there is a significant difference between the new version of the “buy-now” button and the current one.

However, the observed relative lift is only 14%, which is below the MDE of 20% that we set at the beginning.

It is also possible to avoid using Python and use an online calculator instead.

Bayesian approach

This other approach of evaluating A/B test results is used in some of the newest A/B testing tools and is based on the use of a prior which is then adjusted as you collect data from the results of the test. This approach can be useful when the expected difference between variants is so small that you would need a lot of user interactions to achieve statistically significant results. 

In the Bayesian approach the main idea is to estimate the probability that the metric of the challenger is better than the baseline. This is commonly defined as chance to beat control or probability to be best.

Different tutorials can be found online where the Bayesian approach is explained with examples using Python. However, for the sake of completeness, we will go through the main steps of estimating this probability value by following the same example we used in the frequentist approach. 

STEP 1: Set a minimum threshold for the probability to be best

We will see in the last point how to calculate the probability that the challenger is better than the baseline based on the current test data. In order to make a safe decision at the end, we need to set a minimum probability we want to reach. A common value for this threshold is 95%.

STEP 2: Choose the prior distribution for the conversion rate

The idea is to find a “good” function to model the probability of observing a certain conversion rate CR. As the conversion rate takes values between 0 and 1, the beta distribution is a good candidate (as it is defined on the interval [0,1]). Its probability density function (PDF) is defined as:

f(x; a, b) = x^(a-1) * (1 - x)^(b-1) / B(a, b),   for x in [0, 1]

where B(a, b) is the beta function that normalizes the distribution.

As you can see from the formula, the probability density function depends on two parameters a and b that we need to choose. If you have no prior information on your conversion rate, a good choice is a=b=1, which is a “flat” prior distribution.  Notice that the prior distribution chosen depends on the type of metric you want to analyse. For continuous metrics such as revenue, order value, etc. other probability distributions are considered (e.g. gamma distribution).

STEP 3: Calculate the posterior probability of the conversion rates

Once you start collecting data from the test, you can “update” your estimation of the probability of observing a certain conversion rate. The posterior distribution can be computed from the prior defined above via Bayes' rule. So assuming we have collected n user interactions and c clicks for a specific variant, we can adapt the probability density function by updating the parameters a, b with the data we have collected as follows:

a_posterior = a + c
b_posterior = b + (n - c)

The following Python code generates a graph of the two posterior PDFs:

import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=seed)
sim_size = 20000

# extract the total nr of users and clicks per variant from the data above
users_baseline = data_baseline.size
clicks_baseline = data_baseline.sum()

users_challenger = data_challenger.size
clicks_challenger = data_challenger.sum()

# set a, b for the prior
a = 1
b = 1

# update a and b for the posteriors of baseline and challenger
a_bl = a + clicks_baseline
b_bl = b + (users_baseline - clicks_baseline)

a_ch = a + clicks_challenger
b_ch = b + (users_challenger - clicks_challenger)

# sample from the posterior distributions f(x, a+c, b+(n-c))
posterior_bl = rng.beta(a_bl, b_bl, size=sim_size)
posterior_ch = rng.beta(a_ch, b_ch, size=sim_size)

# plot the distributions
plt.figure()
plt.title("Posterior PDF of baseline and challenger CR")
sns.kdeplot(posterior_bl, color="blue", label="Baseline")
sns.kdeplot(posterior_ch, color="orange", label="Challenger")
plt.legend()
plt.show()

We can then use these posteriors to calculate how likely it is that the challenger is going to be better than the baseline in general.

STEP 4: Compute the probability that the challenger is better than the baseline

Now that we have the distributions of baseline and challenger metric, we can estimate the probability of the challenger being better than the baseline by checking how often the distribution of the challenger is higher than the baseline. 

ch_bl = posterior_ch > posterior_bl
print("Estimation of the probability that the challenger is better than the baseline: {:.2f}%".format(
    np.mean(ch_bl) * 100))

Which will output:

Estimation of the probability that the challenger is better than the baseline: 97.58%

As the probability to be best is (slightly!) higher than the minimum threshold we set in step 1 (95%), we can be confident that the challenger will perform better if we choose it. So we might decide to update our “buy-now” button with the new version from the challenger variant.

Notice that all we did so far can be easily done with the help of Python modules like bayesian-testing or PyMC3 (the latter has more functionality as it is designed for general Bayesian analysis). For example, with bayesian-testing we can get an evaluation summary as follows (which also returns the value of 97.58% from above):

import pprint
from bayesian_testing.experiments import BinaryDataTest

bayesian_test_agg = BinaryDataTest()

bayesian_test_agg.add_variant_data_agg(name="baseline",
                                       totals=users_baseline,
                                       positives=clicks_baseline,
                                       a_prior=a,
                                       b_prior=b)

bayesian_test_agg.add_variant_data_agg(name="challenger",
                                       totals=users_challenger,
                                       positives=clicks_challenger,
                                       a_prior=a,
                                       b_prior=b)

pprint.pprint(bayesian_test_agg.evaluate(sim_count=sim_size, seed=seed))

Which will output:

[{'expected_loss': 0.0070563,
  'positive_rate': 0.05,
  'positives': 410,
  'posterior_mean': 0.05011,
  'prob_being_best': 0.02415,
  'totals': 8200,
  'variant': 'baseline'},
 {'expected_loss': 3.43e-05,
  'positive_rate': 0.05704,
  'positives': 470,
  'posterior_mean': 0.05715,
  'prob_being_best': 0.97585,
  'totals': 8240,
  'variant': 'challenger'}]

What we presented here is the more classical Bayesian approach based on the probability of being the best. An alternative approach which is sometimes used is based on the so-called HDI+ROPE decision rule. See this paper for more details.
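As a rough illustration of the idea (not the exact procedure from the paper), the following sketch approximates a 95% credible interval of the difference between the two posteriors with a simple percentile interval and compares it against an arbitrarily chosen ROPE of ±0.1 percentage points around zero:

# difference of the posterior samples computed earlier
diff = posterior_ch - posterior_bl

# central 95% credible interval as a simple stand-in for the HDI
lower, upper = np.percentile(diff, [2.5, 97.5])
rope = (-0.001, 0.001)  # illustrative "region of practical equivalence"

print("95% credible interval of the CR difference: [{:.4f}, {:.4f}]".format(lower, upper))
if lower > rope[1]:
    print("Interval lies entirely above the ROPE: adopt the challenger")
elif upper < rope[0]:
    print("Interval lies entirely below the ROPE: keep the baseline")
else:
    print("Interval overlaps the ROPE: no decision yet")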

Comparing the two approaches with the provided example: with the frequentist approach the result is only borderline significant (p = 0.04) and the observed lift of 14% is below the MDE of 20% the test was powered for, so we would be cautious about adopting the new variant. With the Bayesian approach, on the other hand, the probability of the challenger being better than the baseline is above the threshold we set: so if we are willing to accept the resulting effect size (14%), we can be confident in switching to the new variant.

Pros and cons of Bayesian vs. Frequentist

In the Frequentist approach we have to wait until we reach the established sample size based on the MDE we want to reach (which also has to be guessed/estimated). The Bayesian approach, on the other hand, can reach conclusions faster. In the table below we have collected different simulation outcomes using the code from the examples in the previous sections: for different baseline and challenger conversion rates, we have calculated the sample size needed for the Frequentist approach and the corresponding sample size in the Bayesian approach to reach a probability of being best of ~96% (slightly above the threshold).

Minimum sample size for different click-rates of the baseline and MDE:

Baseline CR | Challenger CR | MDE | Frequentist Sample Size | Bayesian Sample Size | Ratio
3%          | 3.2%          | 6%  | 117 857                 | 42 297               | 2.78
3%          | 3.3%          | 10% | 53 183                  | 18 743               | 2.83
5%          | 5.5%          | 10% | 31 218                  | 11 137               | 2.80
5%          | 6%            | 20% | 8 144                   | 2 859                | 2.84

Another advantage is that the Bayesian approach requires no estimation of the MDE, which means one less parameter to estimate or guess. On the other hand, in the Bayesian case one has to select a matching probability distribution for the metric.

            | Pro                                                        | Cons
Frequentist | no prior assumption needed, only data from the test used   | estimation of sample size needed and higher risk of peeking
Bayesian    | reaches a conclusion faster                                | need to choose a probability distribution (like beta or gamma) for the metric

Technical details about bucket assignment

One central technical aspect is the bucket assignment function: it takes some identifier (user-id, session-id or request-id) and outputs the bucket-id to use. There are usually two options to implement such a function in A/B testing tools (a sketch of the second option follows after the list):

  1. using a persistent data store: query the data store and if no entry is found, draw a random number between 0 and 1, assign the user based on the random number and the bucket probabilities, and store the bucket-id in the persistent data store. For web applications, the persistent data store is located at some server with some local cache because the assignment doesn’t change afterwards. One drawback is the added latency for the initial case when the server needs to be contacted.
  2. consistent hashing: hash the user-id and convert the hash to a number between 0 and 1 (this is now pseudo-random), assign the user based on this number and the bucket probabilities, and return the bucket-id. This variant doesn't need a database or server interaction but would break as soon as a second experiment is run: in the new experiment each user would be put into the same buckets again. To avoid this, the hash is computed from the user-id and the experiment-id.
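A minimal sketch of the consistent-hashing option, assuming illustrative bucket names, weights and IDs (a real A/B testing tool will use its own hashing scheme and configuration):

import hashlib

def assign_bucket(user_id: str, experiment_id: str, buckets: dict) -> str:
    """Deterministically assign a user to a bucket; weights must sum to 1."""
    # hash the user-id together with the experiment-id so different experiments
    # produce independent assignments
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    # convert the hash to a pseudo-random number between 0 and 1
    value = int(digest, 16) / 16 ** len(digest)
    cumulative = 0.0
    for bucket_id, weight in buckets.items():
        cumulative += weight
        if value < cumulative:
            return bucket_id
    return bucket_id  # guard against floating point rounding

# illustrative 50/50 split; the same user always lands in the same bucket
print(assign_bucket("user-123", "buy-now-button-v2", {"baseline": 0.5, "challenger": 0.5}))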

Conclusion & next blog post

In conclusion, we have explored a comprehensive overview of A/B testing in this blog post, delving into its fundamental aspects and concepts. Our analysis encompassed a thorough examination of the advantages and disadvantages associated with A/B testing, accompanied by a detailed exploration of the prevalent statistical methods utilised for result evaluation.

In the upcoming post, our focus will shift to a detailed examination of GrowthBook as the initial tool in our series. This will include a description of how we evaluate such tools. Stay tuned for insights into optimising your experimentation processes!
