
A/B Testing: Tool support and testing GrowthBook

18.3.2024 | 19 minutes of reading time

In the previous blog post we introduced some general concepts of A/B testing: we explored the main aspects, defined test types and explained the most common statistical methods. 

Now we want to explore the areas in which A/B testing tools can provide assistance with setup, execution and statistical analysis. The general idea is to test and compare tools that are common on the market through practical application, by simulating various A/B testing scenarios.

In this blog post we will start with a general section on tool support and proceed by testing our first tool, GrowthBook. We start with GrowthBook because it is open source, easy to install, and one of our clients chose it as the A/B testing tool for their experiments.

Tool support

Tools can aid in experiment design (and setup), experiment execution and statistical analysis. While all of these steps (except experiment execution) can be performed “manually”, an A/B testing platform can simplify processes, increase efficiency and reduce the likelihood of human error. Furthermore, tool support enables scaling, making it feasible to conduct dozens of experiments in parallel, and it reduces the experimenter's dependency on developers to adjust the software or make data available.

In general, an A/B testing tool will provide the following functionalities:

  • model one or more experiments by specifying the number of buckets and the probability of assignment to a bucket
  • integrate with the application under test to perform bucket assignment and collect data (the metric to optimize for, the total number of users, the bucket assignment). This is usually done with a Software Development Kit (SDK)
  • start and stop experiments
  • examine running experiments
  • support in the statistical evaluation of the results

Evaluating A/B testing tools

In this section we will introduce our approach for evaluating and comparing different A/B testing tools. The main idea is to build a web application where two variants (baseline and challenger) are implemented and different A/B testing scenarios can be analyzed with the help of a tool.

The dummy application

To do any evaluation, we need an application that we can test. As web pages / web applications are the most common use case for A/B tests and are supported by all relevant A/B testing tools, we created a simple dummy web page simulating a landing page with a subscribe button, which uses the tool-specific SDK.

In addition, we created a driver (using Selenium and Python) which simulates the users. We will use the click rate as the metric to optimize for.

As this is a dummy page, the only difference between the variants is the text after the list, which is removed in the challenger.

Baseline and challenger variants for the dummy page.

The driver thus knows the real conversion rates in addition to the number of users to simulate. It reacts to the bucket assignment made by the tool's SDK in order to reach the desired conversion rates. We (the experimenter) then use the tool to compare the displayed values with our expectations.
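A minimal sketch of such a driver could look like the following. The element id, the way the assigned bucket is exposed to the driver and the conversion rates are assumptions for illustration; the real driver differs in its details.

import random
from selenium import webdriver
from selenium.webdriver.common.by import By

# The "real" conversion rates the driver tries to reach per bucket (illustrative values)
TARGET_RATES = {"baseline": 0.05, "challenger": 0.06}

def simulate_user(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Hypothetical hook: the page exposes the bucket chosen by the SDK,
        # e.g. via a global JavaScript variable set once the features are loaded
        # (explicit waiting/retry logic is omitted here)
        variant = driver.execute_script("return window.assignedVariant;")
        # Click the subscribe button with the probability configured for this bucket
        if random.random() < TARGET_RATES.get(variant, 0.0):
            driver.find_element(By.ID, "subscribe-button").click()
    finally:
        driver.quit()

for _ in range(1000):
    simulate_user("http://localhost:8000/")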

Evaluation aspects

Our goal is to evaluate the tools mainly in the following areas:

  • Statistical evaluation: presentation and reliability of the test results
  • Developer experience: integration of the SDK into the existing application

Any other aspect relevant to our evaluation will also be considered.

Testing scenarios

With the setup described above we have the ability to compare the interactions recorded by the driver with the ones displayed in the tool and additionally examine the interpretation of the results.

To make sure we cover different aspects of A/B testing, we will consider multiple scenarios:

  • A/A Test Scenario. As we mentioned in the previous article, A/A tests can be useful to verify that the tools are working as expected. In our case, for the A/A test we will use the same conversion rate of 5% for both buckets. We will use a sample size of 8 144 users per variant, as in the first A/B test scenario described below.
  • A/B Test Scenarios. Following the table with the different conversion rates and sample sizes introduced in the previous article, we will evaluate the A/B testing tools in these two cases (a sketch of how the frequentist sample sizes can be reproduced follows this list):
    • Baseline with 5% conversion rate. Challenger with 6% conversion rate. This gives an MDE (minimum detectable effect, i.e. the percentage change between conversion rates) of 20% and results in a sample size of 8 144 per variant in the frequentist approach and of 2 859 per variant in the Bayesian approach (to reach a significant chance to beat control).
    • Baseline with 3% conversion rate. Challenger with 3.3% conversion rate (MDE of 10%). This results in a sample size of 53 183 per variant for the frequentist approach and of 18 743 for the Bayesian approach (to reach a significant chance to beat control).
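As a side note, the frequentist sample sizes above can be reproduced approximately with statsmodels' power calculations. This is only a sketch of the calculation (the exact numbers depend on the approximation used), not the way the tools derive their values:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def sample_size_per_variant(cr_baseline, cr_challenger, alpha=0.05, power=0.8):
    """Frequentist sample size per variant for a two-sided test on two proportions."""
    effect_size = proportion_effectsize(cr_challenger, cr_baseline)  # Cohen's h
    return NormalIndPower().solve_power(effect_size=effect_size,
                                        alpha=alpha,
                                        power=power,
                                        ratio=1.0,
                                        alternative='two-sided')

print(sample_size_per_variant(0.05, 0.06))   # roughly 8 100 per variant
print(sample_size_per_variant(0.03, 0.033))  # roughly 53 300 per variant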

GrowthBook: Introduction

GrowthBook is an open source A/B testing tool which can be self-hosted, although a managed (hosted) variant is available. You manage experiments in GrowthBook, and the SDK performs the assignment into buckets based on consistent hashing. GrowthBook does not provide its own data storage for bucket assignments or metrics, but it offers broad support and a nice interface for importing this data from multiple source types.
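The idea behind the hash-based bucket assignment can be sketched roughly as follows (an illustration of the principle, not GrowthBook's actual algorithm):

import hashlib

def assign_bucket(attribute_value, experiment_key, weights):
    """Deterministically map a user attribute (e.g. a session-id) to a bucket.
    The same input always lands in the same bucket, so no assignment state
    has to be kept on the server."""
    digest = hashlib.sha256(f"{experiment_key}:{attribute_value}".encode()).hexdigest()
    # Map the hash onto the interval [0, 1)
    position = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for bucket, weight in enumerate(weights):
        cumulative += weight
        if position < cumulative:
            return bucket
    return len(weights) - 1

# A 50/50 split between baseline (bucket 0) and challenger (bucket 1)
print(assign_bucket("session-4711", "subscribe-button-test", [0.5, 0.5]))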

Use cases

Like other A/B testing tools, GrowthBook allows developers to control the visibility of specific variations to different segments of users using Feature Flags.

There are three main use cases in which GrowthBook’s functionalities can be applied:

  1. Full Experimentation Platform:
    • Employ feature flags and GrowthBook's SDKs to conduct experiments within an application
    • Analyze experiment results using GrowthBook's Experiment Analysis feature to determine the most successful variant
    • This approach is suitable for companies new to experimentation or seeking a comprehensive switch from existing practices
  2. Feature Flags Only: Exclusively utilize GrowthBook Feature flags within an engineering team. This approach is:
    • Suitable for companies with insufficient traffic for full experiments but still desiring feature flags’ benefits
    • Suitable for companies preparing for future experiments by adding GrowthBook controlled feature flags to their application now
  3. Experiment Analysis Only: Use GrowthBook exclusively to enhance and automate the analysis process of experiments that are already implemented and running. This approach is:
    • Particularly beneficial for companies with established experiment procedures, aiming to save time and enhance decision-making.
    • Ideal for those currently relying on home-built reporting systems or self-configured data analytics reports like Jupyter notebooks.

In the next sections we will focus on the first use case (Full Experimentation Platform).

General design

This is how GrowthBook works as a full experimentation platform:

The application under test uses the GrowthBook SDK (available for 11 programming languages) to contact the GrowthBook Server and fetch the experiment setup, i.e. the buckets and their corresponding assignment probability (weight). While the experiment is running, the application under test has to write the bucket assignment information and any data needed to compute the goal metric (for example the conversion rate) to the Tracking Database.

The GrowthBook Server is used by the experimenter to set up the experiments but also to examine the results. To calculate the results, GrowthBook will import data from the Tracking Database.

For the Server you can choose between the managed GrowthBook Server as in the picture above (requires payment) and hosting your own server (open source). For the Tracking Database you can use a basic SQL database (which you may host yourself as part of your application) or import the data from specialized tracking systems like Google Analytics.

Experiment setup

We will skip initial setup steps like installation or user management; for more information on these, have a look at the GrowthBook documentation. We will touch on another important preparation step (defining the import of data from the tracking database and how the target metric is calculated) below. To launch and evaluate an A/B test with GrowthBook, the following steps are usually taken:

Step 1: Add a new feature

In this step you have to define everything required to make a bucket assignment decision. This includes:

  1. A Feature Key for the feature
  2. A Value Type like boolean, string or number
  3. An Attribute to enable consistent hashing like user-id or session-id
  4. The variations with their values and weights, which define the traffic split. In addition, other advanced features (limit overall exposure, limit to a subset of users, limit to certain environments, etc.) can be configured here.
    Adding a Feature that defines the color of a button on your page.

Step 2: Integrate SDK into application

Change the application under test to include the SDK, query the status of the feature during runtime and react in different ways to the feature value. Deploy the application. More details about this will be given in a section below.

Now the data for your A/B test should start to be collected! However, a further step is needed to be able to analyze and visualize the results with GrowthBook:

Step 3: Define an experiment

This is a manual step since features and experiments are split in GrowthBook. 

Add an Experiment for an existing Feature.

Notice that there are differences in design between Features and Experiments: this is likely caused by the desire to support the various use cases outlined above. However, the split is still somewhat confusing, and we hope that the developers will refine this aspect in future iterations.

Non-covered Features

In addition to the general properties outlined above, GrowthBook offers further features which are not covered here.

Statistical model support

As stated in the documentation, GrowthBook provides both frequentist and Bayesian stats engines, although the latter is preferred and strongly recommended by the tool's vendors for its advantages of requiring fewer samples and offering easier interpretation. For the Bayesian approach in the case of binomial data (like the click/no-click on a button in our experiment), it uses a beta distribution with parameters a=b=1 (uninformative prior). At the moment, using an informative prior is not supported. As soon as data comes in (number of clicks, number of users), it updates the distribution to get the posterior (see the Bayesian Approach section in our previous post for how the posterior is calculated). For metrics like count, duration or revenue, GrowthBook uses a normal prior.

For the final evaluation and decision making, GrowthBook calculates the chance to beat control. Additionally, it uses a violin plot to show the distribution of the relative uplift (how much better is it?) and provides an estimation of the risk (how many conversions would I lose if I choose B and it's actually worse?).

The percentage change value is more likely to be in the thicker part of the graph. The shorter the tails of the graph, the less uncertainty there is.
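To make the beta-binomial mechanics more concrete, here is a small sketch that updates a Beta(1, 1) prior with observed data and estimates the chance to beat control and the risk by sampling from the posteriors. It illustrates the general approach only; GrowthBook's own implementation is described in its whitepaper. The counts are illustrative.

import numpy as np

rng = np.random.default_rng(42)

def bayesian_summary(users_a, clicks_a, users_b, clicks_b, sim_count=100_000):
    """Beta-binomial update with a Beta(1, 1) prior plus Monte Carlo estimates
    of the chance to beat control and the expected loss (risk) of choosing B."""
    # Posterior for each variant: Beta(1 + clicks, 1 + non-clicks)
    samples_a = rng.beta(1 + clicks_a, 1 + users_a - clicks_a, size=sim_count)
    samples_b = rng.beta(1 + clicks_b, 1 + users_b - clicks_b, size=sim_count)

    chance_to_beat_control = np.mean(samples_b > samples_a)
    # Conversions lost on average if we pick B although A is actually better
    risk_of_choosing_b = np.mean(np.maximum(samples_a - samples_b, 0))
    return chance_to_beat_control, risk_of_choosing_b

# Illustrative counts: 5% observed click rate vs. 6%
print(bayesian_summary(8000, 400, 8000, 480))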

By choosing the frequentist engine instead, GrowthBook will show the p-value (instead of the chance to beat control) and the 95% confidence interval of the percentage change (instead of the violin plot). See Wikipedia for a correct interpretation of the confidence interval.
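As a rough analogue, a Wald confidence interval for the absolute difference of the conversion rates can be computed by hand (note that GrowthBook displays the interval for the relative percentage change instead); the counts below are taken from the first A/B test scenario evaluated further down:

import numpy as np
from scipy.stats import norm

def diff_confidence_interval(users_a, clicks_a, users_b, clicks_b, alpha=0.05):
    """Wald confidence interval for the absolute difference of two conversion rates."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    standard_error = np.sqrt(p_a * (1 - p_a) / users_a + p_b * (1 - p_b) / users_b)
    z = norm.ppf(1 - alpha / 2)
    difference = p_b - p_a
    return difference - z * standard_error, difference + z * standard_error

# Counts from the first A/B test scenario evaluated below
print(diff_confidence_interval(8161, 408, 8127, 487))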

If the Bayesian engine is not feasible for your experiment, consider sequential frequentist testing (a premium feature). Plain frequentist testing is recommended only as a last resort due to potential issues related to the peeking problem.

GrowthBook: Evaluation

Following the evaluation approach introduced above, we now present our review of GrowthBook as an A/B testing tool.

Preparing the application for GrowthBook

For the experiment with GrowthBook we had to implement a simple backend (we used Django here) which stores the assignment and subscription events in a PostgreSQL database. GrowthBook can read from this database without issues. At least two small SQL statements have to be written; the GUI provides very good support for constructing them.
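For reference, the backend models could look roughly like this: a sketch with field names inferred from the queries below; the real implementation may differ in its details.

from django.db import models

class VariantAssignment(models.Model):
    """One row per bucket assignment reported by the frontend."""
    session_id = models.CharField(max_length=64)
    experiment_id = models.CharField(max_length=64)
    variant_id = models.CharField(max_length=64)
    timestamp = models.DateTimeField(auto_now_add=True)

class Event(models.Model):
    """Generic user events; only type = 'subscribed' is used for the click-rate metric."""
    session_id = models.CharField(max_length=64)
    type = models.CharField(max_length=32)
    timestamp = models.DateTimeField(auto_now_add=True)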

The first query is for the bucket assignment:

SELECT session_id   as session_id,
       timestamp     as timestamp,
       experiment_id as experiment_id,
       variant_id    as variation_id
FROM abtest_backend_app_variantassignment

The second is the source query for the click rate metric:

SELECT session_id as session_id,
       timestamp   as timestamp
FROM abtest_backend_app_event
WHERE type = 'subscribed'

Following the documentation, the integration of the JavaScript SDK itself was easy: construct the GrowthBook object, load the features and determine the bucket-id based on experiment-id and session-id. Finally, react to the bucket-id with some DOM changes.

const GROWTHBOOK_URL = …
const CLIENT_KEY = …
const sessionId = …
const featureId = …

// Create a GrowthBook instance
const gb = new window.growthbook.GrowthBook({
    apiHost: GROWTHBOOK_URL,
    clientKey: CLIENT_KEY,
    // Enable easier debugging during development
    enableDevMode: true,
    // Targeting attributes
    attributes: {
        sessionId: sessionId
    },
    trackingCallback: (experiment, result) => {
        // TODO: Use your real analytics tracking system
        console.log("trackingCallback: Viewed Experiment", {
            experimentId: experiment.key,
            variationId: result.value
        });
    }
});

// Wait for features to be available
let loadFeatures = gb.loadFeatures({autoRefresh: true});
loadFeatures.then(value => {
    console.log("features loaded. Asking for a variant ... ");
    const variant = gb.getFeatureValue(
        featureId,
        "fallback-param"
    );
    console.log("got variant:", variant);
    if (variant === "fallback-param") {
        console.log('Unknown feature id ' + featureId);
    }
    // Insert here: Change the DOM in some way based on variant
});

Statistical presentation and reliability

In the evaluation section above, we devised different A/B testing scenarios to be used for evaluating the tools. Following the instructions given in the experiment setup, we created a Feature and a corresponding Experiment in the GrowthBook UI for each of these scenarios. After sending the simulated user data to the backend, we can now view the results in GrowthBook and compare them with our own evaluation.

Expected results and comparing them

Similarly to what we did in the statistical section of our previous blog post, we compare the evaluation of the A/B test scenarios given by GrowthBook with our own analysis using Python libraries. Notice that one should be able to reproduce exactly the same results given in the GrowthBook Experiments UI by using the GrowthBook stats engine Python library gbstats.

First of all, some general remarks about comparing the values displayed by GrowthBook with the Python-calculated ones are important:

  • In all cases we expect the significance of the test to match. 
  • For the frequentist approach we expect the p-value to be slightly different, since the estimation done by GrowthBook does not seem to follow the classical formula for the p-value of an independent two-sided t-test (which is what we chose for our Python evaluation). Have a look at the source code of the gbstats library developed by GrowthBook for more information on that.
  • For the Bayesian approach it is a bit more complex, since the chance to beat control is the outcome of one or multiple simulation runs. However, simulation runs use a pseudo-random number generator (PRNG), which can cause the result to vary. For this reason, we perform multiple simulation runs and obtain a range of chances to beat control. We expect the value shown in the A/B testing tool to be inside this range. Also, the bayesian-testing library and GrowthBook use different methods to estimate the chance to beat control. Have a look at the GrowthBook whitepaper and this post for a detailed explanation of how this estimation is done by GrowthBook. Finally, we use different simulation sizes in our calculation, as this may also have an effect on the results.

For the frequentist approach we used the statsmodels library to calculate the test statistic and p-value for randomly generated data.

For the Bayesian approach we used the bayesian_testing library to evaluate a Bayesian test. 

import numpy as np


SEED = 42


def create_random_rawdata(num_users, num_clicks):
    """Generate an array of zeros and ones having:
       length = num_users and num_clicks random ones"""
    np.random.seed(SEED)
    rawdata = np.zeros(num_users)
    random_indices = np.random.choice(num_users, num_clicks, replace=False)
    rawdata[random_indices] = 1
    return rawdata


def print_frequentist_evaluation(users_baseline, users_challenger, clicks_baseline, clicks_challenger):
    """Print click rates, p-value and test statistic for a two-sided two-sample t-test"""
    from statsmodels.stats import weightstats

    # Create "fake" rawdata using randomly generated arrays of zeros and ones
    data_baseline = create_random_rawdata(users_baseline, clicks_baseline)
    data_challenger = create_random_rawdata(users_challenger, clicks_challenger)

    cr_baseline = np.mean(data_baseline)
    cr_challenger = np.mean(data_challenger)

    # Calculate test statistic and p value
    tstat, p, _ = weightstats.ttest_ind(
        x1=data_challenger,
        x2=data_baseline,
        alternative='two-sided',
        usevar='pooled',
        weights=(None, None),
        value=0
    )

    print("Click-rate baseline: {:.2f}%".format(cr_baseline * 100))
    print("Click-rate challenger: {:.2f}%".format(cr_challenger * 100))
    print()
    print("T-test statistics: ")
    print("  p-value: {:.3f}".format(p))
    print("  tstat: {:.2f}".format(tstat))


def print_bayesian_evaluation(users_baseline, users_challenger, clicks_baseline, clicks_challenger):
    """Start a binary bayesian test between baseline and challenger for different simulation
       sizes and return min and max of chance to beat baseline"""
    from bayesian_testing.experiments import BinaryDataTest

    bayesian_test_agg = BinaryDataTest()

    bayesian_test_agg.add_variant_data_agg(name="baseline",
                                           totals=users_baseline,
                                           positives=clicks_baseline,
                                           a_prior=1,
                                           b_prior=1)

    bayesian_test_agg.add_variant_data_agg(name="challenger",
                                           totals=users_challenger,
                                           positives=clicks_challenger,
                                           a_prior=1,
                                           b_prior=1)

    chances_to_beat_control = []
    min_sim_size = 10000
    max_sim_size = 100000
    num_sizes_to_test = 10
    for sim_size in np.linspace(min_sim_size, max_sim_size, num=num_sizes_to_test).astype(int):
        # Get bayesian test evaluation
        evaluation = bayesian_test_agg.evaluate(sim_count=sim_size, seed=SEED)
        # Extract chance to beat control
        chances_to_beat_control.append(evaluation[1]["prob_being_best"] * 100)

    print(
        f"For a sim size between {min_sim_size} and {max_sim_size} ({num_sizes_to_test} different values tested), the "
        f"chance\n  to beat control is between {min(chances_to_beat_control):.2f}%"
        f" and {max(chances_to_beat_control):.2f}%.")

    chances_to_beat_control = []
    num_seeds_to_test = 10
    for i in range(num_seeds_to_test):
        # Get bayesian test evaluation
        evaluation = bayesian_test_agg.evaluate(sim_count=min_sim_size, seed=SEED + i)
        # Extract chance to beat control
        chances_to_beat_control.append(evaluation[1]["prob_being_best"] * 100)

    print(f"For {num_seeds_to_test} different seeds and a sim size of {min_sim_size}, the "
          f"chance to beat control is\n  between {min(chances_to_beat_control):.2f}%"
          f" and {max(chances_to_beat_control):.2f}%.")

A/A Test Scenario

Frequentist Engine

We see a p-value (0.94) close to 1, which indicates no significant difference between the variants. Also, the percentage change between the conversion rates is centered at zero.

To verify we call:

print_frequentist_evaluation(8181, 8163, 410, 407)

which outputs:

Click-rate baseline: 5.01%
Click-rate challenger: 4.99%

T-test statistics: 
  p-value: 0.940
  tstat: -0.08

and we see that the p-values are equal.

Bayesian Engine

For the Bayesian approach we see a chance to beat control close to 50% and a percentage change centered at zero. This clearly shows no difference between the variants.

To verify we call the other helper method:

print_bayesian_evaluation(8181, 8163, 410, 407)

and get:

For a sim size between 10000 and 100000 (10 different values tested), the chance
  to beat control is between 46.51% and 47.14%.
For 10 different seeds and a sim size of 10000, the chance to beat control is
  between 46.51% and 47.18%.

We see that the value computed by GrowthBook (47%) is within the range.

In summary, the tool does not find any significant difference between the variants with either approach. This matches our expectations.

First A/B Test Scenario (Baseline 5% - Challenger 6%)

Frequentist Engine

In this case the Python code is called with print_frequentist_evaluation(8161, 8127, 408, 487) and outputs:

Click-rate baseline: 5.00%
Click-rate challenger: 5.99%

T-test statistics: 
  p-value: 0.005
  tstat: 2.78

As we have mentioned before, we see a minor difference between the p-values. But the final decision on the statistical significance is the same.

Bayesian Engine

In this case the Python code is called with print_bayesian_evaluation(3134, 3166, 157, 190) and outputs:

For a sim size between 10000 and 100000 (10 different values tested), the chance
  to beat control is between 95.71% and 95.86%.
For 10 different seeds and a sim size of 10000, the chance to beat control is
  between 95.46% and 96.09%.

We see that the evaluation result of GrowthBook is in both Python-computed ranges.

It is possible to change the way GrowthBook analyzes data even after the experiment is finished. When we look at the frequentist data from above with the Bayesian approach, we get the following picture. In it we see that with more sessions (8 144 sessions per bucket) the chance to beat control gets higher and the violin plot of the percentage change gets narrower (more certainty).
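For reference, the same counts can also be fed into our Bayesian helper from above (output omitted here):

# Bayesian evaluation of the counts used in the frequentist run above
print_bayesian_evaluation(8161, 8127, 408, 487)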

Second A/B Test Scenario (Baseline 3% – Challenger 3.3%)

Frequentist Engine

The Python function call of print_frequentist_evaluation(53076, 53290, 1592, 1759) outputs:

Click-rate baseline: 3.00%
Click-rate challenger: 3.30%

T-test statistics: 
  p-value: 0.005
  tstat: 2.81

The situation is similar to the one in the first A/B test scenario.

Bayesian Engine

The output of print_bayesian_evaluation(53076, 53290, 1592, 1759) is:

For a sim size between 10000 and 100000 (10 different values tested), the chance
  to beat control is between 95.71% and 96.10%.
For 10 different seeds and a sim size of 10000, the chance to beat control is
  between 95.80% and 96.33%.

Here the GrowthBook value is also in range.

Summary of test scenarios

Testing GrowthBook with our three scenarios showed that, while there are some minor (and expected) differences for the frequentist approach, the statistical results match our computations. The way GrowthBook presents the results is also clear.

Our opinion on GrowthBook

We summarize here some Pros and Cons that we have observed using GrowthBook for implementing our tests.

Pros

  • Versatility: One significant advantage of GrowthBook is its versatility. Unlike other analytics platforms, it is not limited solely to web applications. This flexibility allows businesses operating across various domains to benefit from its features. On the other hand, this means that you also have to write some glue code even for the basic web application case.
  • Solid Statistical Methods: The platform uses validated statistical methods to evaluate the results. In this way businesses can make informed decisions based on trustworthy insights.
  • Flexible Statistical Approach: Although GrowthBook favors the Bayesian approach, it is possible to switch the engine and evaluate your test using frequentist methodologies. Additionally, GrowthBook offers the possibility to enable sequential testing for the frequentist approach (although this is only available as a premium feature). Changing the engine (Bayesian to frequentist and vice versa) can be done with a simple change in the settings menu; no restart is required. A minor con here is that this setting is global for all experiments.
  • Cost and Setup: All the features we have tested are free to use and easy to set up. This makes the tool very appealing, in particular for private users or small businesses but also for teams who want to start using a tool to get their feet wet.

Cons

  • Missing details in documentation: Although most of the instructions and concepts are documented on the docs page, some valuable details are missing. For example, while trying to download our results as a Jupyter Notebook we ran into difficulties setting up the right configuration. After searching for more details, we ended up modifying the data source settings directly in the config.yml.
  • Requirement for Custom Data Tracking: Users have to implement their own data tracking mechanisms or rely on third-party solutions for bucket assignment. While not necessarily a dealbreaker, this additional requirement can add complexity to the setup process.
  • Feature vs. Experiment: Distinguishing between features and experiments within GrowthBook may be required to support the different use cases, but it is not intuitive for a new user, and the relationship between them becomes clear only over time.

Summary

In summary, GrowthBook presents an appealing entry point into the world of A/B testing as an open-source solution, eliminating the need for an initial purchase. Developed by individuals well-versed in the intricacies of A/B testing, its self-hosted or cloud-hosted options offer users added flexibility. It serves as an excellent introductory tool for those beginning their experimentation journey. Stay tuned for our exploration of the next tool in our upcoming blog post.
