AB split test graphical Bayesian calculator

Version | Include | Trials | Successes | Approx. probability of being best | 95% chance conversion rate between
A    
B    
C    
D    



 

What is this calculator for?

The aim in analysing split test data is to tell apart

  • the signal on which you can act
  • the noise of random variation.

Most split testing tools give you some variation on significance testing to do this job.

There are a number of issues with null-hypothesis significance testing; this Wikipedia article gives some good examples and references.

This calculator takes a different approach. A Bayesian approach can give you a good estimate of the probability that A beats B given the data you have – which is, after all, the business question!

The plots show the probability distribution of conversion rates, given the data. The probabilities of being the most successful version, displayed in the table, are based on a random sample of several thousand points drawn from each distribution (the Monte Carlo method). For experiments that are close, you will notice the probabilities may vary slightly if you re-calculate.
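
As a rough illustration of the Monte Carlo step, here is a minimal Python sketch (not the calculator's own JavaScript). It assumes the uniform prior described under "Assumptions and the maths", so each version's posterior is taken to be Beta(successes + 1, failures + 1). The function name prob_best, the default of 5,000 draws and the example counts are illustrative assumptions.

```python
import random

def prob_best(data, draws=5000, seed=None):
    """Estimate each version's probability of having the highest conversion rate.

    data maps a version name to a (trials, successes) pair.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in data}
    for _ in range(draws):
        # Draw one plausible conversion rate per version from its posterior.
        # Assumption: uniform prior, so the posterior is
        # Beta(successes + 1, failures + 1).
        sample = {
            name: rng.betavariate(s + 1, n - s + 1)
            for name, (n, s) in data.items()
        }
        # Credit the version with the highest sampled rate in this round.
        wins[max(sample, key=sample.get)] += 1
    # Proportion of rounds won by each version.
    return {name: count / draws for name, count in wins.items()}

# Illustrative counts: version -> (trials, successes)
result = prob_best({"A": (89, 10), "B": (67, 1)}, seed=1)
print(result)
```

Because the estimate comes from a finite number of random draws, running it again without a fixed seed will give slightly different proportions, which is the run-to-run variation mentioned above.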

The calculations depend on a few assumptions. In particular, it is assumed that each trial has an equal probability of success, so if something else changed during your experiment, it may throw off the results (such changes would also be a problem for simple approaches to traditional significance testing).

Why use it?

A Bayesian approach to analysis of AB tests has many important advantages compared to approaches for estimating statistical significance.

It can often enable you to draw useful inferences, even where conversion rates and sample sizes are low.

  • A weak signal, if that is all you have, is enough for some marketing decisions: you can decide what level of confidence you need based on the business situation.
  • If you have a strong signal, the answers this calculator gives you will be the same as you get from significance testing.

Measuring conversions – not micro-conversions
One particular issue we see in significance testing for online split testing is what you choose to measure as a conversion – we’ve had clients who were advised to measure only the immediate click-through-rate relating to their variations (the micro-conversion), rather than final conversion, because the tests would “reach significance faster”.

We think that is highly dangerous – in optimising for a micro-conversion you can easily damage your ultimate conversion rate. The point of split testing is to improve conversion – statistical significance is, at best, a tool not an objective!

That’s where the approach of this calculator comes into its own – extracting business meaning from weak signals such as

  • conversions too rare to reach significance
  • low traffic
  • optimising for a smaller segment (eg mobile)

You can use the calculator on its own, or as an adjunct and cross-check of the numbers you are getting from your split-test tool.

Reading the graph

The graphs show a probability distribution for the conversion rates of each variant.

  • The horizontal axis is conversion rate expressed as a percentage.
  • The area under the curve between any two points on the horizontal axis represents the probability that the conversion rate lies between those points.
  • The vertical axis shows a scale that makes the whole area under each curve integrate to 1 – so that the area represents probability.

The spread of the curve represents how precisely the experiment has measured the conversion rate.
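
One way to put a number on that spread is a range containing 95% of the area under a version's curve, which is how the "95% chance conversion rate between" column can be read. The sketch below is a minimal Python illustration, assuming a uniform prior (so the curve is Beta(successes + 1, failures + 1)) and an equal-tailed range estimated from random draws. Whether the calculator uses this definition is not stated here, and the function name credible_interval and the example counts are hypothetical.

```python
import random

def credible_interval(trials, successes, level=0.95, draws=20000, seed=None):
    """Approximate an equal-tailed credible interval for a conversion rate."""
    rng = random.Random(seed)
    # Assumption: uniform prior, so the posterior is
    # Beta(successes + 1, failures + 1).
    samples = sorted(
        rng.betavariate(successes + 1, trials - successes + 1)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    # Read the lower and upper cut-offs from the sorted draws.
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper

# Illustrative counts: 200 trials, 20 successes
low, high = credible_interval(200, 20, seed=1)
print(f"{low:.1%} to {high:.1%}")
```

With more trials at the same observed rate, the two cut-offs move closer together, which corresponds to a narrower curve on the graph.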

The more the areas under the curves overlap, the less clearly your experiment has separated the probable conversion rates.

  • If the means are reasonably well separated but the curves are wide, you need more trials.
  • If the means are very close together and the curves are getting quite steep, there probably isn’t much difference between A and B in terms of conversion rate.

You will see that as your numbers of trials and conversions increase, the sharpness, and hopefully the separation, of the peaks increases. What you are aiming to achieve is a clear signal of well separated peaks.

Assumptions and the maths

The calculation assumes that you are measuring a variable that has only two values: success and failure, and that the assumptions of a binomial distribution apply.

The posterior probability is a beta distribution.

A uniform prior probability is assumed.
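
Put together, these three statements can be written out as follows. This is a sketch of the standard beta-binomial result rather than a transcript of the calculator's code; the symbols are assumptions introduced here: θ is a version's true conversion rate, n its number of trials and s its number of successes.

```latex
\begin{aligned}
\text{prior:}\quad & p(\theta) = 1, \qquad 0 \le \theta \le 1 \\
\text{likelihood:}\quad & P(s \mid n, \theta) = \binom{n}{s}\,\theta^{s}(1-\theta)^{n-s} \\
\text{posterior:}\quad & p(\theta \mid n, s) \propto \theta^{s}(1-\theta)^{n-s}
  \;\Longrightarrow\; \theta \mid n, s \sim \mathrm{Beta}(s+1,\; n-s+1)
\end{aligned}
```

The uniform prior is itself Beta(1, 1): before seeing any data, every conversion rate between 0% and 100% is treated as equally likely.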

Technology

The distribution is calculated and plotted using the jStat JavaScript statistical library.

Other References

Next steps

Looking at analysis where the variable in question is not binary, for example, spend-per-customer or time-on-site

44 thoughts on “AB split test graphical Bayesian calculator”

  • This is really great. Thanks for posting. I was curious if you had any thoughts on the best way to split test when revenue per view is the defining metric? A higher success rate is one part of the story, but when you have revenue differences among the treatments it adds another level of complexity. Would a simple comparison of means be sufficient?

    • Thanks Justin. This is a really interesting question – the calculator here only works for yes/no type variables so far, and we can expect the distribution of revenue per visit to be different.

      I think you need to be careful with a simple comparison of the means, especially if your sample is not very large. Without some mathematical analysis it’s very hard to know how much of the difference is likely to be random variation.

      I’m going to need a bit of time to work up an answer. In the interim if anyone has one please post.

  • Would you be willing to share the math behind your calculations in the tool? Additionally, I’m interested in your thoughts on Rev/Visit as a metric and how to calculate a probability of success.

    Great post and tool though.

  • This is very interesting. I recently studied Bayesian statistics at Carnegie and it feels great to see it practically implemented. Could you share the calculation behind it?

  • your first link (The Technical) gives me a web site that seems to be decommissioned. Like Kartik, I am eager to compare your math to mine…

  • This is fantastic. Boy, would I sure love a tool that measured the statistical significance in regards to revenue/visitor. It has been a year since this post. Have you come up with any innovative and creative ways to measure the confidence in a test involving non-binomials? Thanks for this article!

  • Thanks Justin. Super interesting.

    How do you calculate the “Aprox probability of being best”?

    I’ve reviewed the excellent posts you’ve referenced.

    Many thanks
    Mark

    • Thanks Mark. The ‘probability of being best’ is calculated using a Monte Carlo approach.
      The displayed curves represent the probability distribution of the conversion rate for each version measured: A, B and optionally C & D. The code generates random points that follow the same probability distributions (using jStat). So I get it to generate one random point for each distribution, then look to see which has the highest conversion rate, repeat 5,000 times, and report the proportion of wins for each distribution. My code for this is visible in this js file.

      This is the same method that Sergey Feldman implements in Python here.

  • Justin, this is truly fantastic. I’m a bit confused about something, though. In another article, I read that the Bayesian approach used Anscombe’s method which provides a formula to determine the stopping point of an experiment. I can’t type it here, but its variables are: y is the difference between results of A and B, k is the expected number of future users who will be exposed to a result, and n is the number of users who are exposed to the test so far, and Phi-inverse is the quantile function of the standard normal.

    Since you don’t use any such variables, are you using the Bayesian approach? Or is Anscombe’s method just part of the Bayesian approach that allows you to determine when it’s OK to stop the test?

    Finally…this is probably a very dumb question….am I safe to assume it’s OK to stop the test when your calculator’s “Aprox probability of being best” reads 100% and 0%?

    Thanks!

    • Thanks Kevin – really interesting question.
      As I understand it Anscombe described a method for choosing the optimal stopping time in a Bayesian trial. When to stop is a really important issue in applications such as clinical trials, and even some marketing applications, where you need to optimise quickly and balance the desire for certainty against the need for speed. I’d be interested to see the article you mention!
      However, this calculator leaves the question of stopping up to you. I think in most marketing situations you want to consider quite a few business factors as well as the posterior PDFs you are seeing, for example:

      • what it’s costing you to run the trial (including the opportunity cost of running a version that seems sub-optimal)
      • what is at stake – how much difference will this make to the bottom line
      • your tolerance for risk
      • any external deadline (eg have to make a decision before the next site build…)
      • what the next step is (eg are you going to spend a lot of money implementing one of the versions in test; or are you just going to choose the best and move on to testing other versions against that)
      • and so on.

      I think you can stop if you get to 100%, and often before – 100% certainty is nice if you can get it, but in practice your A and B are often not all that different, and in some cases you can make a decision on much less.
      The nifty thing about the Bayesian approach is, at any point in the test, that you can see the relevant probability (ie is A better than B) based on the data collected, and factor that into what’s ultimately a business decision.

        • Sorry for my slow reply Kevin. It’s an interesting question. Probably splitting hairs, but I don’t think I would say that the Bayesian approach is to “use the statistical likelihood to make a decision when to stop the test” – although I don’t think that’s incorrect. What’s essentially Bayesian about this calculator is the maths that gets you to the probability distribution – well explained by Sergey Feldman here. There are Bayesian approaches to the question of when to stop. However, I haven’t really considered them here.

    • Not really – I think Chi-Squared is generally about the sample distribution of test statistics – used for hypothesis testing and checking for fit to a distribution. This is a different approach to a related question.

  • Hi,

    Thanks for uploading this, it’s very useful. Is there any way to recreate this formula in excel? I would be very appreciative if you knew of any resource you could point me in the direction of which would show me how to set this up in excel.

    Thanks

  • Justin —

    Do you know why the calculator does not display anything if you make the trials number large? My sense is that there are some NaN or divide by zero errors occurring. Question — how would one get around this considering number of data points drives the solution.

    • Yes – sorry about that – it’s a problem with the jstat library not liking beta distributions with large parameters. I see that they have recently fixed this, and I am working on a new version using the latest jstat.

  • Hi Justin, this is a very useful tool and an extremely interesting article, thank you very much. I will definitely be using this tool to analyse my conversion tests. I also run tests where I need to compare averages rather than conversion rates, so do you know of a calculator I could use for those tests that uses the same Bayesian theory you have used here?

    • Thanks Craig. I don’t know of a calculator that will work straight off the shelf.

      If you are looking at the average of a variable where the underlying distribution is log-normal (eg revenue and time on site can often be modelled as log-normal) then Sergey Feldman’s post gives you the maths (math if you are American!) and code in python.

      I am still aiming to get something working here, but struggling to get time!

  • Hi Justin! Thanks for this tool. Is there a way to add on a version “E”? Or any more for that matter? I have data from 5 versions and I would like to compare them all.

  • I input 89 trials, 10 successes, and 67 trials, 1 success. If I keep clicking calculate sometimes I get 99% for the first variant, and sometimes I get 100%. So the answer keeps changing. Do you think this is a rounding issue? Thanks.

    • Hi Eric,
      Thanks for the comment. Yes – I think this is a combination of sampling and rounding issues. The probability is computed using a Monte Carlo approach, so it can give very slightly different answers on different runs. When this is rounded to a whole-number percentage, it can appear to move quite a long way. I suppose I could expose a couple of decimal places, but I don’t want it to look like it’s more precise than it actually is.
      Justin

    • Thanks Yanir – I’ve had a quick look at your calculator and it looks great! It’s so good to see this approach spreading and I think your work will really help.
      I’ve been thinking of doing a version that exposes more options (as well as updating the look a bit). I’ve been trying to think of a way of gathering useful priors that doesn’t involve major conceptual leaps by the end-user – I’ve found this tricky. I’ll be keen to hear how your setup goes.
      I’m also keen to get back in touch once I’ve had a closer look at your calculator

    • Hi Yanir

      Thanks for building this tool. This calculator is awesome and very flexible.

      With limitations on sample size, more and more people are bending towards Bayesian approach these days. It would be great if there are options to include multiple variants.

  • Hi Justin!

    Firstly, thanks for building this awesome tool for A/B split-testing. It has helped a lot in my work to be able to get a good estimate on how should I optimize my campaigns.

    Secondly, I am currently facing a problem: the cost per trial is not the same for each set, so the “conclusion” calculated would not be an accurate estimate. Is there a way to work around this? I’ve seen comments above from years ago talking about revenue/view. How could I use that metric, and how could it be incorporated into this calculator?

    Looking forward to your kindest reply.

    Best Regards.

    • Hi Zeth,

      Thanks very much for the comment. Interesting problem!

      If I have understood properly, then I think the general Bayesian approach is to work out a loss function for each branch of the test – this combines the expected conversion, the uncertainty around that, and the expected cost with its uncertainty, into one equation, which gives you a distribution of expected utility for each option.
      Unfortunately, this is way beyond the scope of this calculator, and it’s going to depend on the specifics of your situation such as the distribution of profit (eg does the revenue per transaction vary, and in what way) as well as cost.

      If you’re only seeing a small variance in conversion, and that is still going to make a substantial difference to your bottom line, then I’d suggest talking with a mathematician – email me if this is the case, and I can put you in touch with people with relevant expertise.

      Otherwise, you may be able to reason through to a decision based on some simplifying assumptions. For example, have a look at the observed distributions of transaction amounts and see if there is much variation between the test branches. If not, and if the probability distributions of the conversion rates are well separated, then it may be reasonable to simply calculate the overall cost and benefit based on the expected conversion rates, and have a look at those. Email me if you’d like to discuss in more depth.

      Justin

  • Justin – this is a fabulous tool and I just love using it! Thank you for making it available! Do you think you’d be able to add a cost weighting dimension at some point? Is that even possible?

    I like many others, use your tool for split testing marketing campaigns. While understanding from a small number of trials which campaign is more likely to have a higher conversion rate is incredibly useful, often the cost of each campaign per trial is different. It’s a no brainer when they cost the same or the worse CTR is more expensive. But what to do when the cost is the other way? Is there a way to factor in cost somehow in a bayesian manner?

    • Hi Adam – thanks! Good question. There is a way to factor in costs in a Bayesian way. In terms of the maths, it’s a very similar question to Zeth’s one about accounting for differences in revenue for split test branches, and I think a similar answer applies.
      In general the approach would be to produce functions that give the distribution of costs minus rewards with conversion rate as a parameter. Then see how this varies between the split test options given the evidence of the test. You arrive at a probability distribution of value for each branch of the split test, so they can be compared in terms of likely $. Fortunately, numerical approaches and tools make the maths of this approach feasible.
      However, unfortunately, there are so many options for how costs and rewards could be structured that (so far) I can’t figure out how to make a tool for it that would be generic enough to cover many situations, and still be manageable to build and use.
      We can do this case-by-case, but we need quite a lot of specific information about the business situation to build the model and help interpret the results.

  • Great tool, using it now. We use optimizely to stage our site tests – any cautions around not using their numbers instead of your calculator (which logically and through your discussion I prefer.) Tried to figure out optimizely and the best I can surmise is they use old school split tests. Thanks again for putting your work out there.

    • Hi Matt. Thanks!

      In the years since I first published this, Optimizely have done some great work on stats and their stats engine. They now take a hybrid approach – see https://blog.optimizely.com/2015/03/04/bayesian-vs-frequentist-statistics/. I’m not qualified to comment on the quality of the maths, but I can say that they’ve published it and had it reviewed, which is more than can be said for some others. I certainly wouldn’t caution against using their numbers.

      Hopefully this approach gives another perspective which can be really useful in the case of low traffic or rare conversions.

  • Thank you for making this calculator available!

    Would you explain what you mean by “A uniform prior probability is assumed.”?

    Are you using an uninformative prior?
