An un-edited copy-pasta from an article I wrote several years ago. As mentioned at the end, I do hope to get around to writing about the null hypothesis and statistical power.
Something interesting that I’ve been reading about lately, and that is becoming a trend in the software lifecycle of web services, is AB Testing. AB Testing is all about deployment by data: your software goes through a kind of evolutionary process, at the end of which the fittest version wins. It’s the data that lets you decide which version is fittest, and expunge the less suitable. There is heavy overhead in setting up an environment around your product that makes AB testing possible, and it’s for this reason that very few companies do it (and most of the ones that do have large user bases).
Before we look into AB testing any further, we should understand the context of where it fits in the software lifecycle.
Before there was AB testing, there were user feedback pages. You’d write some software, send it out into the world or publish it to your web service, and users who felt one way or another would go to a special page and provide feedback. If you had a user base of a few hundred or in the low thousands, this type of system worked, but it had several problems:
- Asking for feedback regularly irritates the very people you’re trying to bring joy to by providing the software in the first place.
- Open-ended “comments” boxes didn’t (and don’t) scale. Who is going to spend their morning reading through the hundreds or thousands of comments from the day before? Most of them are likely to be “You suck” or “BUY CHEAP PILLS AT medco.cn”. Unless you have only a small number of users, the value from this kind of open-ended feedback is minimal. (And hey, if you only have a clutch of customers, just call them up and ask them what they think!)
- Closed-answer surveys are mostly only good at finding out if something is bad, not necessarily what is bad, or what alternative might be better. Surveys are also more time consuming, and people are more likely to ignore them.
- Most feedback is negative. If people like what they have, they just use it, meaning that the only people who are likely to comment are those who really don’t like it, and even then they’ll probably only give you feedback if there’s no alternative software they can use that is better.
This last point is interesting, because it’s not related to scale, but rather is endemic to any software feedback. If we consider the old adage that “There are thousands of ways to be wrong, and only a few ways to be right”, then negative feedback isn’t very useful, as it doesn’t help you uncover what the user wants, only what they don’t want. Negative feedback isn’t very constructive.
Extending the problem of negative feedback and the points above, we see that what you will get is a small amount of feedback that will mostly be extremely negative. So when a single piece of user feedback says the big red button is absolutely heinous and destroys their will to live, and no other feedback says anything about it (or there is no other feedback at all), do you take it seriously? Is this user right and no-one else can be bothered to say anything, or are they just having a bad day and decided to take it out on your software interface? If you change the big red button, will it frustrate other users? (And indeed, with only negative feedback to go on, what do you change it to?)
Clearly, this sucks. But it’s easy to implement.
Enter the realm of AB Testing. The concept is similar in nature to conducting medical trials, where one set of patients gets a medicine, and the other gets a placebo as the control. In software, you show several different users (or groups of users) several different styles of your software, and see which version is liked the most. Some users don’t see any change at all; this is the control group. At a high level, there are three elements to this:
- You have to actually make several different styles or versions of your software.
- You have to have some way to present each version to a subset of your users (without changing your entire site).
- You have to have some way to determine if they like it.
It would be easy to think the first one is all about man hours, but that’s not entirely true. The key to making many changes easily to the software (usually the interface) is to have a good architecture that separates things you want to change from other core functionality. This allows you to easily localize the change.
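As a rough illustration of that separation, here is a minimal Python sketch. Everything in it (`save_document`, `BUTTON_VARIANTS`, `render_save_button`, and the button styles) is made up for the example; the point is only that the variants live in one small table while the core logic stays untouched.

```python
from typing import Callable, Dict


def save_document(text: str) -> str:
    """Core functionality: identical no matter which variant the user sees."""
    return f"saved {len(text)} characters"


# The things we want to experiment with live in one table, away from the core.
# Both entries are hypothetical button styles invented for this sketch.
BUTTON_VARIANTS: Dict[str, Callable[[], str]] = {
    "control": lambda: "[ SAVE ] (big red button)",
    "trial": lambda: "[ Save ] (small grey button)",
}


def render_save_button(variant: str) -> str:
    """Render the save button for a variant, falling back to control."""
    renderer = BUTTON_VARIANTS.get(variant, BUTTON_VARIANTS["control"])
    return renderer()
```

Adding a third style then means adding one entry to the table, not touching `save_document`.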
The second bit is where the overhead starts to creep in. You need some way of dividing the user base up so some people are diverted to the trial product and others to the normal product. An easy way is to email users and give them the option to either install a new set of binaries, or go to a “-trial” URL instead of the normal one. But this is hit-and-miss, and it can skew your data, because the users who choose to opt in are not a random sample of your user base. What you really want is an automated process by which you can select a set of users, and force them to experience the new product. The framework and process development for this is considerable.
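One common way to automate the split is to hash each user’s ID into a bucket. The sketch below is only one possible approach, not a description of any particular framework; the function name `assign_variant`, the `experiment` label, and the 10% default are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, trial_fraction: float = 0.1) -> str:
    """Deterministically place a user in 'trial' or 'control'.

    The same user always gets the same answer for a given experiment,
    so nobody flips between versions from one visit to the next.
    """
    # Mixing in the experiment name means a user who lands in the trial
    # group for one experiment is not automatically in it for the next.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "trial" if bucket < trial_fraction else "control"
```

Because the assignment depends only on the ID and the experiment name, there is no per-user state to store, and the user has no say in which group they end up in.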
The last step is actually a two-part item. Firstly, you need to figure out what they do, and then you need to figure out if what they’re doing means they like your product (or new feature) more or less. So you instrument your product so that events are triggered by their actions. These events can be used to answer questions like “did they click save or press ctrl-s?” or “how long did they spend on that screen? (how long did it take them to find what they were looking for)”. With enough instrumentation, and the right kind, you can find out how your users actually use your product without invading their privacy (since how they use it is not the same as what they use it for, or what content they produce with it). Keep in mind that you need to instrument both the control product and the trial product. (There are actually a lot of other reasons why you should instrument your product, more on this later).
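To make the idea of instrumentation concrete, here is a small Python sketch of an in-memory event log. The `Event` and `EventLog` names and the event names (`save_click`, `save_shortcut`) are invented for the example; a real product would ship these events off to a central store rather than keep them in a list.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    user_id: str
    variant: str   # "control" or "trial": both products are instrumented
    name: str      # what the user did, never what they wrote or saved
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Collects events in memory; a stand-in for a real collection service."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def record(self, user_id: str, variant: str, name: str) -> Event:
        event = Event(user_id, variant, name)
        self.events.append(event)
        return event

    def count(self, variant: str, name: str) -> int:
        """How many times did users of this variant trigger this event?"""
        return sum(1 for e in self.events if e.variant == variant and e.name == name)
```

With something like this in place, “did they click save or press ctrl-s?” becomes a comparison of `log.count("trial", "save_click")` against `log.count("trial", "save_shortcut")`, and the same two counts for control.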
This instrumentation sends what it collects back to a central source. Aha! Data! The final step is now clear: figure out an algorithm to run on the data you have which will tell you quantitatively if the user likes the product. Interestingly, one of the less understood problems with AB Testing is “Do I have enough data?”. For example, if you have only a few people using the trial product, and they stop using the product when you switch them to the trial, you might assume they don’t like it, but you should ask the question “what is the likelihood that they were going to quit anyway, even if I didn’t change the product?”. The secret to reducing this random behavior error is to use a bigger user set. But if you plan on launching a dramatic change in your product, you don’t want to force it in the face of any more users than is absolutely necessary, in case it really is bad and they all do stop using it!
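One standard way to put a number on “were they going to quit anyway?” is a two-proportion z-test, which compares the quit rate in the trial group with the quit rate in the control group. This is a sketch of that textbook test, not necessarily the algorithm you would settle on; the function name and the quit counts used below are made up, and the normal approximation it relies on is itself rough when a group is very small.

```python
import math
from typing import Tuple


def two_proportion_z_test(quit_control: int, n_control: int,
                          quit_trial: int, n_trial: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in quit rates."""
    p_control = quit_control / n_control
    p_trial = quit_trial / n_trial
    # Under the null hypothesis both groups share one underlying quit rate.
    pooled = (quit_control + quit_trial) / (n_control + n_trial)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_trial))
    if std_err == 0:
        raise ValueError("quit rate is 0% or 100% in both groups; nothing to test")
    z = (p_trial - p_control) / std_err
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Suppose 20% of control users normally quit. If 3 out of 10 trial users quit, `two_proportion_z_test(200, 1000, 3, 10)` gives a large p-value: that result is easily explained by chance. If 300 out of 1000 trial users quit, the same 30% rate gives a tiny p-value, because the bigger group leaves far less room for random behavior.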
There’s no art to getting the right user size, it’s science, and I’ll write more on it later (if you want to read more on it, look at Statistical Power, and rejecting the null hypothesis).
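As a taste of what that science looks like, here is a sketch of the usual textbook approximation for how many users each group needs when comparing two rates. The function name and the default significance level (5%) and power (80%) are conventional choices I have assumed for the example, not figures from any particular product. It needs Python 3.8 or later for `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def users_needed_per_group(rate_control: float, rate_trial: float,
                           alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate group size needed to detect rate_control vs rate_trial.

    alpha is the chance of a false alarm (seeing a difference that is not
    there); power is the chance of spotting a difference that is there.
    """
    if rate_control == rate_trial:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = rate_control * (1 - rate_control) + rate_trial * (1 - rate_trial)
    effect = rate_trial - rate_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)
```

The thing to notice is the `effect ** 2` on the bottom: halving the difference you want to detect roughly quadruples the number of users you need, which is why small tweaks need big user bases and dramatic changes can be tested on a handful.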