Continuous Delivery - Part 4 - A/B Testing

By Aviran Mordo @ 4:02 pm

Previous chapter: Continuous Delivery - Part 3 - Feature Toggles

UPDATE: We released PETRI, our third-generation experiment system, as an open source project available on GitHub.

From Wikipedia: In web development and marketing, as well as in more traditional forms of advertising, A/B testing or split testing is an experimental approach to web design (especially user experience design), which aims to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement). As the name implies, two versions (A and B) are compared, which are identical except for one variation that might impact a user’s behavior. Version A might be the currently used version, while Version B is modified in some respect. For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can be seen through testing elements like copy text, layouts, images and colors.

Although it sounds similar to feature toggles, there is a conceptual difference between A/B testing and feature toggles. With an A/B test you measure an outcome of a completed feature or flow, which hopefully does not have bugs. A/B testing is a mechanism to expose a finished feature to your users and test their reaction to it. With a feature toggle, on the other hand, you want to test that the code behaves properly, as expected and without bugs. In many cases feature toggles are used on the back-end, where users don’t really experience changes in flow, while A/B tests are used on the front-end, where the new flow or UI is exposed to users.

Consistent user experience.
One important point to notice in A/B testing is consistent user experience. For instance, you cannot display a new menu option one time and then hide it the next time the user returns to your site or refreshes the browser. So whatever strategy your A/B test uses to determine whether a user is in group A or group B, the result should be consistent. If a user comes back to your application they should always “fall” into the same test group.

Achieving a consistent user experience in a web application is tricky. Most web applications have two types of users: anonymous users, who are not signed in to your web app, and signed-in users, who have a valid session and are authenticated on your site.

For authenticated users, achieving consistency is easy. The algorithm you choose to assign the user to a specific test group should work on the user ID. A simple algorithm is a modulus on the user ID. This ensures that whenever the user returns to the site, regardless of the computer or browser they log in from, they always get the same group.
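As a rough illustration of the modulus idea, here is a minimal sketch. It assumes user IDs are integers; the function name `assign_group` and the `percent_b` parameter are made up for this example and are not taken from any real system. If your IDs are strings (GUIDs, for example), you would need to hash them to a number first.

```python
def assign_group(user_id: int, percent_b: int = 50) -> str:
    """Return "A" or "B" for a numeric user ID using a simple modulus.

    Hypothetical helper: percent_b is the share of users (0..100)
    that should see version B.
    """
    if not 0 <= percent_b <= 100:
        raise ValueError("percent_b must be between 0 and 100")
    bucket = user_id % 100  # stable bucket in the range 0..99
    return "B" if bucket < percent_b else "A"


if __name__ == "__main__":
    # The same ID always lands in the same group, on any device.
    print(assign_group(1234), assign_group(1234))
```

Because the result depends only on the user ID, no state has to be stored to keep the assignment consistent.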

For anonymous users this is more complex. Since you don’t know who the user is, you cannot guarantee consistent behavior. We can mitigate the problem by storing a permanent cookie in the user’s browser with the value of the A/B test group the user is assigned to. This ensures that the next time the user returns to the site they get the same group assignment (you should only assign a user to a test group once). However, this method has a flaw because of how the web works. If the user surfs to the site from a different browser or a different computer, or clears the browser cookies, you cannot know that the user was assigned to a test group in the past. You would probably assign the user again, possibly to a different group.
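The cookie approach could look roughly like the sketch below, which uses Python’s standard `http.cookies` module. The cookie name `ab_group`, the one-year lifetime and the function `group_for_anonymous` are assumptions made for this example; wiring the returned value into a `Set-Cookie` response header depends on your web framework.

```python
import random
from http.cookies import SimpleCookie

COOKIE_NAME = "ab_group"               # hypothetical cookie name
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # "permanent" is approximated as one year


def group_for_anonymous(cookie_header: str, percent_b: int = 50, rng=random):
    """Return (group, set_cookie_value) for an anonymous visitor.

    set_cookie_value is None when the visitor already carries a valid
    assignment, so a user is only ever assigned once per browser.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    existing = cookies.get(COOKIE_NAME)
    if existing is not None and existing.value in ("A", "B"):
        return existing.value, None

    # First visit (or cookies were cleared): pick a group at random.
    group = "B" if rng.randrange(100) < percent_b else "A"
    out = SimpleCookie()
    out[COOKIE_NAME] = group
    out[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    out[COOKIE_NAME]["path"] = "/"
    return group, out[COOKIE_NAME].OutputString()


if __name__ == "__main__":
    group, set_cookie = group_for_anonymous("")
    print(group, set_cookie)
    # On the next request the browser sends the cookie back:
    print(group_for_anonymous(f"{COOKIE_NAME}={group}"))
```

Note that the sketch has exactly the flaw described above: a request without the cookie is treated as a brand new visitor.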

A/B testing strategies.
The most common strategy for assigning a test group is percentage. You define what percentage of your users will get A and what percentage will get B.
As with feature toggles, you can have multiple strategies for assigning users to a test group. Strategies we use at Wix include language, GEO location, user type, registration date and more. Of course you can also combine strategies, for instance: “50% of UK users would get B and all the rest would get A”.
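To make the combined example concrete, here is a small sketch of “50% of UK users get B, everyone else gets A”. The `Visitor` class, the filter functions and `assign` are all invented for this illustration and do not reflect how Wix implements its strategies; it also assumes numeric user IDs and ISO country codes (“GB” for the UK).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Visitor:
    user_id: int
    country: str   # assumed ISO 3166-1 alpha-2 code, e.g. "GB"
    language: str  # e.g. "en"


Filter = Callable[[Visitor], bool]


def in_uk(visitor: Visitor) -> bool:
    return visitor.country == "GB"


def speaks_english(visitor: Visitor) -> bool:
    return visitor.language == "en"


def assign(visitor: Visitor, filters: Sequence[Filter], percent_b: int) -> str:
    """Visitors who fail any filter get A; the rest are split by percentage."""
    if not all(f(visitor) for f in filters):
        return "A"
    return "B" if visitor.user_id % 100 < percent_b else "A"


if __name__ == "__main__":
    print(assign(Visitor(10, "GB", "en"), [in_uk], 50))  # UK user, bucket 10
    print(assign(Visitor(10, "US", "en"), [in_uk], 50))  # not in the UK
```

Adding another condition is then just a matter of passing one more filter, for example `[in_uk, speaks_english]`.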

A very important rule is that bots such as Googlebot should always get the “A” version. Otherwise a search engine may index pages that are still under experiment, and the experiment might fail.
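One simple way to apply the bot rule is to look at the User-Agent header before using the assigned group. The sketch below is only an illustration: the marker list is short and incomplete, the function names are made up, and a User-Agent string can be spoofed, so treat it as a starting point rather than a reliable bot detector.

```python
# Illustrative, incomplete list of substrings found in crawler User-Agents.
BOT_MARKERS = ("googlebot", "bingbot", "duckduckbot", "baiduspider", "yandexbot")


def is_bot(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def group_for_request(user_agent: str, assigned_group: str) -> str:
    """Bots always get "A"; everyone else keeps their assigned group."""
    return "A" if is_bot(user_agent) else assigned_group


if __name__ == "__main__":
    googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(group_for_request(googlebot, "B"))
```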

Reporting and analysis.
Since the whole point of A/B testing is to determine whether a new feature improves your goals, all the users who were assigned to a test group should be tracked. Once you decide you have enough test participants, analyze your data and decide whether the new feature is good or not. If the test is successful you would probably increase the size of your test group, or even stop the test and expose the new feature to all your users. On the other hand, if the test was not successful you would stop the test and revert.
If a test was not successful but you want to improve the feature and restart the test, you can pause the test. To keep the user experience consistent, a paused test does not assign any more users to the new feature group, but whoever was already assigned to see the new feature keeps seeing it. When you have made the necessary improvements, you resume the test and resume assigning users to test groups.
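The pause and resume behavior can be sketched as a tiny state machine. Everything here is an assumption for the sake of illustration: the `Experiment` class, the in-memory dictionary (a real system would persist assignments) and the modulus split are not a description of how any particular experiment system works.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"


class Experiment:
    def __init__(self, percent_b: int = 50):
        self.percent_b = percent_b
        self.state = State.RUNNING
        self._assignments = {}  # user_id -> "A" or "B"; in-memory for the sketch

    def pause(self) -> None:
        self.state = State.PAUSED

    def resume(self) -> None:
        self.state = State.RUNNING

    def group_for(self, user_id: int) -> str:
        # Users who were already assigned keep their group, paused or not.
        if user_id in self._assignments:
            return self._assignments[user_id]
        # While paused, new users see A and are NOT recorded,
        # so they can still be assigned once the test resumes.
        if self.state is State.PAUSED:
            return "A"
        group = "B" if user_id % 100 < self.percent_b else "A"
        self._assignments[user_id] = group
        return group


if __name__ == "__main__":
    exp = Experiment(percent_b=50)
    print(exp.group_for(10))  # assigned while running
    exp.pause()
    print(exp.group_for(10))  # keeps the earlier assignment
    print(exp.group_for(20))  # new user during pause sees A
```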

Now you may ask yourself: what does A/B testing have to do with continuous deployment? Since the whole point of continuous deployment is to get quick feedback from users, A/B testing is a great mechanism for getting this feedback.

Next chapter: Startup and Self-test


6 Responses to “Continuous Delivery - Part 4 - A/B Testing”

  1. Arise.io Says:

    Thanks for interesting chapters on A/B testing tools. Indeed, this marketing tool increases outcome of interest by user’s studies and it’s very powerful.


  2. outdoor and landscape lighting for backyard Says:

    Your way of explaining all in this post is genuinely nice, every one be able to easily know
    it, Thanks a lot.

  3. Conor Says:

    Hi Aviran,

    First of all, great article, well explained, many thanks.

    Just on the A/B testing, how does wix interpret a ‘failed’ B workflow through the application?
    For example, do you use analytics to determine the end-user is not taking that workflow, and going somewhere else, thus a failure? Or some other pointers you use to deduce they do not like workflow B, but may expect/prefer the original workflow A?

    I am curious how you deduce that.

    Kind regards,

  4. Aviran Mordo Says:

    We use our own BI tools to analyze user flows. In our data center we have a Hadoop grid that crunches the information coming from our BI logs and builds a flow for the users, so we can analyze the A and the B groups and determine which one wins.

  5. Nitu S Says:

    How long do you carry these A/B tests for each feature? Are the users internal or external?

  6. Aviran Mordo Says:

    @Nitu It depends on the feature. Most experiments live for a week or two, but some can take months. It depends on how complicated and dangerous the feature is.
