How Indeed Uses Proctor for A/B Testing

(Editor’s Note: This post is the second in a series about Proctor, Indeed’s open source A/B testing framework.)

Proctor at Indeed

In a previous blog post, we described the features and tools provided by Proctor, our open-source A/B testing framework. In this follow-up, we share details about how we integrate Proctor into Indeed’s development process.

Customized Proctor Webapp

Our internal deployment of the Proctor Webapp integrates with Atlassian JIRA, Subversion, Git, and Jenkins. We use JIRA for issue linking, various sanity checks, and automating issue workflow. For tracking changes over time, we use Subversion (for historical reasons — Git is also an option). We use Jenkins to launch test matrix builds, and the webapp integrates with our internal operational data store to display which versions of a test are in use in which applications.

Figure 1: Screenshot of a test definition’s change history in the Proctor Webapp

Issue tracking with JIRA

At Indeed, we track everything with JIRA issues, including changes to test definitions. Requests for new tests or changes to existing tests are represented by a custom issue type in JIRA that we called “ProTest” (short for “Proctor Test”). We track ProTest issues in the JIRA project for the application to which the test belongs. The ProTest issues also use a custom workflow that is tied into our deployment of the Proctor Webapp.

After accepting an assigned ProTest issue, the issue owner modifies the test definition using Proctor Webapp. When saving the changes, she must provide a ProTest issue key. Before committing to our Proctor test definition repository, the webapp first verifies that the ProTest issue exists and is in a valid state (for example, is not closed). The webapp then commits the change (on behalf of the logged-in user), referencing the issue key in the commit message.

After the issue owner has made all changes for a ProTest issue, the JIRA workflow is usually as follows:

  1. The issue owner resolves the issue, which moves to state QA Ready.
  2. A release manager uses Proctor Webapp to promote the new definition to QA. The webapp moves the issue state to In QA.
  3. A QA analyst verifies the expected test behavior in our QA environment and verifies the issue, which moves to state Production Ready.
  4. A release manager uses Proctor Webapp to promote the new definition to production, triggering worldwide distribution and activation of the test change within one or two minutes. The webapp moves the issue state to In Production.
  5. A QA analyst verifies the expected test behavior in production and moves the issue state to Pending Closure.
  6. The issue owner closes the issue to reflect that all work is complete and in production.

In cases where we are simply adjusting the size of an active test group, Proctor Webapp skips this process and automatically pushes the change to production.

Our QA team verifies test modifications because those modifications can result in unintended behavior or interact poorly with other tests. Rules in test definitions are a form of deployable code and need to be exercised to ensure correctness. The verification step gives our QA analysts one last chance to catch any unintended consequences before the modifications go live. Consider the case of this rule, intended to make a test available only to English-language users in the US and Canada:

    (lang=='en' && country=='US') || country=='CA'

The parentheses are in the wrong place, allowing French-language Canadians to see behavior that may not be ready for them. A developer forcing himself into the desired group might have missed this bug. When we catch bugs right away during QA, we avoid wasting the time it would take to notice that the desired behavior never made it to production.
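
For reference, the rule as intended would group the country check so that the language requirement applies to both countries:

    lang=='en' && (country=='US' || country=='CA')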

Test definition files

We store test definitions in a single shared project repository called proctor-data. The project contains one file per test definition: test-definitions/<testName>/definition.json
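
For illustration, here is a minimal sketch of a definition file for a hypothetical test named acmebtntst. The buckets follow our convention of -1 for the unlogged inactive group, 0 for control, and 1 for test (described further below), and the allocations structure matches Figure 3 later in this post; treat the exact field set as illustrative and consult the Proctor documentation for the full schema:

    {
        "description": "Hypothetical button test used for illustration",
        "salt": "acmebtntst",
        "testType": "USER",
        "buckets": [
            { "name": "inactive", "value": -1, "description": "Inactive" },
            { "name": "control", "value": 0, "description": "Existing behavior" },
            { "name": "test", "value": 1, "description": "New behavior" }
        ],
        "allocations": [
            {
                "ranges": [
                    { "length": 0.1, "bucketValue": 0 },
                    { "length": 0.8, "bucketValue": -1 },
                    { "length": 0.1, "bucketValue": 1 }
                ]
            }
        ]
    }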

Most test modifications are made via the Proctor Webapp, which edits the JSON in the definition file and commits the changes (on behalf of the logged-in user) to the version control repository.

The definition files are duplicated to two branches in proctor-data: qa and production. When a test definition revision is promoted to QA, the entire test definition file is copied to the qa branch and committed (as opposed to applying or “cherry-picking” the diff associated with a single revision). Similarly, when a test definition revision is promoted to production, the entire file is copied to the production branch and committed. Since we have one file per test definition, this simple approach maintains the integrity of the JSON definition, avoids merge conflicts, and spares us from working out which trunk revision deltas to cherry-pick.

Building and deploying the test matrix

Proctor includes a builder that can combine a set of test definition files into a single test matrix file, while also ensuring that the definitions are internally consistent, do not refer to undefined bucket values, and have allocations that sum to 1.0. This builder can be invoked directly from Java or via an Ant task or a Maven plugin. We build a single matrix file using a Jenkins job that invokes Ant in the proctor-data project. An example of building with Maven is available on GitHub.

A continuous integration (CI) Jenkins job builds the test matrix every time a test change is committed to trunk. That matrix file is made available to applications and services in our CI environment.

When a release manager promotes a test change to QA, a QA-specific Jenkins job builds the test matrix using the qa branch. That generated matrix file is then published to all QA servers. The services and applications that consume the matrix periodically reload it. An equivalent production-specific Jenkins job handles new changes on the production branch.

Proctor in the application

Each project’s Proctor specification JSON file is stored with its source code in a standard path (for example, src/main/resources/proctor). At build time, we invoke the code generator (via a Maven plugin or Ant task), and the generated code is built along with the project’s source code.

When launching a new test, we typically deploy the test matrix before the application code that depends on it. However, if the application code goes out first, Proctor will “fall back” and treat the test as inactive – if you follow our convention of mapping your inactive bucket to value -1.

You can change the fallback behavior by setting fallbackValue to the desired bucket value in the test specification. We follow the convention of falling back on the unlogged inactive group to help ensure that test and control groups do not change size unexpectedly. Suppose you have groups 0 (control) and 1 (test) for a test that runs Monday through Thursday, with fallback to group 0. If your test matrix is broken by a change from Tuesday 2pm to Tuesday 5pm, users who would normally be in the test group are assigned to control for those three hours, so summing your metrics across the whole period from Monday to Thursday skews the results for the control group. If your fallback were -1 (inactive), there would be no skew in your control and test groups.
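
For reference, here is a sketch of a single test’s entry in an application’s specification, using the -1 fallback convention described above (the test name is hypothetical; providedContext declares the context variables, such as the country and lang used in the rule example earlier, that rules may reference):

    {
        "tests": {
            "acmebtntst": {
                "buckets": {
                    "inactive": -1,
                    "control": 0,
                    "test": 1
                },
                "fallbackValue": -1
            }
        },
        "providedContext": {
            "country": "String",
            "lang": "String"
        }
    }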

When adding a new bucket to a test, we typically take this sequence of actions:

  1. Deploy the test matrix with no allocation for the new bucket.
  2. Deploy the application code that is aware of the new bucket.
  3. Redeploy the matrix with an allocation for that bucket.

If the matrix is deployed with an allocation for a new bucket of which the application is unaware, Proctor errs on the side of safety by using the fallback value for all cases. We made Proctor work that way to avoid telling the application to apply an unknown bucket in some cases for some period of time, which could skew analysis.
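
For illustration, step 1 above might look like the following in the definition’s ranges (Figure 3 format): the hypothetical new bucket value 2 is defined in the buckets list but given a zero-length range (leaving it out of the ranges entirely works just as well), so no traffic receives it until the bucket-aware code from step 2 is deployed:

    "ranges": [
        { "length": 0.50, "bucketValue": -1 },
        { "length": 0.25, "bucketValue": 0 },
        { "length": 0.25, "bucketValue": 1 },
        { "length": 0.00, "bucketValue": 2 }
    ]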

We take similar precautions when deleting an entire test from the matrix.

Testing group membership, not non-membership

Proctor’s code generation provides easy-to-use methods for testing group membership. We have found it best to always use these methods to test for membership rather than non-membership. If you’ve made your code conditional on non-membership, you run the risk of getting that conditional behavior in unintended circumstances.

As an example, suppose you have a [50% control, 50% test] split, and in your code you use the conditional expression !groups.isControl(), which is equivalent to groups.isTest(). Then, to reduce the footprint of your test while keeping an equal-sized control group for comparison, you change your test split to [25% control, 50% inactive, 25% test]. Now your conditional expression is equivalent to groups.isTest() || groups.isInactive(). That logic is probably not what you intended, which was to keep the same behavior for control and inactive. In this example, using groups.isTest() in the first place would have prevented you from introducing unintended behavior.
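
Here is a brief sketch of the pattern in Java. ExampleGroups stands in for the Proctor-generated groups class for your application, and the render methods are placeholders for application behavior:

    // Sketch only: ExampleGroups and the render methods are hypothetical.
    void renderWidget(final ExampleGroups groups) {
        // Preferred: condition on membership in the specific bucket.
        if (groups.isTest()) {
            renderNewWidget();   // new behavior, shown only to the test bucket
        } else {
            renderOldWidget();   // control, inactive, and any buckets added later
        }
        // Risky alternative: if (!groups.isControl()) { ... } also matches inactive
        // (and any future bucket), so the new behavior can leak outside the test group.
    }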

Evolving bucket allocations

We recognize that assigning users to test buckets may affect how the site behaves for them. Proctor on its own cannot ensure consistency of experience across successive page views or visits as a test evolves. When growing or shrinking allocations, we consider carefully how users will be affected. 

Usually, once a user is assigned to a bucket, we’d like for that user to continue to see the behavior associated with that bucket as long as that behavior is being tested. If your allocations started as [10% control, 10% test, 80% inactive], you would not want to grow to [50% control, 50% test], because users initially in the test bucket would be moved to the control bucket.

There are two strategies for stable growth of buckets. In the “split bucket” strategy (Figure 2), you add new ranges for the existing buckets, moving from 10/10 to 50/50 by taking two additional 40% chunks from the inactive range. The resulting JSON is shown in Figure 3.

Figure 2: Growing control and test by splitting buckets into multiple ranges

 

  "allocations": [
  {
      "ranges": [
      {
          "length": 0.1,
          "bucketValue": 0
      },
      {
          "length": 0.1,
          "bucketValue": 1
      },
      {
          "length": 0.4,
          "bucketValue": 0
      },
      {
          "length": 0.4,
          "bucketValue": 1
      }
      ]
  }
  ]

Figure 3: JSON for “split bucket” strategy; 0 is control and 1 is test

In the “room-to-grow” strategy, you leave enough inactive space between buckets so that you can adjust the size of the existing ranges, as in Figure 4.

Figure 4: Growing control and test by updating range lengths to grow into the inactive middle
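
In the Figure 3 format, the starting allocation for this strategy keeps the inactive bucket (value -1) between control and test. Growing from 10% each to 50% each later only requires changing the three lengths (to 0.5, 0.0, and 0.5), so users already in control or test keep their buckets as the ranges around them expand:

    "allocations": [
        {
            "ranges": [
                { "length": 0.1, "bucketValue": 0 },
                { "length": 0.8, "bucketValue": -1 },
                { "length": 0.1, "bucketValue": 1 }
            ]
        }
    ]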

We use the “room-to-grow” strategy whenever possible, as it results in more readable test definitions, both in JSON and the Proctor Webapp.

Useful helpers

Proctor includes some utilities that make it easier to work with Proctor in web application deployments:

  • a Spring controller that provides three views: the groups for the current request, a condensed version of the current test matrix, and the JSON test matrix containing only those tests in the application’s specification;
  • a Java servlet that provides a view of the application’s specification; and
  • support for a URL parameter that allows you to force yourself into a test bucket (persistent via a browser cookie).

We grant access to these utilities in our production environment only to privileged IP addresses, and we recommend you do the same.

It works for Indeed, it can work for you

Proctor has become a crucial part of Indeed’s data-driven approach to product development, with over 100 tests and 300 test variations currently in production. To get started with Proctor, dive into our Quick Start guide. To peruse the source code or contribute your own enhancements, visit our GitHub page.