A core part of running an experiment is to assign an experimental unit (for instance a customer) to a specific treatment (payment button variant, marketing push notification framing). Often this assignment needs to meet the following conditions:
- It needs to be random.
- It needs to be stable. If the customer comes back to the screen, they need to be exposed to the same payment button variant.
- It needs to be retrieved or generated very quickly.
- It needs to be available after the actual assignment so it can be used in downstream analysis.
When organizations first start their experimentation journey, a common pattern is to pre-generate assignments, store it in a database and then retrieve it at the time of assignment. This is a perfectly valid method to use and works great when you’re starting off. However, as you start to scale in customer and experiment volumes, this method becomes harder and harder to maintain and use reliably. You’ve got to (a) manage the complexity of storage, (b) ensure that assignments are actually random within and across experiments, and (c) retrieve the assignment reliably. All of these are hard at scale.
Using hash spaces helps solve some of these problems. It’s a simple solution but isn’t as widely known as it probably should be. This blog is an attempt to explain the technique. There are links to code in different languages at the end. However, if you’d like, you can also directly jump to code here.
We’re running an experiment to test which variant of a progress bar on our customer app drives the most engagement. There are three variants: Control (the default experience), Variant A and Variant B.
We have 10 million customers that use our app every week and we want to ensure that these 10 million customers get randomly assigned to one of the three variants. Each time the customer comes back to the app they should see the same variant. We want control to be assigned with a 50% probability, Variant 1 to be assigned with a 30% probability and Variant 2 to be assigned with a 20% probability.
probability_assignments = {"Control": 50, "Variant 1": 30, "Variant 2": 20}
To make things simpler, we’ll start with 4 customers. These customers have IDs that we use to refer to them. These IDs are generally either GUIDs (something like "b7be65e3-c616-4a56-b90a-e546728a6640"
) or integers (like 1019222, 1028333). Any of these ID types would work but to make things easier to follow we’ll simply assume that these IDs are: “Customer1”, “Customer2”, “Customer3”, “Customer4”.
This method primarily relies on using hash algorithms that come with some very desirable properties. Hashing algorithms take a string of arbitrary length and map it to a ‘hash’ of a fixed length. The easiest way to understand this is through some examples.
A hash function, takes a string and maps it to a constant hash space. In the example below, a hash function (in this case md5) takes the words: “Hello”, “World”, “Hello World” and “Hello WorLd” (note the capital L) and maps it to an alphanumeric string of 32 characters.
A few important things to note:
- The hashes are all of the same length.
- A minor difference in the input (capital L instead of small L) changes the hash.
- Hashes are a hexadecimal string. That is, they comprise of the numbers 0 to 9 and the first six alphabets (a, b, c, d, e and f).
We can use this same logic and get hashes for our four customers:
import hashlibrepresentative_customers = ["Customer1", "Customer2", "Customer3", "Customer4"]
def get_hash(customer_id):
hash_object = hashlib.md5(customer_id.encode())
return hash_object.hexdigest()
{customer: get_hash(customer) for customer in representative_customers}
# {'Customer1': 'becfb907888c8d48f8328dba7edf6969',
# 'Customer2': '0b0216b290922f789dd3efd0926d898e',
# 'Customer3': '2c988de9d49d47c78f9f1588a1f99934',
# 'Customer4': 'b7ca9bb43a9387d6f16cd7b93a7e5fb0'}
Hexadecimal strings are just representations of numbers in base 16. We can convert them to integers in base 10.
⚠️ One important note here: We rarely need to use the full hash. In practice (for instance in the linked code) we use a much smaller part of the hash (first 10 characters). Here we use the full hash to make explanations a bit easier.
def get_integer_representation_of_hash(customer_id):
hash_value = get_hash(customer_id)
return int(hash_value, 16){
customer: get_integer_representation_of_hash(customer)
for customer in representative_customers
}
# {'Customer1': 253631877491484416479881095850175195497,
# 'Customer2': 14632352907717920893144463783570016654,
# 'Customer3': 59278139282750535321500601860939684148,
# 'Customer4': 244300725246749942648452631253508579248}
There are two important properties of these integers:
- These integers are stable: Given a fixed input (“Customer1”), the hashing algorithm will always give the same output.
- These integers are uniformly distributed: This one hasn’t been explained yet and mostly applies to cryptographic hash functions (such as md5). Uniformity is a design requirement for these hash functions. If they were not uniformly distributed, the chances of collisions (getting the same output for different inputs) would be higher and weaken the security of the hash. There are some explorations of the uniformity property.
Now that we have an integer representation of each ID that’s stable (always has the same value) and uniformly distributed, we can use it to get to an assignment.
Going back to our probability assignments, we want to assign customers to variants with the following distribution:
{"Control": 50, "Variant 1": 30, "Variant 2": 20}
If we had 100 slots, we can divide them into 3 buckets where the number of slots represents the probability we want to assign to that bucket. For instance, in our example, we divide the integer range 0–99 (100 units), into 0–49 (50 units), 50–79 (30 units) and 80–99 (20 units).
def divide_space_into_partitions(prob_distribution):
partition_ranges = []
start = 0
for partition in prob_distribution:
partition_ranges.append((start, start + partition))
start += partition
return partition_rangesdivide_space_into_partitions(prob_distribution=probability_assignments.values())
# note that this is zero indexed, lower bound inclusive and upper bound exclusive
# [(0, 50), (50, 80), (80, 100)]
Now, if we assign a customer to one of the 100 slots randomly, the resultant distribution should then be equal to our intended distribution. Another way to think about this is, if we choose a number randomly between 0 and 99, there’s a 50% chance it’ll be between 0 and 49, 30% chance it’ll be between 50 and 79 and 20% chance it’ll be between 80 and 99.
The only remaining step is to map the customer integers we generated to one of these hundred slots. We do this by extracting the last two digits of the integer generated and using that as the assignment. For instance, the last two digits for customer 1 are 97 (you can check the diagram below). This falls in the third bucket (Variant 2) and hence the customer is assigned to Variant 2.
We repeat this process iteratively for each customer. When we’re done with all our customers, we should find that the end distribution will be what we’d expect: 50% of customers are in control, 30% in variant 1, 20% in variant 2.
def assign_groups(customer_id, partitions):
hash_value = get_relevant_place_value(customer_id, 100)
for idx, (start, end) in enumerate(partitions):
if start return idx
return Nonepartitions = divide_space_into_partitions(
prob_distribution=probability_assignments.values()
)
groups = {
customer: list(probability_assignments.keys())[assign_groups(customer, partitions)]
for customer in representative_customers
}
# output
# {'Customer1': 'Variant 2',
# 'Customer2': 'Variant 1',
# 'Customer3': 'Control',
# 'Customer4': 'Control'}
The linked gist has a replication of the above for 1,000,000 customers where we can observe that customers are distributed in the expected proportions.
# resulting proportions from a simulation on 1 million customers.
{'Variant 1': 0.299799, 'Variant 2': 0.199512, 'Control': 0.500689