A toy game puzzle for AB testers, with a genie

Alex Bowler
4 min read · May 14, 2021


In a cave you find a genie looking after hundreds of small bags. He tells you that half of them contain 9 pieces of gold, and 11 worthless pieces of lead. The other half contain 11 pieces of gold and 9 worthless pieces of lead. There is no way to tell which is which without peeking inside.

He offers you the chance to buy as many bags as you like for 10 gold pieces each. But you know that’s pointless, because they are only worth an average of exactly 10 gold pieces (the bag and the lead must be genuinely worthless; maybe you have to pay to dispose of them?).

But then he makes you a valuable offer: if you ask nicely he will peek into a bag, choose a piece at random, and tell you if it is gold (it’s really random, and the pieces stay in the bag). Then you can still decide to buy the bag for 10 gold pieces if you want to.

Actually he will let you peek 100 times! You can peek 100 times in the same bag, or once each into 100 different bags, and it’s still up to you which bags you would like to buy.
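For anyone who wants to play along at home, the setup is easy to model. A minimal sketch (the names and structure are mine, not the genie’s):

```python
import random

GOOD, BAD, PRICE = 11, 9, 10  # gold pieces in each bag type; asking price

def new_bag(rng):
    """Half the bags hold 11 gold (and 9 lead), half 9 gold (and 11 lead)."""
    return GOOD if rng.random() < 0.5 else BAD

def peek(gold_in_bag, rng):
    """Genie shows a random one of the 20 pieces; True means it was gold."""
    return rng.random() < gold_in_bag / 20

# Blind buying is break-even: the average bag holds exactly 10 gold pieces.
rng = random.Random(1)
avg_profit = sum(new_bag(rng) - PRICE for _ in range(100_000)) / 100_000
print(round(avg_profit, 2))  # close to 0
```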

Questions

1. How many times should you peek in a single bag if you are looking for evidence, significant at a 95% level, to reject the hypothesis that it’s a 9-gold-piece bag before deciding to buy it?

2. What’s the best peeking and buying strategy?

3. How much is that strategy worth?

4. Does the strategy change if you have 1,000 peeks? 1 million?

Updates for my answers

1. What does the hypothesis test say?

Confession: I don’t remember all the nuances of this method! But it looks to me like an old-school AB tester would spend the whole 100 peeks on 1 bag and test for a gold rate above 45%. The standard deviation over 100 peeks is about 5 conversions (√(100 × 0.45 × 0.55) ≈ 5), so a couple of standard deviations above 45 conversions is 55. So I reckon they’d get a true positive about half the time it’s a valuable bag and miss the rest.

That’s a pretty weak test, so I guess the ‘rigorous’ crowd would say “we just can’t know with any rigour whether it’s worth more than 10, so we can’t invest” and earn zero profit.

The cowboys run the weak test, 100 peeks in 1 bag, and invest (almost always correctly) about 25% of the time: half the time the bag is valuable, and about half of those clear the bar. That’s a strategy worth near enough +0.25 pieces of gold profit.
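As a sanity check on those rough numbers, here’s the exact binomial version (a sketch; the exact cutoff and power land in the same ballpark as the two-sigma back-of-envelope, if a touch more generous):

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 100
# One-sided 95% test of H0: it's a bad bag (45% chance of gold per peek).
cutoff = next(k for k in range(n + 1) if binom_tail(n, 0.45, k) <= 0.05)
power = binom_tail(n, 0.55, cutoff)      # P(reject H0 | good bag)
false_pos = binom_tail(n, 0.45, cutoff)  # P(reject H0 | bad bag)
# Buy only on rejection; good and bad bags are equally likely, worth +/-1.
profit = 0.5 * power - 0.5 * false_pos
print(cutoff, round(power, 2), round(profit, 2))
```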

2. What is the best strategy?

Incrementally, an un-peeked bag is worth zero. 1 gold peek comes from a ‘+1’ bag 11/20ths of the time, so that bag is worth 0.55*(+1) + 0.45*(-1) = +0.1.

1 lead peek gives us a bag worth -0.1, and there is no benefit in looking in there again. It’s already more likely to be a bad bag than a good one, and even seeing gold next would only bring it back to a worth of zero, not +0.1, so we never look again in the same bag after seeing lead.

A 2nd peek after seeing gold will show gold again 0.55*0.55 + 0.45*0.45 = 50.5% of the time, leaving us with a bag worth about +0.198. But 0.505 * 0.198 = 0.1, exactly what the bag is already worth, so the 2nd peek adds nothing: if it shows lead we’re indifferent between buying and passing anyway. A first peek into a fresh bag, by contrast, is worth 0.5 * (+0.1) = +0.05, so it’s always better to start a fresh bag than to improve certainty on an existing one.
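The Bayesian arithmetic can be checked in a few lines (a sketch; each gold peek multiplies the good-bag odds by 11/9, each lead peek by 9/11):

```python
def p_good(golds, leads):
    """Posterior probability the bag holds 11 gold, given the peeks seen.
    Prior odds are even; each gold multiplies the odds by 11/9."""
    odds = (11 / 9) ** (golds - leads)
    return odds / (1 + odds)

def bag_value(golds, leads):
    """Expected profit of buying at 10: +1 if a good bag, -1 if a bad one."""
    return 2 * p_good(golds, leads) - 1

print(round(bag_value(1, 0), 3))  # one gold peek: +0.1
print(round(bag_value(2, 0), 3))  # two gold peeks: ~+0.198

# Incremental value of a 2nd peek on a one-gold bag: we buy only if it
# shows gold again (on lead the bag drops back to 0 and we pass).
p_gold_again = p_good(1, 0) * 0.55 + (1 - p_good(1, 0)) * 0.45  # 0.505
ev_after = p_gold_again * bag_value(2, 0)
print(round(abs(ev_after - bag_value(1, 0)), 6))  # 0.0: 2nd peek adds nothing
# A first peek on a fresh bag is worth 0.5 * bag_value(1, 0) = +0.05.
```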

3. What is it worth?

So we peek once each into 100 bags, expect to see about 50 gold pieces, and buy those 50 bags. 11/20ths of them are profitable: 27.5 * (+1) + 22.5 * (-1) = total profit +5.
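That expectation is easy to confirm by simulation. A sketch of the peek-once-per-bag strategy (the trial count and seed are arbitrary):

```python
import random

rng = random.Random(42)

def run_once():
    """Peek once each into 100 bags; buy (at 10) every bag that showed gold."""
    profit = 0
    for _ in range(100):
        good = rng.random() < 0.5        # half the bags hold 11 gold pieces
        p_gold = 0.55 if good else 0.45  # chance the peeked piece is gold
        if rng.random() < p_gold:        # gold seen -> buy the bag
            profit += 1 if good else -1  # 11 or 9 gold for a price of 10
    return profit

trials = 20_000
avg = sum(run_once() for _ in range(trials)) / trials
print(round(avg, 1))  # close to +5 gold pieces per run of 100 peeks
```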

4. Does the strategy change if you have more peeks to work with?

No. But the cowboy hypothesis tester probably earns even less because they take their time.

Conclusions

This is huge if you run a business. There are contexts where engaging a lackadaisical AB tester who uses kind-of null hypothesis significance testing (NHST), without all the assumptions met, would leave the vast majority of incremental value on the table.

Engaging a ‘rigorous’ NHST expert, who makes sure all the assumptions are in place and never cracks and invests early, would be far worse than that.

Just to restate that: in the example above, which is analogous to a queue of consecutive AB tests, NHST (applied generously) does at least 20 times worse than the optimal profit-maximising approach.

Here’s a prediction: once DeepMind or Watson is running our AB testing strategies, having reinvented its methods from the ground up based on the outcome to be achieved, the chance it settles on anything that looks like NHST is zero.

So do I never use NHST frameworks or language? No, I do, sort of. Everyone who did maths at 17 or 18 years old learned them in school, and more importantly they were in a module the C-suite did in their MBAs in the 90s. They didn’t understand any of it, but they remembered to ask ‘is it significant?’.

So while my recommendations for action treat the uplift and variance as only part of the evidence, and a significance level as effectively no incremental evidence, the result sometimes gets presented as “this is +n standard deviations, which is like a p-value of y and would be significant in an NHST test at z level (with assumptions)”, to help tell the story in the language some of the audience expects.
