That’s a great question, and by no means an easy one to answer. So let’s set the math aside for a moment and briefly discuss the idea behind A/B testing.
As you already know, A/B testing is all about comparing an existing page with an alternative design, copy, etc., and seeing if the new variant is “better” than the original page. To discover which variant is better, you simply track the number of page views each variant gets and the number of conversions (e.g. purchases) each variant generates, then compute and compare the conversion rates. If you think about it, there are only three possible outcomes:
- Variant A is better (i.e. its conversion rate is higher)
- Variant B is better
- Both variants perform roughly the same
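The mechanics are simple: a conversion rate is just conversions divided by tracked page views. As a minimal sketch (the view and conversion counts below are invented for illustration):

```python
def conversion_rate(conversions: int, views: int) -> float:
    """Fraction of tracked page views that resulted in a conversion."""
    return conversions / views

# Hypothetical tracking numbers, purely for illustration.
rate_a = conversion_rate(conversions=90, views=3000)   # 0.03, i.e. 3%
rate_b = conversion_rate(conversions=120, views=3000)  # 0.04, i.e. 4%

if rate_b > rate_a:
    print("Variant B converts better in this sample")
```

Of course, “better in this sample” is exactly the catch: whether that difference reflects reality depends on how many visitors you observed.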
The problem you face when running an A/B test is that you won’t be able to tell with absolute certainty whether it’s (1), (2), or (3), because you’re not tracking every visitor forever; you’re only tracking a sample of visitors. And that’s where sample size comes in.
If you expect one variant to outperform the other by an extremely large margin, the sample size you’ll need to prove that will be pretty small. After all, you’re assuming (let’s say) variant B is way, way better, so you should see the improvements pretty quickly.
Conversely, if you expect one variant to be only slightly better than the other (say, 0.5% better), the required sample size will be much larger. And this also makes sense: the margin is so small that every new user who participates may tip the outcome one way or the other, so you need a larger sample to make sure the results you get are “stable.”
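To make that relationship concrete, here’s a sketch of the standard two-proportion sample-size formula (per variant, at 5% significance and 80% power), using only the Python standard library. The baseline and lift figures are made up; this is an approximation, not whatever exact formula a given tool uses:

```python
from statistics import NormalDist

def sample_size_per_variant(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to reliably detect the
    difference between two conversion rates (two-proportion formula)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # desired statistical power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ((z_alpha + z_power) ** 2) * variance / (p_base - p_variant) ** 2

# A huge expected lift (3% -> 6%) needs relatively few visitors...
big_lift = sample_size_per_variant(0.03, 0.06)     # roughly 750 per variant
# ...while a tiny one (3% -> 3.5%) needs vastly more.
small_lift = sample_size_per_variant(0.03, 0.035)  # roughly 20,000 per variant
```

Notice how shrinking the expected improvement from 3 percentage points to half a point inflates the required sample by more than an order of magnitude.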
Now, back to the numbers offered by VWO. As far as I can tell, they’re accurate. If you want to be rigorous and make sure the data you collect and the conclusions you draw from your tests are actually true (instead of being the result of random chance), you need larger samples when the expected improvement is small.
However, this is also a balancing act. Smaller shops should avoid analysis paralysis at all costs. You know you need a large sample, but you can’t afford one… so you settle for a smaller one. You know the results aren’t 100% trustworthy (meaning the improvement you see, if any, might be the result of pure chance), but you did the best you could, collected some data, and interpreted the results with a grain of salt.
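One way to keep that grain of salt quantitative is a two-proportion z-test: it turns your observed counts into a p-value, and with a small sample it will honestly tell you the difference could be noise. A sketch, with invented counts:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, views_a, conv_b, views_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / views_a, conv_b / views_b
    pooled = (conv_a + conv_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed lift (3% -> 4%), very different sample sizes:
print(two_proportion_p_value(150, 5000, 200, 5000))  # small p-value: likely real
print(two_proportion_p_value(6, 200, 8, 200))        # large p-value: could be chance
```

The second call illustrates the point above: the observed improvement is identical, but with only 200 visitors per variant the test simply can’t rule out luck.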
As long as you’re aware of the limitations your tests have when the sample size is not large enough, I’d say you shouldn’t worry too much. Some data is better than no data, if you keep in mind it’s not perfect data.