A/B testing lets you compare two versions of a design with real users and real data. Instead of debating which button color works better, you measure it. Done well, A/B testing replaces opinion with evidence.
How A/B Testing Works
An A/B test splits your traffic between two (or more) versions:
- Control (A) — the current version
- Variant (B) — the modified version
Users are randomly assigned to one version. You measure a specific metric for each group and determine whether the difference is statistically significant.
The process:
- Form a hypothesis
- Define the metric you will measure
- Calculate the required sample size
- Run the test until you reach that sample size
- Analyze the results
- Make a decision
Writing Hypotheses
A strong hypothesis connects a change to an expected outcome with a reason:
Template: "If we [change], then [metric] will [improve/decrease] because [reason]."
Examples:
- "If we move the CTA button above the fold, then click-through rate will increase because users will see it without scrolling."
- "If we simplify the checkout form from 5 fields to 3, then completion rate will increase because there is less friction."
- "If we add social proof (review count) to product cards, then add-to-cart rate will increase because users trust products that others have purchased."
A hypothesis without a reason is just a guess. The reason forces you to think about why the change should work, which helps you interpret results correctly.
Choosing Metrics
Primary Metric
Every test needs one primary metric — the single number that determines success or failure. Common primary metrics:
| Goal | Primary Metric |
|---|---|
| More signups | Signup conversion rate |
| More purchases | Purchase conversion rate |
| Better engagement | Time on page or pages per session |
| Less friction | Task completion rate |
| Revenue growth | Revenue per visitor |
Guardrail Metrics
Guardrail metrics ensure your change does not cause unintended harm:
- If you are optimizing signup rate, guard against reduced activation rate (people signing up but never using the product)
- If you are optimizing click-through rate, guard against increased bounce rate on the next page
- If you are optimizing revenue per visitor, guard against decreased customer satisfaction
Avoid Vanity Metrics
- Page views — can be inflated by confusion (users clicking around lost)
- Time on site — can increase because users are struggling, not engaged
- Number of clicks — more clicks can mean worse navigation
Always tie metrics to a business outcome. "Conversion rate" is meaningful. "Number of hover events" is not.
Sample Size & Duration
Running a test for too short a period or with too few users produces unreliable results: underpowered tests miss real effects, and stopping at the first "significant" reading inflates false positives. Use a sample size calculator before starting:
The inputs you need:
- Baseline conversion rate — your current metric value (e.g., 3% conversion)
- Minimum detectable effect (MDE) — the smallest improvement worth detecting (e.g., 10% relative improvement, meaning 3% to 3.3%)
- Statistical significance level — typically 95% (alpha = 0.05)
- Statistical power — typically 80% (beta = 0.20)
For a baseline of 3% and an MDE of 10% relative, the standard formula gives roughly 53,000 visitors per variation at 95% significance and 80% power. This is why A/B testing requires meaningful traffic.
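As a sketch, the per-variant sample size can be computed with the normal-approximation formula for a two-proportion z-test. The function name and defaults here are illustrative, and different calculators use slightly different methods, so exact numbers vary:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 10% relative MDE (3% -> 3.3%)
print(sample_size_per_variant(0.03, 0.10))
```

Note how the required sample size explodes as the MDE shrinks: halving the detectable effect roughly quadruples the traffic you need.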
Duration Rules
- Minimum 1 week — to capture day-of-week effects
- Maximum 4 weeks — longer tests risk external factors skewing results
- Never peek and stop early — checking results daily and stopping when "significant" inflates your false positive rate from 5% to as high as 30%
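To see why peeking is dangerous, here is a small simulation of an A/A test: both arms share the same true conversion rate, so every "significant" result is by definition a false positive. The traffic numbers are made up for illustration:

```python
import random
from math import erf, sqrt

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
DAYS, USERS_PER_DAY, RATE, SIMS = 14, 500, 0.03, 400

peeked = fixed = 0
for _ in range(SIMS):
    a = b = n = 0
    ever_significant = False
    for _day in range(DAYS):
        n += USERS_PER_DAY
        a += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        b += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        # a peeker checks every day and stops at the first p < 0.05
        if p_value(a, n, b, n) < 0.05:
            ever_significant = True
    peeked += ever_significant
    # the disciplined tester looks only once, at the planned end
    fixed += p_value(a, n, b, n) < 0.05

print(f"daily peeking:  {peeked / SIMS:.0%} false positives")
print(f"one final look: {fixed / SIMS:.0%} false positives")
```

The single-look rate stays near the nominal 5%, while the daily-peeking rate climbs well above it, because fourteen looks give randomness fourteen chances to cross the threshold.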
Analyzing Results
Statistical Significance
Statistical significance tells you whether the observed difference is likely real or due to random chance. A p-value below 0.05 means that, if there were truly no difference between the versions, a difference this large would appear less than 5% of the time.
But statistical significance does not mean practical significance. A test might show a statistically significant 0.1% improvement — which may not be worth the engineering effort to implement.
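A minimal sketch of checking both kinds of significance with a pooled two-proportion z-test; the conversion counts below are hypothetical:

```python
from math import erf, sqrt

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for raw conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p

# Hypothetical counts: control converts 1,500/50,000, variant 1,675/50,000
lift, p = two_proportion_test(1500, 50_000, 1675, 50_000)
print(f"lift: {lift:+.2%}, p-value: {p:.4f}")
```

The p-value answers "is this real?"; the lift answers "is this worth shipping?". Judge both before acting.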
Reading Results
| Scenario | Action |
|---|---|
| Variant wins with significance | Ship the variant |
| Control wins with significance | Keep the control, learn from the failure |
| No significant difference | Keep the control (simpler). The change did not matter. |
| Guardrail metric degraded | Do not ship, even if primary metric improved |
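The table above can be read as a small decision rule. This helper is purely illustrative, but it makes one property explicit: a degraded guardrail vetoes everything else:

```python
def decide(primary_significant: bool, variant_improved: bool,
           guardrails_ok: bool) -> str:
    """Decision rule mirroring the results table: guardrails veto first,
    then the variant must win with significance; otherwise keep control."""
    if not guardrails_ok:
        return "do not ship"
    if primary_significant and variant_improved:
        return "ship variant"
    return "keep control"

print(decide(True, True, False))  # guardrail degraded -> "do not ship"
```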
Common Pitfalls
- Multiple testing problem — testing 10 variations inflates false positives. Use Bonferroni correction or a multi-armed bandit approach.
- Segment fishing — after a test fails, looking for a segment where it worked ("It worked for users aged 25-34 on iOS!") is data mining, not science. Pre-define segments.
- Network effects — if users interact with each other (social apps), A/B tests can leak between groups.
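Bonferroni correction is simple to apply: divide your significance level by the number of comparisons. A sketch with hypothetical p-values from five variants tested against one control:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons survive a Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from five variant-vs-control comparisons
p_values = [0.030, 0.004, 0.200, 0.011, 0.048]
print(bonferroni_significant(p_values))
# threshold is 0.05 / 5 = 0.01, so only the 0.004 result survives
```

Note that 0.030 and 0.048 would both have looked "significant" in isolation; the correction is what keeps ten-variant tests honest.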
Analytics as Research
A/B testing is reactive — you need a hypothesis first. Analytics helps you discover what to test:
Funnel Analysis
Map the user journey and measure drop-off at each step:
```
Landing page: 10,000 visitors (100%)
  → Product page: 3,500 (35%)
  → Add to cart: 1,200 (12%)
  → Checkout: 800 (8%)
  → Purchase: 400 (4%)
```

The biggest drop-off (landing to product: 65% loss) is your biggest opportunity.
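The drop-off math above is worth automating, since each step can be read two ways: share of total traffic remaining, and share lost at that specific step. A sketch using the same counts:

```python
# Step counts from the funnel above
funnel = [
    ("Landing page", 10_000),
    ("Product page", 3_500),
    ("Add to cart", 1_200),
    ("Checkout", 800),
    ("Purchase", 400),
]

top = funnel[0][1]
step_loss = {}
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    step_loss[name] = 1 - n / prev_n
    print(f"{prev_name} -> {name}: {n / top:.0%} of total remain, "
          f"{step_loss[name]:.0%} lost at this step")
```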
Heatmaps and Session Recordings
Tools like Hotjar and FullStory show where users click, scroll, and get stuck. Use them to generate hypotheses:
- Users are clicking a non-clickable element — make it clickable or change the styling
- Users are not scrolling past the hero section — move important content up
- Users are rage-clicking a button — it might appear unresponsive
Cohort Analysis
Compare groups of users based on when they signed up or what feature they used. This reveals:
- Whether onboarding changes improved retention for new cohorts
- Which features correlate with long-term engagement
- Whether a bug affected a specific time period
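A cohort retention table can be built from a simple activity log. The users, cohorts, and weeks below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, signup cohort, weeks the user was active)
events = [
    ("u1", "2024-01", [0, 1, 2]),
    ("u2", "2024-01", [0]),
    ("u3", "2024-01", [0, 1]),
    ("u4", "2024-02", [0, 1, 2, 3]),
    ("u5", "2024-02", [0, 2]),
]

cohort_sizes = defaultdict(int)
active = defaultdict(lambda: defaultdict(int))
for _user, cohort, weeks in events:
    cohort_sizes[cohort] += 1
    for week in weeks:
        active[cohort][week] += 1

# Each row: share of the cohort still active in weeks 0-3
for cohort in sorted(cohort_sizes):
    row = " ".join(f"{active[cohort][w] / cohort_sizes[cohort]:>4.0%}"
                   for w in range(4))
    print(cohort, row)
```

Reading rows side by side is what reveals whether, say, an onboarding change shipped in February actually improved week-2 retention over the January cohort.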
Building a Testing Culture
A/B testing is most powerful when it is a habit, not a one-time event:
- Log all tests — keep a shared document with hypothesis, results, and learnings
- Share results widely — even failed tests teach the team something
- Iterate — a failed test refines your understanding. Use the insight for the next test
- Test big changes first — micro-optimizations (button color) rarely move metrics. Test structural changes (different flows, different value propositions)