A/B Testing & Analytics

A/B testing lets you compare two versions of a design with real users and real data. Instead of debating which button color works better, you measure it. Done well, A/B testing replaces opinion with evidence.

How A/B Testing Works

An A/B test splits your traffic between two (or more) versions:

  • Control (A) — the current version
  • Variant (B) — the modified version

Users are randomly assigned to one version. You measure a specific metric for each group and determine whether the difference is statistically significant.

The process:

  1. Form a hypothesis
  2. Define the metric you will measure
  3. Calculate the required sample size
  4. Run the test until you reach that sample size
  5. Analyze the results
  6. Make a decision

Writing Hypotheses

A strong hypothesis connects a change to an expected outcome with a reason:

Template: "If we [change], then [metric] will [improve/decrease] because [reason]."

Examples:

  • "If we move the CTA button above the fold, then click-through rate will increase because users will see it without scrolling."
  • "If we simplify the checkout form from 5 fields to 3, then completion rate will increase because there is less friction."
  • "If we add social proof (review count) to product cards, then add-to-cart rate will increase because users trust products that others have purchased."

A hypothesis without a reason is just a guess. The reason forces you to think about why the change should work, which helps you interpret results correctly.

Choosing Metrics

Primary Metric

Every test needs one primary metric — the single number that determines success or failure. Common primary metrics:

Goal                  Primary Metric
More signups          Signup conversion rate
More purchases        Purchase conversion rate
Better engagement     Time on page or pages per session
Less friction         Task completion rate
Revenue growth        Revenue per visitor

Guardrail Metrics

Guardrail metrics ensure your change does not cause unintended harm:

  • If you are optimizing signup rate, guard against reduced activation rate (people signing up but never using the product)
  • If you are optimizing click-through rate, guard against increased bounce rate on the next page
  • If you are optimizing revenue per visitor, guard against decreased customer satisfaction

Avoid Vanity Metrics

  • Page views — can be inflated by confusion (users clicking around lost)
  • Time on site — can increase because users are struggling, not engaged
  • Number of clicks — more clicks can mean worse navigation

Always tie metrics to a business outcome. "Conversion rate" is meaningful. "Number of hover events" is not.

Sample Size & Duration

Running a test that is too short or underpowered produces unreliable results: a small sample can miss real effects, and stopping as soon as the numbers look good inflates false positives. Use a sample size calculator before starting:

The inputs you need:

  • Baseline conversion rate — your current metric value (e.g., 3% conversion)
  • Minimum detectable effect (MDE) — the smallest improvement worth detecting (e.g., 10% relative improvement, meaning 3% to 3.3%)
  • Confidence level — typically 95% (significance level alpha = 0.05)
  • Statistical power — typically 80% (beta = 0.20)

For a baseline of 3% and MDE of 10% relative, the standard two-proportion formula gives roughly 53,000 visitors per variation. This is why A/B testing requires meaningful traffic.
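The arithmetic can be sketched with the standard two-proportion, normal-approximation formula. This is a simplified version; commercial calculators use slightly different formulas and may report somewhat different numbers:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided
    two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.03, 0.10))  # roughly 53,000 per variant
```

Note how the required sample grows as the baseline rate or the MDE shrinks: detecting a 5% relative lift at the same baseline needs about four times as many visitors as a 10% lift.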

Duration Rules

  • Minimum 1 week — to capture day-of-week effects
  • Maximum 4 weeks — longer tests risk external factors skewing results
  • Never peek and stop early — checking results daily and stopping when "significant" inflates your false positive rate from 5% to as high as 30%
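The peeking problem can be demonstrated with a quick A/A simulation (hypothetical numbers; both groups share the same true conversion rate, so every "significant" result is a false positive). Stopping at the first significant peek flags far more tests than checking only once at the end:

```python
import math
import random

def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

random.seed(42)
SIMS, BATCH, PEEKS, RATE = 200, 500, 10, 0.05
peek_fp = end_fp = 0  # false positives under each stopping policy
for _ in range(SIMS):
    ca = cb = n = 0
    any_sig = False
    for _ in range(PEEKS):
        # both arms draw from the SAME conversion rate (an A/A test)
        ca += sum(random.random() < RATE for _ in range(BATCH))
        cb += sum(random.random() < RATE for _ in range(BATCH))
        n += BATCH
        if abs(z_score(ca, n, cb, n)) > 1.96:
            any_sig = True  # a peeker would have stopped and "shipped" here
    peek_fp += any_sig
    end_fp += abs(z_score(ca, n, cb, n)) > 1.96  # fixed-horizon check

print(f"stop-at-first-significant false positive rate: {peek_fp / SIMS:.0%}")
print(f"fixed-horizon false positive rate: {end_fp / SIMS:.0%}")
```

The fixed-horizon rate hovers near the nominal 5%, while the peeking policy's rate is a superset of it and climbs with the number of peeks.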

Analyzing Results

Statistical Significance

Statistical significance tells you whether the observed difference is likely real or due to random chance. A p-value below 0.05 means that, if there were truly no difference between the versions, a result at least this extreme would occur less than 5% of the time.

But statistical significance does not mean practical significance. A test might show a statistically significant 0.1% improvement — which may not be worth the engineering effort to implement.
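As a sketch, the significance check for two conversion rates can be done with a pooled two-proportion z-test. The counts below are hypothetical:

```python
import math

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# Hypothetical test: control 400/10,000 (4.0%), variant 470/10,000 (4.7%)
p = two_proportion_p_value(400, 10_000, 470, 10_000)
print(f"p = {p:.4f}")  # below 0.05, so significant at the 95% level
```

Significance alone does not settle the decision; you still weigh the size of the lift against implementation cost and the guardrail metrics.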

Reading Results

Scenario                          Action
Variant wins with significance    Ship the variant
Control wins with significance    Keep the control; learn from the failure
No significant difference         Keep the control (it is simpler); the change did not matter
Guardrail metric degraded         Do not ship, even if the primary metric improved
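The decision table above can be encoded as a simple rule. This is a sketch; in practice the three flags would come out of your significance analysis:

```python
def decide(variant_significantly_better: bool,
           control_significantly_better: bool,
           guardrails_healthy: bool) -> str:
    """Turn test outcomes into a ship/keep decision."""
    if not guardrails_healthy:
        return "keep control"  # never ship on a degraded guardrail
    if variant_significantly_better:
        return "ship variant"
    # control wins, or no significant difference: keep the simpler option
    return "keep control"

print(decide(True, False, True))   # clean win: ship
print(decide(True, False, False))  # win, but guardrail degraded: keep control
print(decide(False, False, True))  # no difference: keep control
```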

Common Pitfalls

  • Multiple testing problem — testing 10 variations inflates false positives. Use Bonferroni correction or a multi-armed bandit approach.
  • Segment fishing — after a test fails, looking for a segment where it worked ("It worked for users aged 25-34 on iOS!") is data mining, not science. Pre-define segments.
  • Network effects — if users interact with each other (social apps), A/B tests can leak between groups.
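The Bonferroni correction from the first pitfall is just a stricter per-comparison threshold (a minimal sketch):

```python
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Split the overall significance level across all comparisons,
    so the family-wide false positive rate stays at alpha."""
    return alpha / num_comparisons

# Testing 10 variations against control at an overall alpha of 0.05:
print(f"{bonferroni_alpha(0.05, 10):.4f}")  # each comparison needs p < 0.0050
```

The cost is power: the more variations you test at once, the harder it becomes for any single one to clear the bar, which is another argument for fewer, bolder variants.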

Analytics as Research

A/B testing is hypothesis-driven — you need a hypothesis before you can run a test. Analytics helps you discover what to test:

Funnel Analysis

Map the user journey and measure drop-off at each step:

Landing page: 10,000 visitors (100%)
   Product page: 3,500 (35%)
     Add to cart: 1,200 (12%)
       Checkout: 800 (8%)
         Purchase: 400 (4%)

The biggest drop-off (landing to product: 65% loss) is your biggest opportunity.
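The drop-off arithmetic can be computed directly from the step counts (the hypothetical numbers from the funnel above). Note that "biggest" here means the most users lost in absolute terms; by percentage of the previous step, product-to-cart is nearly as severe:

```python
# (step name, users reaching that step) -- hypothetical funnel from above
funnel = [
    ("Landing page", 10_000),
    ("Product page", 3_500),
    ("Add to cart", 1_200),
    ("Checkout", 800),
    ("Purchase", 400),
]

# Users lost at each transition, with the share of the previous step lost
for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
    lost = count - next_count
    print(f"{step} -> {next_step}: lost {lost} users ({lost / count:.0%})")

# The transition with the largest absolute loss is the biggest opportunity
worst = max(zip(funnel, funnel[1:]), key=lambda pair: pair[0][1] - pair[1][1])
print(f"Biggest absolute loss: {worst[0][0]} -> {worst[1][0]}")
```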

Heatmaps and Session Recordings

Tools like Hotjar and FullStory show where users click, scroll, and get stuck. Use them to generate hypotheses:

  • Users are clicking a non-clickable element — make it clickable or change the styling
  • Users are not scrolling past the hero section — move important content up
  • Users are rage-clicking a button — it might appear unresponsive

Cohort Analysis

Compare groups of users based on when they signed up or what feature they used. This reveals:

  • Whether onboarding changes improved retention for new cohorts
  • Which features correlate with long-term engagement
  • Whether a bug affected a specific time period
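A minimal cohort-retention computation might look like this. The signup data is hypothetical; a real pipeline would pull these rows from your analytics warehouse:

```python
from collections import defaultdict

# (signup_month, still_active_after_30_days) per user -- hypothetical data
users = [
    ("2024-01", True), ("2024-01", False), ("2024-01", False), ("2024-01", True),
    ("2024-02", True), ("2024-02", True), ("2024-02", False), ("2024-02", True),
]

# Group users by the month they signed up
cohorts: dict[str, list[bool]] = defaultdict(list)
for month, active in users:
    cohorts[month].append(active)

# 30-day retention rate per signup cohort
retention = {month: sum(flags) / len(flags) for month, flags in cohorts.items()}
for month, rate in sorted(retention.items()):
    print(f"{month}: {rate:.0%} retained")
```

If the February cohort retains better than January's and an onboarding change shipped in between, that is a signal (not proof) the change helped; confirming it would take an A/B test.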

Building a Testing Culture

A/B testing is most powerful when it is a habit, not a one-time event:

  • Log all tests — keep a shared document with hypothesis, results, and learnings
  • Share results widely — even failed tests teach the team something
  • Iterate — a failed test refines your understanding. Use the insight for the next test
  • Test big changes first — micro-optimizations (button color) rarely move metrics. Test structural changes (different flows, different value propositions)