A/B testing lets you compare two versions of a design with real users and real data. Instead of debating which button color works better, you measure it. Done well, A/B testing replaces opinion with evidence.
How A/B Testing Works
An A/B test splits your traffic between two (or more) versions:
- Control (A) — the current version
- Variant (B) — the modified version
Users are randomly assigned to one version. You measure a specific metric for each group and determine whether the difference is statistically significant.
The process:
- Form a hypothesis
- Define the metric you will measure
- Calculate the required sample size
- Run the test until you reach that sample size
- Analyze the results
- Make a decision
Writing Hypotheses
A strong hypothesis connects a change to an expected outcome with a reason:
Template: "If we [change], then [metric] will [improve/decrease] because [reason]."
Examples:
- "If we move the CTA button above the fold, then click-through rate will increase because users will see it without scrolling."
- "If we simplify the checkout form from 5 fields to 3, then completion rate will increase because there is less friction."
- "If we add social proof (review count) to product cards, then add-to-cart rate will increase because users trust products that others have purchased."
A hypothesis without a reason is just a guess. The reason forces you to think about why the change should work, which helps you interpret results correctly.
Choosing Metrics
Primary Metric
Every test needs one primary metric — the single number that determines success or failure. Common primary metrics:
| Goal | Primary Metric |
|---|---|
| More signups | Signup conversion rate |
| More purchases | Purchase conversion rate |
| Better engagement | Time on page or pages per session |
| Less friction | Task completion rate |
| Revenue growth | Revenue per visitor |
Guardrail Metrics
Guardrail metrics ensure your change does not cause unintended harm:
- If you are optimizing signup rate, guard against reduced activation rate (people signing up but never using the product)
- If you are optimizing click-through rate, guard against increased bounce rate on the next page
- If you are optimizing revenue per visitor, guard against decreased customer satisfaction
Avoid Vanity Metrics
- Page views — can be inflated by confusion (users clicking around lost)
- Time on site — can increase because users are struggling, not engaged
- Number of clicks — more clicks can mean worse navigation
Always tie metrics to a business outcome. "Conversion rate" is meaningful. "Number of hover events" is not.
Sample Size & Duration
Running a test for too short a period or with too few users produces unreliable results: underpowered tests miss real effects, and stopping at the first "significant" reading inflates false positives. Use a sample size calculator before starting:
The inputs you need:
- Baseline conversion rate — your current metric value (e.g., 3% conversion)
- Minimum detectable effect (MDE) — the smallest improvement worth detecting (e.g., 10% relative improvement, meaning 3% to 3.3%)
- Statistical significance level — typically 95% (alpha = 0.05)
- Statistical power — typically 80% (beta = 0.20)
For a baseline of 3% and an MDE of 10% relative, the standard formula gives roughly 53,000 visitors per variation at 95% significance and 80% power. This is why A/B testing requires meaningful traffic.
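As a sketch, the per-variant sample size can be computed with the normal-approximation formula for a two-proportion z-test. The function name and defaults here are illustrative, and different calculators use slightly different methods, so exact numbers vary:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 10% relative MDE (3% -> 3.3%)
print(sample_size_per_variant(0.03, 0.10))
```

Note how the required sample size explodes as the MDE shrinks: halving the detectable effect roughly quadruples the traffic you need.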
Duration Rules
- Minimum 1 week — to capture day-of-week effects
- Maximum 4 weeks — longer tests risk external factors skewing results
- Never peek and stop early — checking results daily and stopping when "significant" inflates your false positive rate from 5% to as high as 30%
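To see why peeking is dangerous, here is a small simulation of an A/A test: both arms share the same true conversion rate, so every "significant" result is by definition a false positive. The traffic numbers are made up for illustration:

```python
import random
from math import erf, sqrt

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
DAYS, USERS_PER_DAY, RATE, SIMS = 14, 500, 0.03, 400

peeked = fixed = 0
for _ in range(SIMS):
    a = b = n = 0
    ever_significant = False
    for _day in range(DAYS):
        n += USERS_PER_DAY
        a += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        b += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        # a peeker checks every day and stops at the first p < 0.05
        if p_value(a, n, b, n) < 0.05:
            ever_significant = True
    peeked += ever_significant
    # the disciplined tester looks only once, at the planned end
    fixed += p_value(a, n, b, n) < 0.05

print(f"daily peeking:  {peeked / SIMS:.0%} false positives")
print(f"one final look: {fixed / SIMS:.0%} false positives")
```

The single-look rate stays near the nominal 5%, while the daily-peeking rate climbs well above it, because fourteen looks give randomness fourteen chances to cross the threshold.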
Analyzing Results
Statistical Significance
Statistical significance tells you whether the observed difference is likely real or due to random chance. A p-value below 0.05 means that, if there were truly no difference between the versions, a difference this large would appear less than 5% of the time.
But statistical significance does not mean practical significance. A test might show a statistically significant 0.1% improvement — which may not be worth the engineering effort to implement.
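A minimal sketch of checking both kinds of significance with a pooled two-proportion z-test; the conversion counts below are hypothetical:

```python
from math import erf, sqrt

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for raw conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p

# Hypothetical counts: control converts 1,500/50,000, variant 1,675/50,000
lift, p = two_proportion_test(1500, 50_000, 1675, 50_000)
print(f"lift: {lift:+.2%}, p-value: {p:.4f}")
```

The p-value answers "is this real?"; the lift answers "is this worth shipping?". Judge both before acting.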
Reading Results
| Scenario | Action |
|---|---|
| Variant wins with significance | Ship the variant |
| Control wins with significance | Keep the control, learn from the failure |
| No significant difference | Keep the control (simpler). The change did not matter. |
| Guardrail metric degraded | Do not ship, even if primary metric improved |
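The table above can be read as a small decision rule. This helper is purely illustrative, but it makes one property explicit: a degraded guardrail vetoes everything else:

```python
def decide(primary_significant: bool, variant_improved: bool,
           guardrails_ok: bool) -> str:
    """Decision rule mirroring the results table: guardrails veto first,
    then the variant must win with significance; otherwise keep control."""
    if not guardrails_ok:
        return "do not ship"
    if primary_significant and variant_improved:
        return "ship variant"
    return "keep control"

print(decide(True, True, False))  # guardrail degraded -> "do not ship"
```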
Common Pitfalls
- Multiple testing problem — testing 10 variations inflates false positives. Use Bonferroni correction or a multi-armed bandit approach.
- Segment fishing — after a test fails, looking for a segment where it worked ("It worked for users aged 25-34 on iOS!") is data mining, not science. Pre-define segments.
- Network effects — if users interact with each other (social apps), A/B tests can leak between groups.
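Bonferroni correction is simple to apply: divide your significance level by the number of comparisons. A sketch with hypothetical p-values from five variants tested against one control:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons survive a Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from five variant-vs-control comparisons
p_values = [0.030, 0.004, 0.200, 0.011, 0.048]
print(bonferroni_significant(p_values))
# threshold is 0.05 / 5 = 0.01, so only the 0.004 result survives
```

Note that 0.030 and 0.048 would both have looked "significant" in isolation; the correction is what keeps ten-variant tests honest.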
Analytics as Research
A/B testing is reactive — you need a hypothesis first. Analytics helps you discover what to test:
Funnel Analysis
Map the user journey and measure drop-off at each step:
```
Landing page: 10,000 visitors (100%)
  → Product page: 3,500 (35%)
  → Add to cart: 1,200 (12%)
  → Checkout: 800 (8%)
  → Purchase: 400 (4%)
```

The biggest drop-off (landing to product: 65% loss) is your biggest opportunity.
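The drop-off math above is worth automating, since each step can be read two ways: share of total traffic remaining, and share lost at that specific step. A sketch using the same counts:

```python
# Step counts from the funnel above
funnel = [
    ("Landing page", 10_000),
    ("Product page", 3_500),
    ("Add to cart", 1_200),
    ("Checkout", 800),
    ("Purchase", 400),
]

top = funnel[0][1]
step_loss = {}
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    step_loss[name] = 1 - n / prev_n
    print(f"{prev_name} -> {name}: {n / top:.0%} of total remain, "
          f"{step_loss[name]:.0%} lost at this step")
```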
Heatmaps and Session Recordings
Tools like Hotjar and FullStory show where users click, scroll, and get stuck. Use them to generate hypotheses:
- Users are clicking a non-clickable element — make it clickable or change the styling
- Users are not scrolling past the hero section — move important content up
- Users are rage-clicking a button — it might appear unresponsive
Cohort Analysis
Compare groups of users based on when they signed up or what feature they used. This reveals:
- Whether onboarding changes improved retention for new cohorts
- Which features correlate with long-term engagement
- Whether a bug affected a specific time period
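A cohort retention table can be built from a simple activity log. The users, cohorts, and weeks below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, signup cohort, weeks the user was active)
events = [
    ("u1", "2024-01", [0, 1, 2]),
    ("u2", "2024-01", [0]),
    ("u3", "2024-01", [0, 1]),
    ("u4", "2024-02", [0, 1, 2, 3]),
    ("u5", "2024-02", [0, 2]),
]

cohort_sizes = defaultdict(int)
active = defaultdict(lambda: defaultdict(int))
for _user, cohort, weeks in events:
    cohort_sizes[cohort] += 1
    for week in weeks:
        active[cohort][week] += 1

# Each row: share of the cohort still active in weeks 0-3
for cohort in sorted(cohort_sizes):
    row = " ".join(f"{active[cohort][w] / cohort_sizes[cohort]:>4.0%}"
                   for w in range(4))
    print(cohort, row)
```

Reading rows side by side is what reveals whether, say, an onboarding change shipped in February actually improved week-2 retention over the January cohort.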
Building a Testing Culture
A/B testing is most powerful when it is a habit, not a one-time event:
- Log all tests — keep a shared document with hypothesis, results, and learnings
- Share results widely — even failed tests teach the team something
- Iterate — a failed test refines your understanding. Use the insight for the next test
- Test big changes first — micro-optimizations (button color) rarely move metrics. Test structural changes (different flows, different value propositions)