Flaky Tests: Root Cause Analysis and the 5-Step Fix

Flaky tests are the most demoralizing problem in test automation. They pass on one run, fail on the next, and developers start ignoring red builds—which defeats the entire purpose of having a test suite. According to the 2026 State of Testing Report, flaky tests are the number one reason engineering teams lose confidence in their CI pipelines, with 61% of teams reporting they routinely merge code despite failing tests because "it's probably just flaky."

That is a catastrophic erosion of quality assurance.

This post diagnoses the five root causes of flaky tests and provides concrete, testable fixes for each.

What Makes a Test "Flaky"?

A test is flaky when it produces non-deterministic results from the same codebase. The code under test has not changed, the test code has not changed, yet the outcome alternates between pass and fail. The cause is always a hidden dependency on something outside the test's explicit control:

FLAKINESS TAXONOMY:
┌────────────────────────────────────────┐
│         Root Causes of Flakiness       │
├────────────┬───────────────────────────┤
│ Timing     │ Race conditions, missing  │
│            │ awaits, animation delays  │
├────────────┼───────────────────────────┤
│ State      │ Shared DB records, leaked │
│            │ auth sessions, global vars│
├────────────┼───────────────────────────┤
│ Network    │ External API calls, slow  │
│            │ responses, timeouts       │
├────────────┼───────────────────────────┤
│ Order Dep. │ Tests that assume other   │
│            │ tests ran first           │
├────────────┼───────────────────────────┤
│ Resource   │ Port conflicts, disk I/O  │
│            │ exhaustion in parallel CI │
└────────────┴───────────────────────────┘

Step 1: Identify and Quarantine the Flaky Test

Before fixing anything, isolate which tests are flaky and how often. Run your suite in a retry loop and collect failure statistics:

# Run the full Playwright suite 5 times and append each JSON report to one file
for i in {1..5}; do
  npx playwright test --reporter=json 2>/dev/null >> .flake-report.jsonl
done

# Count failed runs per test title, at any suite nesting depth
# (each spec is listed at most once per run; timeouts count as failures)
jq -r '.. | .specs? // empty | .[] | select(any(.tests[].results[]; .status == "failed" or .status == "timedOut")) | .title' .flake-report.jsonl | sort | uniq -c | sort -rn
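
Raw failure counts are easier to act on once each test gets a verdict. The sketch below is a minimal illustration, not part of Playwright: `classifyOutcomes`, `Verdict`, and `FlakeSummary` are hypothetical names, and the input is assumed to be one pass/fail boolean per run of the same test.

```typescript
// Hypothetical helper: turns the per-run outcomes of ONE test into a verdict.
// Assumption: `outcomes` holds one boolean per run (true = passed).
type Verdict = 'stable' | 'flaky' | 'broken' | 'unknown';

interface FlakeSummary {
  verdict: Verdict;
  failureRate: number; // failed runs / total runs (0 when there are no runs)
}

function classifyOutcomes(outcomes: boolean[]): FlakeSummary {
  if (outcomes.length === 0) {
    return { verdict: 'unknown', failureRate: 0 };
  }
  const failures = outcomes.filter(passed => !passed).length;
  const failureRate = failures / outcomes.length;

  if (failures === 0) return { verdict: 'stable', failureRate };
  // Failing on every run is deterministic, so the test is broken rather than flaky
  if (failures === outcomes.length) return { verdict: 'broken', failureRate };
  return { verdict: 'flaky', failureRate };
}

// One failure in five runs: verdict "flaky" with a failure rate of 0.2
console.log(classifyOutcomes([true, false, true, true, true]));
```

Separating "broken" from "flaky" matters because a test that fails on every run needs a code fix, not a quarantine.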

Playwright has no dedicated flakiness reporter, but its --repeat-each flag runs every test multiple times, which surfaces intermittent failures in the normal report:

# Run every test 10 times to surface intermittent failures
npx playwright test --repeat-each=10 --reporter=html tests/checkout.spec.ts

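Any test that fails in some of those 10 runs but passes in others is flaky; a test that fails every time is broken, not flaky. Quarantine flaky tests immediately: tag them @flaky and exclude that tag from the blocking CI run (for example, npx playwright test --grep-invert @flaky) so they do not block production deployments while you fix them.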
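
One way to wire up the quarantine in configuration, sketched under assumptions: the `@flaky` tag is this post's naming convention, and the `CI` environment variable is assumed to be set by your CI provider. Playwright's `grepInvert` config option skips tests that match the pattern.

```typescript
// playwright.config.ts (sketch, not a complete config)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // In CI, skip anything tagged @flaky so quarantined tests cannot block a deploy.
  // Locally the filter is off, so quarantined tests still run while you debug them.
  grepInvert: process.env.CI ? /@flaky/ : undefined,
});
```

A test joins the quarantine by carrying the tag, either in its title or, in Playwright 1.42 and later, through the `tag` option in the test's details object.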

Step 2: Fix Timing Flakiness (Most Common Cause)

Timing is responsible for approximately 40% of flaky tests. The classic symptom: the test passes locally but fails in CI where machines are slower.

Wrong: Fixed Sleep Delays

// ❌ Never use fixed timeouts — slow machines break this
await page.click('[data-testid="submit-btn"]');
await page.waitForTimeout(2000); // Hope 2 seconds is enough...
await expect(page.locator('[data-testid="success"]')).toBeVisible();

Right: Wait for State, Not Time

// ✅ Wait for the element state you actually care about.
// Start listening BEFORE the click so a fast response cannot arrive unobserved.
const responsePromise = page.waitForResponse(resp =>
  resp.url().includes('/api/submit') && resp.status() === 200
);
await page.click('[data-testid="submit-btn"]');
await responsePromise;

// The assertion retries until the element is visible or the timeout expires
await expect(page.locator('[data-testid="success"]')).toBeVisible();

Right: Use `waitForFunction` for Complex State

// ✅ Wait for a DOM condition rather than a timeout
await page.waitForFunction(() => {
  const spinner = document.querySelector('[data-testid="loading-spinner"]');
  return !spinner || spinner.getAttribute('aria-hidden') === 'true';
});
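
The principle behind these fixes is polling for a condition with a deadline instead of sleeping for a fixed duration. The sketch below is a framework-independent illustration of that idea. `waitUntil` and its option names are hypothetical and are not a Playwright API; Playwright's own waits and assertions already behave this way.

```typescript
interface WaitOptions {
  timeoutMs?: number;  // give up after this long
  intervalMs?: number; // pause between checks
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical helper: resolves as soon as `condition` returns true,
// rejects if it is still false once the deadline has passed.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 }: WaitOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // Check first, then sleep: a condition that is already true returns immediately
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
}

// Usage: the work finishes after a delay the caller does not need to know
const job = { done: false };
setTimeout(() => { job.done = true; }, 120);
waitUntil(() => job.done, { timeoutMs: 2000 }).then(() => console.log('ready'));
```

On a fast machine this returns shortly after 120 ms; on a slow one it simply waits longer, up to the deadline. A fixed sleep is either too short or wastefully long.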

Step 3: Fix State Isolation (Second Most Common Cause)

Tests that share database records, cookies, or global state are ordering-dependent and inherently flaky. Each test must set up its own state and tear it down after.

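// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Playwright already gives every test its own browser context
  // (separate cookies, localStorage, and session state), so browser
  // state does not leak between tests unless you share it on purpose.

  // Workers only control parallelism. They do not isolate database or
  // server-side state, which each test still has to manage itself.
  workers: process.env.CI ? 2 : 4,
});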

For database state, use a factory pattern that creates and cleans up test records:

// tests/fixtures/user.factory.ts
import { db } from '@/lib/db';
import { randomUUID } from 'crypto';

export async function createTestUser(overrides = {}) {
  const user = {
    id: randomUUID(),
    email: `test-${randomUUID()}@example.com`,
    role: 'USER',
    ...overrides,
  };

  await db.query(
    'INSERT INTO users (id, email, role) VALUES ($1, $2, $3)',
    [user.id, user.email, user.role]
  );

  // Return cleanup function
  return {
    user,
    cleanup: () => db.query('DELETE FROM users WHERE id = $1', [user.id]),
  };
}

// Usage in test
test('admin can delete user', async ({ page }) => {
  const { user, cleanup } = await createTestUser({ role: 'USER' });

  try {
    await page.goto(`/admin/users/${user.id}`);
    await page.getByRole('button', { name: 'Delete User' }).click();
    await expect(page.getByText('User deleted successfully')).toBeVisible();
  } finally {
    await cleanup(); // Always runs, even on test failure
  }
});
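
When a test seeds several records, one `try`/`finally` per record gets unwieldy. A small registry that collects cleanup functions and runs them in reverse order is one way to keep teardown in a single place. This is a sketch under assumptions: `CleanupStack` and its method names are hypothetical, not part of Playwright or of the factory above.

```typescript
type Cleanup = () => void | Promise<void>;

// Hypothetical helper: collects cleanup functions and runs them newest-first.
class CleanupStack {
  private readonly cleanups: Cleanup[] = [];

  // Register a cleanup as soon as the resource it removes exists
  defer(cleanup: Cleanup): void {
    this.cleanups.push(cleanup);
  }

  // Runs every cleanup in reverse registration order, so dependent records
  // are removed before their parents. A failing cleanup does not stop the
  // rest; the first error is rethrown once all cleanups have run.
  async runAll(): Promise<void> {
    let failed = false;
    let firstError: unknown;
    while (this.cleanups.length > 0) {
      const cleanup = this.cleanups.pop()!;
      try {
        await cleanup();
      } catch (err) {
        if (!failed) {
          failed = true;
          firstError = err;
        }
      }
    }
    if (failed) throw firstError;
  }
}
```

In a test, call `stack.defer(cleanup)` after each factory call and `await stack.runAll()` in a single `finally` block. Playwright fixtures defined with `test.extend` give a similar guarantee, because fixture teardown runs whether the test passes or fails.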

Step 4: Fix Network Flakiness

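External API calls in tests are flaky by definition. Third-party services have rate limits, outages, and variable latency. The fix is always the same: mock the network. Keep in mind that page.route only intercepts requests made by the browser; calls that your own server makes to a third party have to be stubbed on the server side.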

// ✅ Intercept and mock all external API calls
test('displays payment success', async ({ page }) => {
  // Intercept the Stripe API call before navigating
  await page.route('**/api.stripe.com/**', route =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({
        id: 'pi_test_123',
        status: 'succeeded',
      }),
    })
  );

  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Pay Now' }).click();
  await expect(page.getByRole('heading', { name: 'Payment Confirmed' })).toBeVisible();
});

For tests that genuinely need to exercise external integrations, move them into a dedicated Playwright project (for example, one named integration), select it with the --project=integration flag, and run it nightly, not on every PR.
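
The --project flag only works if a project with that name exists in the config. A minimal sketch of such a split follows; the project name `pr` and the `.integration.spec.ts` file suffix are assumptions chosen for illustration.

```typescript
// playwright.config.ts (sketch, not a complete config)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      // Runs on every PR: everything except the live-integration specs
      name: 'pr',
      testIgnore: /.*\.integration\.spec\.ts/,
    },
    {
      // Runs nightly against the real third-party services
      name: 'integration',
      testMatch: /.*\.integration\.spec\.ts/,
    },
  ],
});
```

With this layout the PR pipeline would run `npx playwright test --project=pr`, and the nightly job would run `npx playwright test --project=integration`.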

Step 5: Fix Test Order Dependencies

Tests that rely on other tests having run first are the hardest flakiness to diagnose because they only fail when test parallelism or test ordering changes.

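# Playwright has no built-in random-order flag, so vary what runs together instead.
# Baseline: run the whole suite serially in the default order
npx playwright test --workers=1

# Run the suspect test alone; if it fails only in isolation, it depends on an earlier test
npx playwright test tests/your-suspect-test.spec.ts -g "can add product to cart"

# Run everything fully parallel; new failures point to shared state
npx playwright test --fully-parallel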
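
The mechanism is easy to reproduce without a browser. In the sketch below, all names (`runInOrder`, `makeSuite`, `sharedCatalog`) are hypothetical, and `sharedCatalog` stands in for any state that outlives a single test, such as database rows or globals.

```typescript
// A minimal, framework-free reproduction of an order dependency.
type TestCase = { name: string; run: () => void };

// Runs tests in the given order and records pass/fail per test name
function runInOrder(tests: TestCase[]): Record<string, 'pass' | 'fail'> {
  const results: Record<string, 'pass' | 'fail'> = {};
  for (const t of tests) {
    try {
      t.run();
      results[t.name] = 'pass';
    } catch {
      results[t.name] = 'fail';
    }
  }
  return results;
}

function makeSuite(): TestCase[] {
  const sharedCatalog = new Set<string>(); // fresh shared state for each suite run
  return [
    { name: 'creates product', run: () => { sharedCatalog.add('123'); } },
    {
      name: 'adds product to cart',
      run: () => {
        // Hidden assumption: 'creates product' has already run
        if (!sharedCatalog.has('123')) throw new Error('product 123 not found');
      },
    },
  ];
}

const forward = runInOrder(makeSuite());
const reversed = runInOrder(makeSuite().reverse());
console.log(forward['adds product to cart'], reversed['adds product to cart']); // prints "pass fail"
```

Neither test changed between the two runs; only the order did, which is why these failures tend to appear after a change to parallelism or sharding rather than after a code change.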

The fix is to ensure every test is fully self-contained:

// ❌ Order-dependent: assumes "setup" test ran first and created the product
test('can add product to cart', async ({ page }) => {
  await page.goto('/products/123'); // Will fail if product 123 doesn't exist
  await page.getByRole('button', { name: 'Add to Cart' }).click();
});

// ✅ Self-contained: seeds its own data
test('can add product to cart', async ({ page }) => {
  const { product, cleanup } = await createTestProduct({
    name: 'Test Widget',
    price: 29.99,
  });

  try {
    await page.goto(`/products/${product.id}`);
    await page.getByRole('button', { name: 'Add to Cart' }).click();
    await expect(page.getByRole('status')).toContainText('Added to cart');
  } finally {
    await cleanup();
  }
});

The Flakiness Scorecard

Before marking a flaky test as "fixed", run this verification:

# It must pass 20 consecutive runs with no failures
npx playwright test tests/your-fixed-test.spec.ts --repeat-each=20

# It must pass in a clean CI environment (no local caches).
# The image version should match your installed @playwright/test version.
docker run --rm -v "$(pwd)":/app -w /app mcr.microsoft.com/playwright:v1.44.0-jammy \
  npx playwright test tests/your-fixed-test.spec.ts --repeat-each=10

If it passes both checks, the 20 consecutive local runs and the 10 runs in Docker, treat it as fixed. Remove the @flaky quarantine tag and re-enable it in the CI gate.
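
A clean streak is strong evidence rather than proof, and a little arithmetic shows how strong. The sketch below assumes that runs are independent and that the test has a fixed per-run failure rate, which real CI runs only approximate. The function names are hypothetical.

```typescript
// Probability that a test with per-run failure rate `p` passes `runs` times in a row.
function probAllPass(p: number, runs: number): number {
  return Math.pow(1 - p, runs);
}

// Smallest number of consecutive passing runs such that a test with failure
// rate `p` would have failed at least once with probability >= `confidence`.
function runsNeeded(p: number, confidence: number): number {
  if (p <= 0 || p >= 1 || confidence <= 0 || confidence >= 1) {
    throw new RangeError('p and confidence must be strictly between 0 and 1');
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}

console.log(probAllPass(0.05, 20).toFixed(3)); // "0.358"
console.log(runsNeeded(0.05, 0.95));           // 59
```

Under these assumptions, a test that still fails 5% of the time survives 20 consecutive runs about 36% of the time. If the original failure was rare, raise --repeat-each accordingly before removing the quarantine tag.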

Conclusion

Flaky tests are not a testing problem—they are a trust problem. Every time a developer clicks "re-run" and the test magically passes, your CI pipeline loses credibility. The fix is systematic: identify flaky tests with retry statistics, diagnose their root cause using the five-category taxonomy, and apply the targeted fix. The goal is a test suite that everyone on the team trusts unconditionally, where a red build means one thing: the code is broken.