Test Data Management: Seeding, Scrubbing, and Factories at Scale

Of all the engineering disciplines that enable reliable automated testing, test data management is the least glamorous and the most often neglected. Teams spend months building sophisticated test automation frameworks only to discover that their tests are unreliable because test data is shared between runs, polluted by previous failures, or riddled with real user PII scraped from production.

Test data management (TDM) is the practice of ensuring that every test run gets clean, correct, predictable data. Done well, it is invisible. Done poorly, it is the root cause of 30–50% of your test failures.

The Three Test Data Problems

Problem 1: Shared State Pollution

When tests share database records, one test's side effects leak into another test's assertions. A test that creates a user named "Test User" affects every subsequent test that queries for all users.
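
The failure mode can be reproduced in a few lines. This is a minimal sketch that uses an in-memory array as a stand-in for a shared users table; `sharedUsers` and both "test" functions are hypothetical names chosen for illustration:

```typescript
// Hypothetical in-memory stand-in for a users table shared by every test.
interface User {
  id: number;
  name: string;
}
const sharedUsers: User[] = [];

// "Test" A creates a record and never removes it.
function testCreatesUser(): void {
  sharedUsers.push({ id: sharedUsers.length + 1, name: 'Test User' });
}

// "Test" B assumes it starts from an empty table.
function testListIsEmpty(): boolean {
  return sharedUsers.length === 0;
}

const passesAlone: boolean = testListIsEmpty(); // B passes when it runs first
testCreatesUser();
const passesAfterA: boolean = testListIsEmpty(); // B fails once A's record has leaked in
console.log({ passesAlone, passesAfterA });
```

Test B is correct in isolation and broken by execution order, which is why this kind of failure tends to appear only under parallel or shuffled runs.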

Problem 2: Environment Drift

The schema and data in your test database gradually diverge from production. Tests keep passing against a stale schema that no longer represents the real system.
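
One way to make drift visible is to compare column snapshots from the two environments. The sketch below assumes each snapshot is a simple map of column name to data type (for example, built from `information_schema.columns`); `diffSchemas` and the sample snapshots are hypothetical:

```typescript
// Column name -> data type for one table.
type TableSchema = Record<string, string>;

interface SchemaDrift {
  missingInTest: string[]; // columns production has but the test DB lacks
  extraInTest: string[]; // columns only the test DB has
  typeMismatches: string[]; // columns present in both with different types
}

const hasColumn = (schema: TableSchema, column: string): boolean =>
  Object.prototype.hasOwnProperty.call(schema, column);

function diffSchemas(prod: TableSchema, test: TableSchema): SchemaDrift {
  const missingInTest = Object.keys(prod).filter((c) => !hasColumn(test, c));
  const extraInTest = Object.keys(test).filter((c) => !hasColumn(prod, c));
  const typeMismatches = Object.keys(prod).filter(
    (c) => hasColumn(test, c) && prod[c] !== test[c],
  );
  return { missingInTest, extraInTest, typeMismatches };
}

// Hypothetical snapshots of the users table in each environment.
const prodUsers: TableSchema = { id: 'uuid', email: 'text', plan: 'text', phone: 'text' };
const testUsers: TableSchema = { id: 'uuid', email: 'varchar', plan: 'text' };

const drift = diffSchemas(prodUsers, testUsers);
console.log(drift);
```

Failing the CI run when any of the three lists is non-empty turns silent drift into an explicit error.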

Problem 3: PII in Test Data

Copying real production data into staging environments can violate GDPR and CCPA, and it breaks basic security hygiene in any jurisdiction. Real user emails, phone numbers, and payment data have no place in a test database.
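
A cheap guard is to flag any email address in a test dataset that does not sit on a reserved documentation domain. This is a rough sketch, not a complete PII detector; `SAFE_DOMAINS`, `isSafeTestEmail`, and `findSuspectEmails` are hypothetical names, and the domain list is an assumption to adapt to your own conventions:

```typescript
// Domains reserved for documentation and testing (RFC 2606), treated here as safe.
const SAFE_DOMAINS = ['example.com', 'example.org', 'example.net'];

function isSafeTestEmail(email: string): boolean {
  const domain = email.split('@').pop()?.toLowerCase() ?? '';
  // Accept the reserved domains themselves and any subdomain of them.
  return SAFE_DOMAINS.some((safe) => domain === safe || domain.endsWith(`.${safe}`));
}

// Returns every address that looks like it could belong to a real person.
function findSuspectEmails(emails: string[]): string[] {
  return emails.filter((email) => !isSafeTestEmail(email));
}

const suspects = findSuspectEmails([
  'admin@test.example.com',
  'jane.smith@gmail.com',
  'qa@example.org',
]);
console.log(suspects);
```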

Solution 1: The Factory Pattern

A factory function creates isolated, randomized test records and returns a cleanup function. No two test runs share data. Each record is unique by design:

// tests/factories/index.ts
import { db } from '@/lib/db';
import { randomUUID } from 'crypto';
import { faker } from '@faker-js/faker';

// ---- User Factory ----
interface UserOptions {
  role?: 'USER' | 'ADMIN' | 'MODERATOR';
  isVerified?: boolean;
  plan?: 'free' | 'pro' | 'enterprise';
}

export async function createUser(options: UserOptions = {}) {
  const user = {
    id: randomUUID(),
    email: faker.internet.email(),          // random name and provider, not an example.com address
    name: faker.person.fullName(),          // e.g., Jane Smith
    role: options.role ?? 'USER',
    isVerified: options.isVerified ?? true,
    plan: options.plan ?? 'free',
    createdAt: new Date(),
  };

  await db.query(
    `INSERT INTO users (id, email, name, role, is_verified, plan, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
    [user.id, user.email, user.name, user.role, user.isVerified, user.plan, user.createdAt]
  );

  return {
    user,
    cleanup: async () => {
      await db.query('DELETE FROM users WHERE id = $1', [user.id]);
    },
  };
}

// ---- Product Factory ----
interface ProductOptions {
  price?: number;
  inStock?: boolean;
  category?: string;
}

export async function createProduct(options: ProductOptions = {}) {
  const product = {
    id: randomUUID(),
    name: faker.commerce.productName(),
    description: faker.commerce.productDescription(),
    price: options.price ?? faker.number.float({ min: 5, max: 500, fractionDigits: 2 }),
    inStock: options.inStock ?? true,
    category: options.category ?? faker.commerce.department(),
    createdAt: new Date(),
  };

  await db.query(
    `INSERT INTO products (id, name, description, price, in_stock, category, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
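    // List values explicitly so their order cannot drift from the column list
    [product.id, product.name, product.description, product.price,
     product.inStock, product.category, product.createdAt]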
  );

  return {
    product,
    cleanup: async () => {
      await db.query('DELETE FROM products WHERE id = $1', [product.id]);
    },
  };
}

Using factories in tests:

// tests/e2e/admin.spec.ts
import { test, expect } from '@playwright/test';
import { createUser, createProduct } from '../factories';

test('admin can delete a product', async ({ page }) => {
  // Each test creates its own isolated data
  const { user: admin, cleanup: cleanupUser } = await createUser({ role: 'ADMIN' });
  const { product, cleanup: cleanupProduct } = await createProduct({ inStock: true });

  try {
    await page.goto(`/admin/products/${product.id}`);
    // ... test actions ...
    await expect(page.getByText('Product deleted')).toBeVisible();
  } finally {
    // Always clean up every record, even if the test fails
    await cleanupProduct();
    await cleanupUser();
  }
});
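
Once a test creates more than a couple of records, tracking each cleanup by hand gets error-prone. One possible approach is a small helper that collects cleanups and runs them in reverse creation order. `CleanupStack` is a hypothetical helper, and `fakeFactory` stands in for `createUser` and `createProduct` so the sketch runs without a database:

```typescript
type Cleanup = () => Promise<void>;

// Hypothetical helper: collects cleanup callbacks and runs them last-in, first-out.
class CleanupStack {
  private readonly cleanups: Cleanup[] = [];

  // Register the cleanup returned by a factory and pass the result through.
  track<T extends { cleanup: Cleanup }>(created: T): T {
    this.cleanups.push(created.cleanup);
    return created;
  }

  // Run every cleanup in reverse order; keep going if one fails, then report.
  async runAll(): Promise<void> {
    let failures = 0;
    for (;;) {
      const cleanup = this.cleanups.pop();
      if (!cleanup) break;
      try {
        await cleanup();
      } catch {
        failures += 1;
      }
    }
    if (failures > 0) {
      throw new Error(`${failures} cleanup(s) failed`);
    }
  }
}

// Stub factory standing in for createUser/createProduct.
const cleanupLog: string[] = [];
const fakeFactory = (name: string) => ({
  name,
  cleanup: async (): Promise<void> => {
    cleanupLog.push(`deleted ${name}`);
  },
});

async function demo(): Promise<string[]> {
  const stack = new CleanupStack();
  try {
    stack.track(fakeFactory('user'));
    stack.track(fakeFactory('product'));
    // ... test actions ...
  } finally {
    await stack.runAll();
  }
  return cleanupLog;
}
```

Reverse order matters when later records reference earlier ones: rows created last are removed first, so foreign-key constraints are less likely to block the deletes.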

Solution 2: Deterministic Database Seeding

For tests that need a consistent baseline state (rather than isolated per-test data), use a deterministic seed script that always produces the same dataset:

// tests/seeds/baseline.ts
import { faker } from '@faker-js/faker';
import { db } from '@/lib/db';

export async function runBaselineSeed() {
  // Pin the seed inside the function so every call, not only the first
  // one after import, produces the same data
  faker.seed(42);

  console.log('Seeding baseline test data...');

  // Clear existing test data
  await db.query("DELETE FROM users WHERE email LIKE '%@test.example.com'");

  // Seed a fixed set of known users for tests that rely on specific emails
  const users = [
    { id: 'user-admin-001', email: 'admin@test.example.com', role: 'ADMIN' },
    { id: 'user-basic-001', email: 'user@test.example.com',  role: 'USER' },
    { id: 'user-basic-002', email: 'user2@test.example.com', role: 'USER' },
  ];

  for (const user of users) {
    await db.query(
      'INSERT INTO users (id, email, role) VALUES ($1, $2, $3) ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email, role = EXCLUDED.role',
      [user.id, user.email, user.role]
    );
  }

  // Seed randomized-but-reproducible product catalog
  for (let i = 0; i < 20; i++) {
    await db.query(
      'INSERT INTO products (id, name, price, in_stock) VALUES ($1, $2, $3, $4) ON CONFLICT (id) DO NOTHING',
      [
        `product-seed-${String(i).padStart(3, '0')}`,
        faker.commerce.productName(),
        faker.number.float({ min: 10, max: 200, fractionDigits: 2 }),
        faker.datatype.boolean(),
      ]
    );
  }

  console.log('Baseline seed complete.');
}
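
To see why a pinned seed makes "random" data reproducible, here is a sketch that uses a small mulberry32-style generator as a stand-in for faker. It is not faker's internal generator, and `seededRandom` and `generatePrices` are hypothetical names:

```typescript
// A small seeded pseudo-random generator returning floats in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generates prices between 10.00 and 200.00, rounded to cents.
function generatePrices(seed: number, count: number): number[] {
  const rand = seededRandom(seed); // a fresh generator per call, seeded identically
  return Array.from({ length: count }, () => Math.round((10 + rand() * 190) * 100) / 100);
}

const firstRun = generatePrices(42, 5);
const secondRun = generatePrices(42, 5);
console.log(JSON.stringify(firstRun) === JSON.stringify(secondRun)); // true
```

The sequence depends only on the seed and on the number and order of calls, so adding a new random field in the middle of a seed script shifts every value generated after it.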

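Run the seed from Playwright's globalSetup hook. The globalSetup option takes the path to a module whose default export is a function, not an inline function:

// tests/global-setup.ts
import { runBaselineSeed } from './seeds/baseline';

export default async function globalSetup() {
  await runBaselineSeed();
}

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './tests/global-setup.ts',
  // ...
});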

Solution 3: PII Scrubbing for Production Data Copies

When you need realistic test data at scale, a common approach is to take a copy of production data and scrub all personally identifiable information before loading it into staging. Run the script against the copy only, never against the production database itself:

// scripts/scrub-production-data.ts
import { db } from '@/lib/db';
import { faker } from '@faker-js/faker';

faker.seed(100); // Deterministic scrubbing

async function scrubProductionData() {
  console.log('Starting PII scrubbing...');

  // Fetch all users in a stable order so the seeded output is reproducible
  const users = await db.query<{ id: string }>('SELECT id FROM users ORDER BY id');

  // Update each user with fake but structurally valid data
  for (const { id } of users.rows) {
    await db.query(
      `UPDATE users SET
        email = $2,
        name = $3,
        phone = $4,
        address = $5,
        date_of_birth = $6
       WHERE id = $1`,
      [
        id,
        faker.internet.email(),
        faker.person.fullName(),
        faker.phone.number(),
        faker.location.streetAddress(),
        faker.date.birthdate({ min: 18, max: 80, mode: 'age' }),
      ]
    );
  }

  // Replace payment tokens with non-functional placeholders
  await db.query("UPDATE payment_methods SET stripe_token = 'tok_scrubbed_' || id");

  // Remove session tokens entirely
  await db.query('DELETE FROM sessions');

  console.log(`Scrubbed ${users.rows.length} user records.`);
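}

// Run the script and exit non-zero on failure so a pipeline step can detect it
scrubProductionData().catch((err) => {
  console.error('PII scrubbing failed:', err);
  process.exit(1);
});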
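
With a seeded generator, the fake value each user receives depends on the order in which rows are processed, and generated emails are not guaranteed to be unique. An alternative sketch derives the replacement directly from the row's id with a keyed hash, so the same id always maps to the same address. `pseudonymizeEmail`, `SCRUB_KEY`, and the `scrubbed.example.com` domain are assumptions for illustration:

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical scrub key; in practice it would come from a secret store, not source code.
const SCRUB_KEY = 'replace-me';

// Maps a user id to a stable, non-deliverable email address.
function pseudonymizeEmail(userId: string, key: string = SCRUB_KEY): string {
  const digest = createHmac('sha256', key).update(userId).digest('hex');
  // Keep 16 hex characters for readability; example.com is reserved by RFC 2606.
  return `user-${digest.slice(0, 16)}@scrubbed.example.com`;
}

const first = pseudonymizeEmail('user-1');
const again = pseudonymizeEmail('user-1');
console.log(first === again); // true
```

Because the key is required to reproduce the mapping, someone who sees only the scrubbed database cannot recompute which id produced which address without it.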

Test Data in CI: The Full Pipeline

CI TEST DATA PIPELINE:
┌─────────────────┐
│  Test Run Start │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Global Setup:  │  ← runBaselineSeed() — fixed known users
│  Seed Baseline  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Test Execution │  ← Each test creates its own factory data
│  (Parallel)     │    and cleans up in finally{}
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Global Teardown│  ← Clean up any orphaned factory records
│                 │    (safeguard for interrupted test runs)
└─────────────────┘
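
The final teardown step needs some way to recognize orphaned factory records. One possible sketch, assuming the factories stamp every row with a hypothetical `test_run_id` column, deletes whatever is still tagged with the current run. The `Queryable` interface, `SWEPT_TABLES`, and `sweepOrphans` are illustrative names, not part of any library:

```typescript
// Minimal query interface so the sketch does not depend on a specific driver.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rowCount: number }>;
}

// Tables to sweep (hypothetical list). Order matters if the tables reference each other.
const SWEPT_TABLES = ['products', 'users'] as const;

// Deletes every row tagged with the given run id and returns per-table counts.
async function sweepOrphans(db: Queryable, runId: string): Promise<Record<string, number>> {
  const deleted: Record<string, number> = {};
  for (const table of SWEPT_TABLES) {
    // Table names come from the fixed list above, never from user input.
    const result = await db.query(`DELETE FROM ${table} WHERE test_run_id = $1`, [runId]);
    deleted[table] = result.rowCount;
  }
  return deleted;
}

// Fake database that records the statements it receives.
const executed: string[] = [];
const fakeDb: Queryable = {
  async query(sql: string): Promise<{ rowCount: number }> {
    executed.push(sql);
    return { rowCount: 2 };
  },
};
```

In Playwright this function could be called from the module referenced by the `globalTeardown` option, which, like `globalSetup`, is configured as a path to a file.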

Conclusion

Test data management is the foundation everything else in your test suite rests on. Flaky tests, order-dependent tests, and false positives are most often symptoms of unmanaged test data. The factory pattern isolates every test in its own data sandbox. Deterministic seeding gives you a reliable baseline. PII scrubbing keeps you compliant and secure. Together, they transform your test data from a liability into a predictable, trustworthy testing infrastructure.