
Part 13 — AI-Powered Testing: Write Tests That Actually Test Something

The AI wrote your tests. They all pass. Your code coverage is 94%. Then production breaks. The problem isn't the AI — it's that most AI-generated tests are hollow: they run, they pass, and they catch nothing. Here's the framework for using AI to write tests that genuinely protect your code.

March 24, 2026
13 min read
#Testing · #TDD · #AI Testing · #Code Quality · #Vitest · #Jest · #Test Coverage · #AI Workflow · #Developer Productivity

AI Workflow · Module 13

AI-Powered Testing

"100% coverage means nothing if every test is a lie."

3 Failure Modes of AI-Generated Tests
1 Framework for tests that catch real bugs
AI-TDD Loop Red → Green → Check → Refine

A developer asks AI to generate tests for their new payment processing function. The AI produces 12 tests. All 12 pass. Coverage report shows 97%. The developer ships with confidence.

Three weeks later, a negative amount slips through and money goes out instead of in.

Looking back at the tests — they all checked that the function runs. Not one checked that it rejects invalid input. The AI wrote tests for the happy path and called it done. Nobody noticed because the tests were green.

This is the hollow test problem, and it's the most dangerous failure mode in AI-assisted development.


The Three Ways AI Generates Bad Tests

Before you can use AI for testing well, you need to recognise what bad AI tests look like. They come in three predictable patterns.

<div style="display: flex; gap: 16px; align-items: flex-start; background: rgba(239,68,68,0.08); border: 1px solid rgba(239,68,68,0.3); border-radius: 12px; padding: 18px;">
  <div style="background: #ef4444; color: #fff; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0; text-align: center;">1</div>
  <div>
    <div style="color: #f87171; font-weight: 700; font-size: 1rem; margin-bottom: 6px;">The Tautology Test</div>
    <div style="color: #fca5a5; font-size: 0.9rem; line-height: 1.7; margin-bottom: 10px;">The test calls the function and asserts the result equals what the function returns. It will always pass — even if the function is completely broken.</div>
    <div style="background: #0f172a; border-radius: 6px; padding: 10px; font-family: monospace; color: #94a3b8; font-size: 0.8rem; line-height: 1.7;">
      ❌ <span style="color: #f87171;">it('calculates total', () =&gt; &#123;<br/>
      &nbsp;&nbsp;const result = calculateTotal(items);<br/>
      &nbsp;&nbsp;expect(result).toBe(calculateTotal(items)); // always true<br/>
      &#125;)</span>
    </div>
  </div>
</div>

<div style="text-align: center; color: #334155; font-size: 1.2rem; padding: 2px 0;">↓</div>

<div style="display: flex; gap: 16px; align-items: flex-start; background: rgba(239,68,68,0.08); border: 1px solid rgba(239,68,68,0.3); border-radius: 12px; padding: 18px;">
  <div style="background: #ef4444; color: #fff; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0; text-align: center;">2</div>
  <div>
    <div style="color: #f87171; font-weight: 700; font-size: 1rem; margin-bottom: 6px;">The Happy-Path-Only Test</div>
    <div style="color: #fca5a5; font-size: 0.9rem; line-height: 1.7; margin-bottom: 10px;">Tests only the ideal scenario. No edge cases, no invalid input, no boundary conditions. AI defaults to this because it mirrors the implementation — and the implementation was written for the happy path.</div>
    <div style="background: #0f172a; border-radius: 6px; padding: 10px; font-family: monospace; color: #94a3b8; font-size: 0.8rem; line-height: 1.7;">
      ❌ <span style="color: #f87171;">// Tests only processPayment(100) — never tests 0, -1, null, NaN, Infinity</span>
    </div>
  </div>
</div>

<div style="text-align: center; color: #334155; font-size: 1.2rem; padding: 2px 0;">↓</div>

<div style="display: flex; gap: 16px; align-items: flex-start; background: rgba(239,68,68,0.08); border: 1px solid rgba(239,68,68,0.3); border-radius: 12px; padding: 18px;">
  <div style="background: #ef4444; color: #fff; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0; text-align: center;">3</div>
  <div>
    <div style="color: #f87171; font-weight: 700; font-size: 1rem; margin-bottom: 6px;">The Implementation Mirror</div>
    <div style="color: #fca5a5; font-size: 0.9rem; line-height: 1.7; margin-bottom: 10px;">The test was generated from the same code it's supposed to test. If the implementation has a bug, the test has the same bug. They fail together — which means they never actually catch each other.</div>
    <div style="background: #0f172a; border-radius: 6px; padding: 10px; font-family: monospace; color: #94a3b8; font-size: 0.8rem; line-height: 1.7;">
      ❌ <span style="color: #f87171;">// Tax rate is wrong in the implementation AND in the expected value<br/>
      // Test always passes. Bug ships.</span>
    </div>
  </div>
</div>

All three patterns produce green CI. None of them protect you.


The AI-TDD Loop — Test First, Then Generate

The solution to every hollow test pattern is the same: write the tests before the implementation. When you specify what the code must do before asking AI to write the code, the tests define reality — not the other way around.

This is classic TDD with AI as your implementation engine.

The AI-TDD Loop — 4 Steps

<div style="background: rgba(239,68,68,0.1); border: 1px solid rgba(239,68,68,0.35); border-radius: 10px; padding: 16px; display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #ef4444; color: #fff; font-weight: 900; font-size: 0.9rem; border-radius: 6px; padding: 6px 12px; flex-shrink: 0; min-width: 60px; text-align: center;">RED</div>
  <div>
    <div style="color: #f87171; font-weight: 600; margin-bottom: 4px;">You write the test (or spec the behavior)</div>
    <div style="color: #fca5a5; font-size: 0.85rem; line-height: 1.6;">Describe what the function must do. Include happy path AND edge cases. Tests should fail because the code doesn't exist yet. This is the most human part of the loop — you define the contract.</div>
  </div>
</div>

<div style="text-align: center; color: #334155; font-size: 1.2rem; padding: 2px 0;">↓</div>

<div style="background: rgba(34,197,94,0.1); border: 1px solid rgba(34,197,94,0.35); border-radius: 10px; padding: 16px; display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 0.9rem; border-radius: 6px; padding: 6px 12px; flex-shrink: 0; min-width: 60px; text-align: center;">GREEN</div>
  <div>
    <div style="color: #4ade80; font-weight: 600; margin-bottom: 4px;">AI writes the implementation</div>
    <div style="color: #86efac; font-size: 0.85rem; line-height: 1.6;">Give AI your failing tests as the spec. "Write an implementation that makes all these tests pass." The AI's goal is to satisfy your tests — not to write what it thinks you want.</div>
  </div>
</div>

<div style="text-align: center; color: #334155; font-size: 1.2rem; padding: 2px 0;">↓</div>

<div style="background: rgba(34,211,238,0.1); border: 1px solid rgba(34,211,238,0.35); border-radius: 10px; padding: 16px; display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #06b6d4; color: #000; font-weight: 900; font-size: 0.9rem; border-radius: 6px; padding: 6px 12px; flex-shrink: 0; min-width: 60px; text-align: center;">CHECK</div>
  <div>
    <div style="color: #22d3ee; font-weight: 600; margin-bottom: 4px;">You verify all tests pass AND are meaningful</div>
    <div style="color: #cffafe; font-size: 0.85rem; line-height: 1.6;">Run the test suite. If all pass — good. But also do the mutation check (below) to confirm the tests aren't hollow. This is where the hollow test patterns get caught.</div>
  </div>
</div>

<div style="text-align: center; color: #334155; font-size: 1.2rem; padding: 2px 0;">↓</div>

<div style="background: rgba(168,85,247,0.1); border: 1px solid rgba(168,85,247,0.35); border-radius: 10px; padding: 16px; display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #a855f7; color: #fff; font-weight: 900; font-size: 0.9rem; border-radius: 6px; padding: 6px 12px; flex-shrink: 0; min-width: 60px; text-align: center;">REFINE</div>
  <div>
    <div style="color: #c084fc; font-weight: 600; margin-bottom: 4px;">AI refactors, you validate nothing breaks</div>
    <div style="color: #e9d5ff; font-size: 0.85rem; line-height: 1.6;">Ask AI to improve the implementation (performance, readability, edge case handling) while keeping all tests green. Your tests are now the safety net for the refactor.</div>
  </div>
</div>

The 5-Part Test Prompt Framework

The quality of AI-generated tests is entirely determined by the quality of your prompt. Vague prompts produce hollow tests. The 5-part framework eliminates every ambiguity.

<div style="display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0;">1</div>
  <div>
    <div style="color: #4ade80; font-weight: 700; font-size: 0.95rem; margin-bottom: 4px;">The Function Contract</div>
    <div style="color: #86efac; font-size: 0.88rem; line-height: 1.7;">Paste the function signature, its TypeScript types, and a 1–2 sentence description of its purpose. <em>"Here is the function I need tests for: [paste signature + docstring]"</em></div>
  </div>
</div>

<div style="display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0;">2</div>
  <div>
    <div style="color: #4ade80; font-weight: 700; font-size: 0.95rem; margin-bottom: 4px;">The Happy Path Inputs</div>
    <div style="color: #86efac; font-size: 0.88rem; line-height: 1.7;">Give 2–3 concrete examples of valid inputs and their expected outputs. This anchors the AI to your domain logic. <em>"For input X, the correct output is Y."</em></div>
  </div>
</div>

<div style="display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0;">3</div>
  <div>
    <div style="color: #4ade80; font-weight: 700; font-size: 0.95rem; margin-bottom: 4px;">The Edge Cases — Explicit</div>
    <div style="color: #86efac; font-size: 0.88rem; line-height: 1.7;">This is what most prompts skip. You must explicitly name the boundary conditions. <em>"Test these edge cases: empty array, null input, negative values, zero, values above the maximum limit, duplicate items."</em></div>
  </div>
</div>

<div style="display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0;">4</div>
  <div>
    <div style="color: #4ade80; font-weight: 700; font-size: 0.95rem; margin-bottom: 4px;">The Error Conditions</div>
    <div style="color: #86efac; font-size: 0.88rem; line-height: 1.7;">Specify what should happen when the function receives bad input. <em>"When given invalid input it should throw an error with the message X"</em> or <em>"it should return null"</em> — be explicit about the contract.</div>
  </div>
</div>

<div style="display: flex; gap: 14px; align-items: flex-start;">
  <div style="background: #22c55e; color: #000; font-weight: 900; font-size: 1rem; border-radius: 8px; padding: 8px 14px; flex-shrink: 0;">5</div>
  <div>
    <div style="color: #4ade80; font-weight: 700; font-size: 0.95rem; margin-bottom: 4px;">The Test Framework</div>
    <div style="color: #86efac; font-size: 0.88rem; line-height: 1.7;">Name your testing library and any important setup. <em>"Use Vitest. Our test files use the pattern describe/it/expect. We use @testing-library/react for component tests."</em></div>
  </div>
</div>

Here is the same prompt — vague vs. framework-driven:

❌ Vague Prompt
"Write tests for my calculateDiscount function"
Result: 3 happy-path tests, all green, no edge cases, ships a discount calculation that accepts negative prices.

✅ Framework Prompt
"Write Vitest tests for calculateDiscount(price: number, code: string): number. Valid: price=100, code='SAVE10' → 90. Edge cases: price=0, price negative, empty code, invalid code, code already used. Error: should throw InvalidDiscountError for invalid codes."
Result: 11 tests, all meaningful, catches the negative price bug before it ships.

The Mutation Test — Proving Your Tests Are Real

This is the most important habit in this entire article. After AI generates your tests, deliberately break the implementation and check whether the tests catch it.

// Your function under test
function calculateDiscount(price: number, code: string): number {
  if (price <= 0) throw new Error('Invalid price');
  // ... discount logic
  return discountedPrice;
}

// THE MUTATION CHECK — temporarily change the implementation:
// Option 1: Make it always return 0
// Option 2: Remove the price validation
// Option 3: Return price * 2 instead of discounting

// If your tests don't fail when you make those changes — the tests are hollow.
// Fix the tests before reverting the implementation.

The rule: A test suite that doesn't fail when the implementation is broken is not a test suite. It is documentation that happens to run. Run mutation checks on every critical function before considering it tested.


Unit, Integration, and E2E — The Right Mix with AI

Not all tests are equal, and AI has very different reliability at each level.

Unit Tests
AI reliability: High — with the 5-part prompt framework
Best for: pure functions, business logic, utilities, validators

Integration Tests
AI reliability: Medium — review mock boundaries carefully
Best for: API routes, DB operations, service interactions

E2E Tests
AI reliability: Low — use AI for scaffolding only
Best for: critical user journeys you write yourself

For integration tests, the most common AI mistake is generating over-mocked tests that test the mocks instead of the real system. Always push back on over-mocking:

// Your prompt addition for integration tests:
"Do NOT mock the database. Use the actual test database.
Do NOT mock internal service modules.
Only mock: external HTTP APIs, email sending, and payment processors."

The Coverage Trap

AI can reach 90%+ coverage in minutes. This feels like an achievement. It is a trap.

Coverage % — What It Measures
→ Which lines of code were executed
→ Nothing about correctness
→ Nothing about edge case coverage
→ Nothing about assertion quality

You can hit 100% coverage with tautology tests and ship code that is entirely broken.

What You Should Measure Instead
→ Mutation score (do tests catch bugs?)
→ Edge case breadth (are boundaries tested?)
→ Assertion depth (are results verified?)
→ Failure rate over time (do tests catch real bugs in production?)

70% meaningful coverage > 100% hollow coverage.

The right way to use AI for coverage: generate tests for uncovered lines, then run the mutation check on each one to confirm they're meaningful before committing.


The One Habit That Changes Everything

After every AI testing session, ask this one question:

"If this function had a bug right now, would any of these tests fail?"

If the answer is "I'm not sure" — run the mutation check. If the answer is "no" — the tests are hollow. Fix them before you commit. This question has more protective value than any coverage target.


Next in AI Workflow

Part 14 — Taming Legacy Code with AI

200,000 lines. No documentation. Original author left two years ago. Here is how AI turns your most dreaded codebase into something you can actually work with.

AI Workflow

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →