Flaky Tests Don't Just Waste Time — They Destroy Trust

Flaky CI doesn't just slow you down. It teaches engineers to ignore red. Once that habit forms, your test suite stops being a safety net.

The build fails. The engineer re-runs it. It passes. They merge.

This is the beginning of the end of your test suite's usefulness.

Flaky tests are not an inconvenience. They're a trust problem. And once trust is gone, it's almost impossible to rebuild without burning the suite down and starting over.

What "flaky" actually means

A flaky test is one that produces different results on the same code without any change to the code. It passes sometimes. It fails sometimes. The failure carries no signal about whether the code is correct.

The causes are predictable:

  • Timing dependencies. Tests that wait for something to happen but don't wait long enough — or wait for a fixed duration instead of a condition.
  • Shared state. Tests that modify global state and rely on execution order.
  • External dependencies. Tests that hit real APIs, databases, or file systems and fail when those are slow or unavailable.
  • Race conditions. Async code that's only deterministic on fast machines.
  • Environment sensitivity. Tests that pass locally but fail in CI because of OS differences, timezone assumptions, or locale-specific behavior.

Most flaky tests start as solid tests that rotted as the codebase changed. A few were always flaky and nobody noticed until CI became the source of truth.

The trust curve

Here's how the trust curve works:

Month 1: Test fails. Engineer investigates. Finds nothing. Re-runs. Passes. Notes it as a one-off.

Month 2: Same test fails again. Two others also flaky. Engineers share a Slack message: "just re-run it." The phrase enters the vocabulary.

Month 3: "Just re-run it" is now institutional knowledge. New engineers learn it in their first week. It's framed as wisdom, not dysfunction.

Month 6: Red CI is a yellow flag, not a red one. Engineers merge on green knowing the previous run was red. Some skip waiting for CI entirely on low-risk changes.

Month 12: A real regression slips through. Nobody caught it because everyone assumed the failure was flakiness. The incident happens in production.

The postmortem will say something about monitoring. The real cause is that the team was trained — by their own test suite — to ignore failures.

Why "just quarantine it" doesn't work

The standard advice is to quarantine flaky tests: skip them, mark them as expected failures, move them to a separate job that doesn't block the build.

Quarantine is appropriate as a short-term triage tool. A quarantined test is honest: it says "this test doesn't work right now." A flaky test is dishonest: it says "this might be fine."

The problem is that quarantine becomes permanent. The quarantine folder grows. Tests in quarantine don't get fixed because fixing them isn't on the critical path. Nobody is rewarded for fixing a test that's already not blocking the build.

After a year, your quarantine folder has 40 tests, covering functionality that's no longer verified by any test that runs. The quarantine folder is where test coverage goes to die.

The only real fix is fixing the test.

The economics of flaky tests

Take a flaky test that causes one unnecessary re-run per engineer per day on a team of ten:

  • 10 re-runs × 5 minutes average wait = 50 engineer-minutes per day
  • 50 minutes × 250 working days = ~208 engineer-hours per year
  • At a $150k fully-loaded annual cost per engineer (about $72 an hour), that's ~$15,000 per year per flaky test
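If you want to plug in your own numbers, the arithmetic above fits in a few lines. This is a rough sketch, not a costing model: the function name is made up for illustration, and the 2,080-hour work year (40 hours × 52 weeks) is my assumption for turning an annual cost into an hourly rate.

```python
def flaky_test_annual_cost(
    reruns_per_day: int = 10,
    minutes_per_rerun: float = 5,
    working_days: int = 250,
    annual_cost_usd: float = 150_000,
    hours_per_year: float = 2_080,  # assumption: 40 h/week x 52 weeks
) -> tuple[float, float]:
    """Return (engineer-hours lost per year, dollar cost per year)."""
    hours_lost = reruns_per_day * minutes_per_rerun * working_days / 60
    hourly_rate = annual_cost_usd / hours_per_year
    return hours_lost, hours_lost * hourly_rate


hours, dollars = flaky_test_annual_cost()
print(f"{hours:.0f} engineer-hours, about ${dollars:,.0f} per year")
```

With the defaults this lands on the same figures as the list: roughly 208 hours and a little over $15,000.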

That's the direct cost. The indirect cost — the trust erosion, the regression that slips through, the incident — is harder to price but much larger.

Most teams have more than one flaky test.

How to actually fix it

Track flakiness systematically. A test that failed and then passed on retry is a flaky test. Log it. Most CI systems expose this data; you just have to collect it. A simple spreadsheet with test name, failure count, and last-seen date tells you where to focus.
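Here is one way that spreadsheet could be generated. It's a minimal sketch that assumes you can export CI results as one record per test execution; the `CiResult` shape and the `flaky_report` name are hypothetical, not any CI vendor's API. It counts a test as flaky on a commit when it both failed and passed on that same commit, without checking which came first.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CiResult:
    """One test execution exported from CI (hypothetical record shape)."""
    test: str
    commit: str
    passed: bool
    day: date


def flaky_report(results):
    """Per test: number of commits with mixed outcomes, and last failure date."""
    outcomes = defaultdict(set)  # (test, commit) -> set of observed outcomes
    last_fail = {}               # (test, commit) -> most recent failing day
    for r in results:
        key = (r.test, r.commit)
        outcomes[key].add(r.passed)
        if not r.passed:
            last_fail[key] = max(last_fail.get(key, r.day), r.day)

    report = {}  # test -> (flaky failure count, last seen)
    for (test, commit), seen in outcomes.items():
        if seen == {True, False}:  # same code, different results
            count, last = report.get(test, (0, date.min))
            report[test] = (count + 1, max(last, last_fail[(test, commit)]))

    # Worst offenders first.
    return sorted(report.items(), key=lambda kv: kv[1][0], reverse=True)
```

A test that only ever fails on a commit is left out on purpose: that's a candidate real failure, not flakiness.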

Fix the worst offenders first. Pareto usually applies: roughly 20% of flaky tests cause 80% of re-runs. Find those and fix them. You don't need to fix everything to stop the bleeding.
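Given re-run counts per test, picking the short list is a sort and a running total. A small sketch, with a hypothetical `worst_offenders` helper and the 80% cutoff as a tunable parameter:

```python
def worst_offenders(rerun_counts: dict[str, int], share: float = 0.8) -> list[str]:
    """Smallest set of tests, worst first, covering `share` of all re-runs."""
    total = sum(rerun_counts.values())
    picked, covered = [], 0
    ranked = sorted(rerun_counts.items(), key=lambda kv: kv[1], reverse=True)
    for test, count in ranked:
        if covered >= share * total:
            break
        picked.append(test)
        covered += count
    return picked
```

If the returned list is most of your flaky tests, Pareto isn't helping you and the problem is probably systemic, such as shared state or infrastructure, rather than a few bad tests.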

Quarantine with a deadline. If you must quarantine, attach a date. "This test is quarantined until [date]. If it's not fixed by then, it gets deleted." Deletion is often the right call — a test that nobody can fix isn't providing coverage anyway.
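The deadline is easier to keep when the test suite enforces it. One possible shape, sketched with the standard library's `unittest`: a hypothetical `quarantined_until` decorator that skips the test until the date and fails loudly afterwards. Failing after the deadline, rather than quietly running the test again, is my design choice; it forces the fix-or-delete decision.

```python
import functools
import unittest
from datetime import date


def quarantined_until(deadline: date, reason: str, today=date.today):
    """Skip the test until `deadline`; after that, fail instead of skipping."""
    def decorate(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            if today() <= deadline:
                raise unittest.SkipTest(f"quarantined until {deadline}: {reason}")
            # The quarantined test body is never run; the expiry is the failure.
            raise AssertionError(
                f"quarantine expired on {deadline}: fix or delete this test ({reason})"
            )
        return wrapper
    return decorate


class CheckoutTests(unittest.TestCase):
    @quarantined_until(date(2030, 1, 31), "intermittent timeout in payment stub")
    def test_checkout_total(self):
        ...  # original test body stays here until fixed or deleted
```

The `today` parameter exists only so the decorator itself can be tested with a fixed date.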

Eliminate shared state. Most flaky tests share state they shouldn't. The fixes are well known: transactions that roll back at the end of each test, in-memory stores that reset, fresh containers per test run. The cost is speed; the benefit is determinism. Determinism is worth it.
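As a concrete illustration of the "in-memory stores that reset" option, here's a minimal sketch using `unittest` and `sqlite3` from the standard library. The `accounts` table and the test names are invented for the example. Each test gets its own database in `setUp`, so neither test can see the other's rows, whatever order they run in.

```python
import sqlite3
import unittest


class AccountTests(unittest.TestCase):
    def setUp(self):
        # Fresh in-memory database per test: nothing leaks between tests.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)"
        )
        self.addCleanup(self.db.close)

    def test_deposit(self):
        self.db.execute("INSERT INTO accounts VALUES ('alice', 0)")
        self.db.execute(
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'alice'"
        )
        (balance,) = self.db.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'"
        ).fetchone()
        self.assertEqual(balance, 50)

    def test_starts_empty(self):
        # Passes whether or not test_deposit ran first.
        (count,) = self.db.execute("SELECT COUNT(*) FROM accounts").fetchone()
        self.assertEqual(count, 0)
```

Run it with `python -m unittest`. Against a real database server, the same idea is usually done by opening a transaction in `setUp` and rolling it back in cleanup.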

Replace timing with conditions. sleep(500) is a lie — it works until the machine is under load, then it doesn't. Wait for the condition: element visible, response received, queue empty. Polling with a timeout is more code but it's honest.
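"Polling with a timeout" doesn't have to be much code. A minimal sketch of a hypothetical `wait_until` helper (many test frameworks ship their own equivalent, so check yours first):

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Instead of: time.sleep(0.5); assert queue_is_empty()
# Write:      wait_until(queue_is_empty, timeout=5.0)
# (`queue_is_empty` stands in for whatever condition your test cares about.)
```

On a fast machine this returns as soon as the condition holds, so it's often quicker than the fixed sleep it replaces. On a loaded machine it waits longer instead of failing, and when it does fail, the timeout tells you which condition never became true.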

Run the full suite in CI, not just locally. Flaky tests are often flaky only under CI conditions: parallel execution, a different OS, slower disk. Running the full suite locally on each change helps, but CI is where the flakiness actually shows up.

The cultural fix

Technical fixes solve the mechanism. The cultural fix changes what "red CI" means to your team.

The goal: a failing test is assumed to be a real failure until proven otherwise. Not "assume flakiness, re-run to check." Assume the code is broken, investigate, and merge only once you've shown it isn't.

This requires two things:

  1. Flakiness is low enough that most failures are real. You get there by fixing flaky tests.
  2. Re-running without investigation is socially not-okay. Not in a punitive way — in a "we've decided as a team that this isn't how we work" way.

The second is impossible if the first isn't true. You can't ask engineers to investigate every CI failure when 70% of failures are noise. But once flakiness is under 10%, the norm becomes sustainable.

The leading indicator

Measure your re-run rate. What percentage of CI runs that failed were re-run and then passed? That number is your flakiness tax.

Under 5% is healthy. Between 5% and 15% is concerning. Over 15% means your test suite is a coin flip and you've probably already lost the trust.
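The measurement and the thresholds fit in a few lines. A sketch, assuming you can count two things from your CI history: failed runs, and failed runs that passed when re-run on the same commit. The function names and the band labels are mine.

```python
def rerun_rate(failed_runs: int, passed_on_rerun: int) -> float:
    """Share of failed CI runs that passed when re-run without a code change."""
    if failed_runs == 0:
        return 0.0
    return passed_on_rerun / failed_runs


def classify(rate: float) -> str:
    """Map a re-run rate onto the bands described above."""
    if rate < 0.05:
        return "healthy"
    if rate <= 0.15:
        return "concerning"
    return "trust probably lost"


rate = rerun_rate(failed_runs=200, passed_on_rerun=24)
print(f"{rate:.0%} re-run rate: {classify(rate)}")
```

Track it weekly rather than as a one-off; the trend tells you whether the fixes are working.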

The number will shock you. Most teams that measure it for the first time find it's higher than they thought. That's the point — surface it, name it, fix it.

A test suite that engineers trust is a competitive advantage. It's the thing that lets you ship on Friday. It's the thing that gives you confidence in a big refactor. Flakiness erodes that confidence quietly, one re-run at a time.
