Your CI Pipeline Is Lying to You
Green CI doesn't mean working software. Flaky tests, mocked dependencies, and coverage theater have turned CI into a checkbox ritual.
The build is green. It's been green for two weeks. You merge confidently, deploy to production, and within twenty minutes someone is paging you because the payment flow is broken.
You check the test suite. The relevant tests passed. The coverage report shows 84%. The CI log has nothing but green checkmarks.
Your CI pipeline lied to you, and the worst part is that it's been lying for months. You just didn't notice because the previous lies didn't happen to coincide with production incidents.
The anatomy of a false green
There are a few distinct ways a CI pipeline can pass while hiding real failures. Understanding which one is affecting your system determines what you actually need to fix.
The most common is the mocked dependency trap. Your tests mock the database, the third-party API, the message queue. The mocks behave exactly as documented. The problem is that the real dependency doesn't behave as documented — it returns slightly different error shapes, enforces rate limits you didn't account for, or has schema drift you haven't noticed yet. Your tests are green because they're testing your assumptions about the dependency, not the dependency. When those assumptions are wrong, production breaks and CI stays green.
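A minimal sketch of that trap, under stated assumptions: the payload shapes, `decline_reason`, and `satisfies_contract` are hypothetical and do not come from any real payment API. It shows a mock that matches the documentation and a recorded real response that has drifted. One hedge against the drift is to run the same contract check against both.

```python
# Hypothetical example: names and payload shapes are illustrative assumptions.

def decline_reason(response: dict) -> str:
    """Application code written against the documented error shape."""
    return response["error"]  # assumes "error" is a plain string


# What the mock returns (matches the documentation).
MOCKED_RESPONSE = {"error": "card_declined"}

# What the real service is assumed to return after an unannounced change,
# captured from a real call and checked in as a fixture.
RECORDED_REAL_RESPONSE = {"error": {"code": "card_declined", "retryable": False}}


def satisfies_contract(response: dict) -> bool:
    """The shape the application relies on: "error" must be a string."""
    return isinstance(response.get("error"), str)


# Run the same check against the mock AND the recorded real response.
mock_ok = satisfies_contract(MOCKED_RESPONSE)
real_ok = satisfies_contract(RECORDED_REAL_RESPONSE)

if mock_ok and not real_ok:
    print("mock and real dependency disagree: the mock is testing an assumption")
```

A suite that only exercises `MOCKED_RESPONSE` stays green here. The contract check only helps if the recorded response is refreshed from the real dependency on a regular schedule.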
The second failure mode is flaky tests that everyone knows about and nobody fixes. A test fails 30% of the time due to a race condition or timing issue. The policy, usually unwritten, is to re-run CI until it passes. Developers learn this early. Within a few weeks, a failed CI run is no longer a signal; it's an inconvenience. You hit retry, the test passes, you merge. The suite now has almost no signal value: a failure might be a real regression or might just be flakiness, and nobody can tell which without investigating. The CI pipeline has successfully trained engineers to ignore it.
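The arithmetic behind the retry habit can be sketched in a few lines. This assumes each attempt fails independently with the same probability, which is a simplification; the function name is illustrative.

```python
def pass_probability(failure_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs passes."""
    return 1.0 - failure_rate ** attempts


# A test that fails 30% of the time, with one initial run plus two retries.
green_with_retries = pass_probability(0.30, attempts=3)
print(f"chance of eventually going green: {green_with_retries:.3f}")
```

With a 30% failure rate and three attempts, the run goes green roughly 97% of the time. The same number applies to a real intermittent bug that fails 30% of the time, so retries hide both equally well.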
The third is coverage theater. Line coverage rewards you for running code, not for testing behavior. A test that instantiates a class and calls a method gets coverage credit for every line that call executes, even if it asserts nothing. Some codebases have high coverage numbers produced mostly by tests that are structurally correct but behaviorally empty: they call the code but don't verify that the code did the right thing. The coverage report is accurate. The quality signal is meaningless.
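A small illustration of the difference, using a hypothetical `apply_discount` function with a deliberate bug. Both checks below execute every line of the function, so a line-coverage tool would score them identically. Only one of them notices the bug.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical function with a deliberate bug: it adds the discount."""
    return total_cents + total_cents * percent // 100


def empty_check() -> None:
    # Executes every line of apply_discount and asserts nothing.
    apply_discount(10_000, 10)


def behavioral_check() -> None:
    # Executes the same lines and verifies the result.
    assert apply_discount(10_000, 10) == 9_000


empty_check()  # passes despite the bug

try:
    behavioral_check()
    bug_detected = False
except AssertionError:
    bug_detected = True
```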
The slow drift toward ritual
CI starts useful. Early in a project, the test suite is small, fast, and written by engineers who understand what they're testing. A failure genuinely means something. You build trust in the signal.
Then time passes. The team grows. Engineers commit tests because the PR template requires it. The suite gets slower. Someone introduces a shared test utility that makes it easy to write tests that look thorough but don't probe edge cases. A few flaky tests get a retry: 2 annotation instead of a fix. Coverage thresholds get set at the current coverage number so they pass without anyone writing new tests.
None of these individual decisions are catastrophic. Together, they transform CI from a feedback system into a compliance system. The question stops being "does this change work?" and becomes "did CI pass?" Those look identical from the outside. They are not.
The signal decay is gradual enough that teams rarely notice the transition. By the time the CI pipeline is consistently lying, everyone has adjusted their mental model: CI is something you satisfy, not something you trust. But this adjustment is usually implicit. The engineering culture still talks about CI as if it provides quality guarantees while behaving as if it doesn't.
What makes a test worth writing
A test is worth writing if it would catch a real failure that a developer wouldn't immediately catch by reading the diff. That's it. Tests that only catch errors so obvious they'd never be merged aren't providing value. Tests that are coupled so tightly to the implementation that they break on every refactor are creating drag without catching bugs. Tests that run so slowly that CI takes forty minutes are making developers skip local runs.
The hardest category to evaluate is integration tests with mocks. They sit in the middle: more realistic than unit tests, less realistic than end-to-end tests. The question is whether the mock accurately models the dependency's failure modes. If you're mocking a database and your mock never returns a deadlock error, you've excluded a real production failure mode from your test suite. That's not a testing philosophy question. That's a gap in what you're actually checking.
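As a sketch of what closing that gap might look like, here is a hand-written test double that can be told to raise a deadlock before succeeding. `DeadlockError`, `FakeDatabase`, and `execute_with_retry` are hypothetical names for illustration; a real driver has its own exception type, and the double should mirror it.

```python
class DeadlockError(Exception):
    """Stand-in for a database driver's deadlock exception (hypothetical)."""


class FakeDatabase:
    """Test double that fails with a deadlock a configurable number of times."""

    def __init__(self, deadlocks_before_success: int = 0) -> None:
        self.remaining_deadlocks = deadlocks_before_success
        self.calls = 0

    def execute(self, statement: str) -> str:
        self.calls += 1
        if self.remaining_deadlocks > 0:
            self.remaining_deadlocks -= 1
            raise DeadlockError("deadlock detected")
        return "ok"


def execute_with_retry(db: FakeDatabase, statement: str, max_attempts: int = 3) -> str:
    """Code under test: retry on deadlock, re-raise once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return db.execute(statement)
        except DeadlockError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


# Two deadlocks, then success: the retry path is actually exercised.
db = FakeDatabase(deadlocks_before_success=2)
result = execute_with_retry(db, "UPDATE accounts SET balance = balance - 1")
```

A double that never raises `DeadlockError` would leave the retry branch untested, however many assertions surround the happy path.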
The healthiest test suites have a clear separation: fast, isolated unit tests for pure logic; contract tests or test doubles that are actually verified against the real dependency's behavior; and a small number of end-to-end tests that exercise the critical paths against real infrastructure, even if only in a staging environment. Most teams have the first layer over-built and the third layer absent.
Flaky tests are a debt payment you're deferring
Every flaky test is a defect in your test infrastructure that you're choosing to defer. The immediate cost of fixing it is often an afternoon. The ongoing cost of not fixing it is a permanent degradation of the signal value of every CI run. Teams that tolerate flakiness are making a trade: save time now, pay with reduced quality signal indefinitely. That's a bad trade.
The practical approach is a flakiness budget. Any test that fails more than once in a hundred runs without a code change gets quarantined — moved to a separate slow suite that doesn't block merge, with a ticket filed to fix it. The key is that the quarantine is visible. You can see how many tests are in the flaky bucket and track whether the number is growing or shrinking. "Flaky" is not a permanent category; it's a stage in the remediation queue.
What CI should actually catch
The right framing for a CI pipeline is: what failures would be expensive enough to matter in production, and what is the cheapest test that would catch them? Build that test. Don't build the test that's easy to write.
For most web applications, the expensive failures are: broken API contracts between services, database schema changes that break existing queries, authentication and authorization bugs, and payment flow failures. Those are the tests worth investing in. They're harder to write than unit tests. They require real infrastructure or realistic test doubles. They're slower. They're also the ones that actually prevent the incidents that cost you the most.
A CI pipeline with 300 passing tests that don't cover any of those failure modes is not a quality gate. It's a performance of quality — something you can point to in a postmortem as evidence that you tried, while the real production bugs propagate undetected.
Green means the pipeline ran. It doesn't mean the software works. The gap between those two statements is where most production incidents live.