Staging Is Not What You Think It Is
Every team believes its staging environment reflects production. Almost none do. Here is how to test in production safely instead.
Ask any engineering team whether their staging environment accurately reflects production and most will say yes. Ask them to walk through the specific differences and you will hear a different story: "well, we use a smaller database in staging," "third-party services point to their sandbox in staging," "we don't run the full job queue in staging, just a simplified version," "the cache configuration is different because staging doesn't need the same performance."
Each of these caveats sounds minor. Together, they describe a fundamentally different system. The staging environment that "reflects production" reflects production in the same way that a drawing of a house reflects a house: it captures the general shape, misses most of the load-bearing details, and is not actually habitable.
Why staging drifts
Staging starts as an honest attempt to mirror production. The intent is genuine. The problem is that the pressures pushing staging away from production are constant, while the pressures keeping them aligned are episodic.
Production infrastructure is sized for real traffic and costs money accordingly. A staging database on the same instance size as production costs the same but delivers less value, because fewer people use it and they use it less often. The rational economic decision is to downsize staging. And so it gets downsized, and with it goes the ability to replicate production's behavior under load.
Production has live integrations with third-party services: payment processors, identity providers, email deliverers, analytics systems. Some of these services don't offer sandbox environments. Others offer sandbox environments that behave slightly differently — different rate limits, different error shapes, different latency characteristics. Staging points at the sandboxes, or has the integrations disabled entirely. Every integration that doesn't behave in staging the same way it behaves in production is a class of production bug that staging cannot catch.
Configuration drift is the quietest form of divergence. Staging is initially configured from production's config with a few values changed. Over time, production config evolves: new feature flags, adjusted timeouts, tuned connection pool sizes, new environment variables for features that went live six months ago. Not all of these changes get propagated to staging. Nobody is responsible for ensuring they do. After a year, staging and production share a common ancestor in their configuration but are no longer the same system.
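One way to make this drift visible is to diff the two configurations on a schedule. The sketch below assumes both configs can be loaded as flat key/value mappings; the function name `config_drift`, the keys, and the values are all hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from typing import Any, Mapping


def config_drift(
    production: Mapping[str, Any],
    staging: Mapping[str, Any],
    expected_differences: frozenset[str] = frozenset(),
) -> dict[str, list[str]]:
    """Report keys whose presence or value differs between two flat configs.

    Keys in expected_differences (for example database hostnames) are
    meant to differ between environments and are skipped.
    """
    prod_keys = set(production) - expected_differences
    staging_keys = set(staging) - expected_differences
    return {
        "missing_in_staging": sorted(prod_keys - staging_keys),
        "missing_in_production": sorted(staging_keys - prod_keys),
        "value_differs": sorted(
            key for key in prod_keys & staging_keys
            if production[key] != staging[key]
        ),
    }


# Hypothetical example values, not taken from any real system.
production = {
    "DB_HOST": "db.prod.internal",
    "DB_POOL_SIZE": 50,
    "HTTP_TIMEOUT_SECONDS": 3,
    "FEATURE_NEW_CHECKOUT": True,
}
staging = {
    "DB_HOST": "db.staging.internal",
    "DB_POOL_SIZE": 5,
    "HTTP_TIMEOUT_SECONDS": 30,
}

report = config_drift(
    production, staging, expected_differences=frozenset({"DB_HOST"})
)
for category, keys in report.items():
    print(category, keys)
```

A report like this does not stop drift, but it gives someone a concrete list to own, which is the part that is usually missing.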
The specific bugs that staging misses
The bugs staging misses are not random. They have a pattern: they are bugs that require scale, real data, or real integrations to manifest.
Data volume bugs are the most common category. A query that returns in 50ms against a staging database with ten thousand rows returns in four seconds against a production database with forty million rows. An index that covers all the cases in staging doesn't cover the rare-but-valid query patterns that occur once the dataset is large enough. The code is identical in both environments; the behavior is not.
State machine bugs that depend on long-lived data are another category. Staging databases are usually reset periodically or populated with synthetic data. Production has users who signed up years ago, accounts with unusual configurations accumulated over time, records in edge-case states that synthetic data generation never thought to create. The production behavior for a five-year-old account with a billing status that has been through three migrations is not testable in staging because that record doesn't exist in staging.
Rate-limit and quota behaviors only appear in production because staging doesn't generate real traffic volume. A third-party API that allows a thousand requests per minute seems unlimited in staging, where your test traffic might generate ten requests per minute. The same integration in production hits the limit and fails in ways the code never anticipated.
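A small simulation makes the gap concrete. It assumes the third-party API uses a simple fixed-window limit of a thousand requests per minute, which is only one of several ways real providers enforce limits, and it assumes production sends 1,500 requests per minute. The class and function names are hypothetical.

```python
from __future__ import annotations


class FixedWindowLimiter:
    """Allows at most `limit` requests in each window of `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._window_index: int | None = None
        self._count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window_seconds)
        if window_index != self._window_index:
            # A new window has started, so the counter resets.
            self._window_index = window_index
            self._count = 0
        if self._count >= self.limit:
            return False
        self._count += 1
        return True


def count_rejections(requests_per_minute: int, limit: int) -> int:
    """Send evenly spaced requests for one minute and count the rejections."""
    limiter = FixedWindowLimiter(limit)
    interval = 60.0 / requests_per_minute
    return sum(
        0 if limiter.allow(i * interval) else 1
        for i in range(requests_per_minute)
    )


# Staging traffic from the text: ten requests per minute.
staging_rejections = count_rejections(requests_per_minute=10, limit=1000)
# Assumed production traffic: 1,500 requests per minute.
production_rejections = count_rejections(requests_per_minute=1500, limit=1000)
print(staging_rejections, production_rejections)
```

At staging volume the rejection branch is never taken, so whatever the code does when a request is refused goes unexercised until production.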
Testing in production is not as scary as it sounds
The response to staging's limitations is not "remove staging entirely" but "stop pretending staging is enough and build production testing practices."
Feature flags are the foundational tool here. A change behind a feature flag can be deployed to production without being enabled for users. Once the code is in production, you can enable the flag for internal users only — employees, contractors, known test accounts. You are now running the actual production code against the actual production infrastructure with real data volumes and real integrations, and the blast radius is controlled. This is more realistic than any staging environment and more controlled than a full rollout.
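As a minimal sketch of the internal-only stage, assume a flag can be targeted by email domain and by an explicit list of test account IDs. The `FeatureFlag` and `User` classes here are hypothetical and stand in for whatever flag system you already use; real flag services expose their own APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    email: str


@dataclass
class FeatureFlag:
    name: str
    enabled_for_everyone: bool = False
    internal_domains: frozenset[str] = frozenset()
    allowed_user_ids: frozenset[str] = frozenset()

    def is_enabled_for(self, user: User) -> bool:
        """Enabled for everyone, for listed test accounts, or for internal emails."""
        if self.enabled_for_everyone:
            return True
        if user.user_id in self.allowed_user_ids:
            return True
        domain = user.email.rpartition("@")[2].lower()
        return domain in self.internal_domains


# Hypothetical flag: deployed to production, visible only to internal users.
new_checkout = FeatureFlag(
    name="new_checkout",
    internal_domains=frozenset({"example.com"}),
    allowed_user_ids=frozenset({"test-account-1"}),
)


def checkout(user: User) -> str:
    if new_checkout.is_enabled_for(user):
        return "new checkout flow"
    return "existing checkout flow"


employee = User("u1", "dev@example.com")
customer = User("u2", "someone@customer.test")
print(checkout(employee))
print(checkout(customer))
```

Both code paths ship in the same deploy, so widening the audience later is a flag change rather than a new release.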
Canary deployments extend this: route a small percentage of real production traffic — one percent, five percent — to the new version before rolling out fully. This exposes the code to real users, real data, and real behavioral patterns with limited overall impact. The monitoring you already have for production applies automatically, because this is production. You don't have to hope your staging monitoring catches the right things; you're watching the real thing.
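The routing decision itself can be very small. The sketch below assumes traffic is split by hashing a stable routing key such as a user ID, so the same user always lands on the same version. In practice this usually lives in a load balancer or service mesh rather than application code, and the function names here are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # gives 0.01% granularity for the canary percentage


def bucket_for(routing_key: str) -> int:
    """Map a routing key (for example a user ID) to a stable bucket."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def routes_to_canary(routing_key: str, canary_percent: float) -> bool:
    """True if this key falls inside the canary share of traffic."""
    return bucket_for(routing_key) < canary_percent * BUCKETS / 100


def choose_version(routing_key: str, canary_percent: float) -> str:
    return "canary" if routes_to_canary(routing_key, canary_percent) else "stable"


# Hypothetical user IDs, used only to measure the resulting split.
user_ids = [f"user-{i}" for i in range(20_000)]
canary_share = sum(routes_to_canary(u, 5) for u in user_ids) / len(user_ids)
print(f"share routed to canary: {canary_share:.1%}")
```

Because the bucket depends only on the key, raising the percentage from one to five keeps every user who was already on the canary there and adds new ones, which keeps individual users from flipping between versions mid-rollout.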
Dark launching is another technique for the highest-stakes changes: run both the old and new code paths simultaneously in production, compare their outputs, and only surface the new outputs to users once you have statistical confidence that the results match. The new code is exercised under real production load before any user sees it. This is not always practical — it doubles the compute cost of every request during the testing period — but for critical paths like payment processing or data migrations, it is the most reliable way to validate a change.
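A minimal sketch of the comparison step, assuming both code paths are pure functions of the request and their outputs can be compared with `!=`. The wrapper name `dark_launch`, the `ShadowStats` class, and the two pricing functions are hypothetical. A real implementation would typically run the new path asynchronously so it cannot add latency, and would never shadow a path with side effects such as charging a card.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

Req = TypeVar("Req")
Res = TypeVar("Res")


@dataclass
class ShadowStats:
    total: int = 0
    mismatches: int = 0
    errors: int = 0

    @property
    def match_rate(self) -> float:
        if self.total == 0:
            return 0.0
        return (self.total - self.mismatches - self.errors) / self.total


def dark_launch(
    old: Callable[[Req], Res],
    new: Callable[[Req], Res],
    stats: ShadowStats,
) -> Callable[[Req], Res]:
    """Run both paths, record disagreements, and always return the old result."""

    def handler(request: Req) -> Res:
        old_result = old(request)
        stats.total += 1
        try:
            new_result = new(request)
        except Exception:
            # A failure in the new path must never reach the user.
            stats.errors += 1
        else:
            if new_result != old_result:
                stats.mismatches += 1
        return old_result

    return handler


# Hypothetical old and new implementations that disagree on rounding.
def old_total(cents: int) -> int:
    return cents + cents * 20 // 100


def new_total(cents: int) -> int:
    return cents + round(cents * 0.2)


stats = ShadowStats()
total_with_tax = dark_launch(old_total, new_total, stats)
results = [total_with_tax(cents) for cents in (100, 101, 104)]
print(results, stats)
```

The mismatch counter is what turns "the results look the same" into a number you can set a threshold on before switching users over.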
What staging is actually good for
None of this means staging is useless. It is excellent for a few specific kinds of validation: developer iteration before a change is ready for production, integration testing of interfaces between services when the specific integration is what you're testing rather than scale or data volume, and smoke tests to catch obvious breakage before a deploy reaches any production traffic.
The mistake is treating staging as a complete substitute for production verification rather than as an early filter that catches a subset of problems. Staging should catch code that does not work at all. It should not be expected to catch bugs that only appear under production conditions, because it does not run under production conditions.
The reframe that helps: staging is a safety check before deployment, not a validation that the deployment is correct. The validation happens in production, with the tooling — feature flags, canaries, observability, rollback capability — that makes doing so safe.
The cost of the comfort blanket
The false confidence staging provides is not neutral. It leads engineering organizations to make deployment decisions based on staging results that do not transfer to production, and to be surprised by production failures that a realistic assessment of staging's limitations would have predicted.
More significantly, it leads organizations to under-invest in production testing practices precisely because they believe staging covers the risk. The investment that would go into better feature flag infrastructure, better canary deployment tooling, and better production observability instead goes into maintaining a staging environment that provides false assurance.
Acknowledging that staging is not production is not a counsel of despair. It is the precondition for building the actual practices that make production deployments safe. The teams that have the quietest production incidents are not the ones with the most faithful staging environments. They are the ones who test in production carefully, observe constantly, and can roll back instantly. Staging is where they check that the code compiles and the basics work. Production is where they find out if it's actually correct.