You're Measuring Developer Productivity Wrong

Lines of code, PRs merged, story points, and even DORA metrics can be gamed or can mislead. Most orgs measure activity and call it productivity.

Engineering leadership wants to measure productivity. This is understandable. It's also where most teams make a decision that corrupts their data for years: they pick a metric that is easy to collect, announce it as the proxy for developer productivity, and then watch the organization optimize for the metric instead of the thing the metric was supposed to represent.

The damage isn't just that they're measuring the wrong thing. It's that a bad productivity metric actively degrades the behavior it's supposed to incentivize.

The usual suspects

Lines of code is obviously wrong — it rewards verbosity and penalizes cleanup — but it's worth examining why teams still use it, because the failure mode is instructive. It's countable. It produces a number. It goes into a spreadsheet. Management gets a sense of accountability. The fact that it measures almost nothing about actual value delivery is secondary to the fact that it measures something.

PRs merged has the same structure. On the surface, it seems better: shipping code matters, right? The problem is that PR size distributions shift once PR count becomes the target. Engineers learn to split work into smaller PRs, which has some genuine benefits but also produces review theater: PRs that are small, easy to approve, and never surface real design decisions. The metric gets gamed not because engineers are dishonest but because they're rational. Give people a number to optimize and they will optimize the number.

Story points are worse because they add a layer of indirection. Points are supposed to measure complexity and effort, but they're estimated by the same team that will be evaluated on them. Story point inflation is a widely reported pattern: teams under velocity pressure tend to inflate estimates over time. The metric becomes a negotiation instead of a measurement. Organizations that have been running Scrum for two years often have no idea whether their teams are getting faster or slower because the size of a point keeps changing.

Why DORA metrics aren't a complete answer

DORA metrics — deployment frequency, lead time for changes, change failure rate, mean time to recovery — are genuinely better than the alternatives above. They measure outcomes close to business value: how often you ship, how long it takes, how often you break things, how fast you recover. They're harder to game because they're mostly observable from infrastructure rather than self-reported.

But they have two problems that matter when you're using them to understand team productivity rather than just system health.

The first is that they describe what is happening, not why. A team with high deployment frequency could be shipping fast because they're highly productive, or because they're shipping tiny changes to avoid the risk of large ones, or because their deployment pipeline is so automated that the metric doesn't capture the actual development work at all. The number is real. The interpretation requires context the metric doesn't provide.

The second is aggregation. DORA metrics are system-level measurements. A senior engineer who spends a month rearchitecting a core service to enable faster future development might contribute zero deployments during that period. A junior engineer making trivial fixes contributes several. At the individual level, DORA metrics measure throughput in ways that can penalize exactly the kind of work that makes teams faster in the long run.

What actually predicts team velocity over time

Three things, none of which appear in most productivity dashboards.

The first is feedback loop speed. How long does it take a developer to go from "I have an idea for a fix" to "I can see whether it works"? This includes local test run time, CI duration, deployment time, and how quickly production observability surfaces results. Feedback loop speed is a forcing function on learning rate. Fast feedback loops let engineers iterate. Slow feedback loops mean engineers batch work into larger, riskier changes. The teams that compound velocity over time almost universally have fast inner loops.
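One way to make feedback loop speed concrete is to time each stage and look at both the total and the slowest stage. The sketch below is illustrative only: the stage names and the durations in the usage example are assumptions, not measurements from any real team, and `summarize_feedback_loop` is a hypothetical helper.

```python
from typing import Mapping


def summarize_feedback_loop(stage_seconds: Mapping[str, float]) -> dict:
    """Summarize a feedback loop from per-stage durations in seconds.

    Stage names are whatever the caller chooses; this sketch assumes one
    entry per stage between "I have an idea" and "I can see whether it works".
    """
    if not stage_seconds:
        raise ValueError("need at least one stage")
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    # The slowest stage is usually the first place to look for improvement.
    slowest = max(stage_seconds, key=lambda name: stage_seconds[name])
    return {
        "total_seconds": total,
        "slowest_stage": slowest,
        "slowest_share": stage_seconds[slowest] / total,
    }


# Hypothetical numbers, for illustration only.
stages = {"local_tests": 90, "ci": 600, "deploy": 300, "observability": 210}
summary = summarize_feedback_loop(stages)
```

With these made-up numbers, CI accounts for half of the loop, which is the kind of finding a throughput dashboard would not surface.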

The second is deployment confidence. What is the probability that a given deployment works without manual intervention or immediate rollback? A team that deploys daily but reverts 20% of deployments is not a high-performing team. They're a high-activity team with a reliability problem. Deployment confidence is the product of test quality, observability, and architecture that supports safe changes. It predicts whether velocity is sustainable.
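As a rough sketch, deployment confidence can be estimated as the share of deployments that needed neither a rollback nor a manual fix. The `Deployment` record and its field names are assumptions made for this example; a real pipeline would have to derive them from its own deploy and incident logs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    # Hypothetical fields; map them to whatever your tooling records.
    rolled_back: bool
    needed_manual_fix: bool


def deployment_confidence(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that worked without rollback or manual intervention."""
    if not deployments:
        raise ValueError("need at least one deployment")
    clean = sum(
        1 for d in deployments if not (d.rolled_back or d.needed_manual_fix)
    )
    return clean / len(deployments)


# Ten deployments, two reverted: the "deploys daily, reverts 20%" team.
history = [Deployment(rolled_back=(i < 2), needed_manual_fix=False) for i in range(10)]
confidence = deployment_confidence(history)
```

A single ratio like this hides why deployments fail, so it works best alongside the qualitative signals: alert volume after deploys and how engineers say they feel when they ship.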

The third is cognitive load per change. How much does a developer need to hold in their head to make a change safely? In a well-structured codebase with clear boundaries and good tests, you can change the pricing module without understanding the authentication system. In a tangle of shared state and implicit dependencies, every change requires global context. Teams with high cognitive load per change are slower than their raw throughput metrics suggest, because most of the work is invisible: the mental modeling, the fear of breaking something unexpected, the careful manual testing before each merge.

The measurement that actually helps

If you want a single metric that predicts sustainable developer productivity, measure the time from "decision to ship a feature" to "that feature is in production for real users." Not hands-on coding time, not story points, but total elapsed time including waiting, review, blocked states, and rework. This is sometimes called cycle time.

Cycle time is hard to game because you can't inflate the clock. It captures everything: team size, process friction, technical bottlenecks, deployment complexity. When cycle time goes down, something real improved. When it goes up, something real got worse.
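A minimal sketch of that measurement, assuming you can record two timestamps per feature: when the decision to ship was made and when it reached real users. The function names and the sample dates are hypothetical; where those timestamps come from (ticket tracker, deploy log) depends on your tooling.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Tuple


def cycle_time(decided_at: datetime, in_production_at: datetime) -> timedelta:
    """Elapsed wall-clock time, including waiting, review, and rework."""
    if in_production_at < decided_at:
        raise ValueError("production timestamp precedes the decision")
    return in_production_at - decided_at


def median_cycle_time_days(features: Iterable[Tuple[datetime, datetime]]) -> float:
    """Median cycle time in days across (decided_at, in_production_at) pairs."""
    days = [
        cycle_time(decided, shipped).total_seconds() / 86400
        for decided, shipped in features
    ]
    # The median is less distorted by one very slow feature than the mean is.
    return median(days)


# Made-up sample data: cycle times of 10, 4, and 20 days.
features = [
    (datetime(2024, 1, 1), datetime(2024, 1, 11)),
    (datetime(2024, 1, 5), datetime(2024, 1, 9)),
    (datetime(2024, 2, 1), datetime(2024, 2, 21)),
]
typical_days = median_cycle_time_days(features)
```

Tracking the median over time, rather than a single snapshot, is what shows whether something real improved or got worse.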

But even cycle time is a lagging indicator. By the time you see it rise, the conditions that caused it to rise are already embedded. The leading indicators are the three things above: feedback loop speed, deployment confidence, cognitive load. These predict where cycle time is going before it gets there.

The reason most teams don't measure these things is that they require instrumentation, observation, and conversation rather than a report. You can't download deployment confidence from Jira. You have to measure it by looking at rollback rates, post-deploy alert volume, and whether engineers say they're nervous when they deploy. That's harder. It's also more accurate.

The cost of measuring the wrong thing

When you measure the wrong thing, you don't just get wrong data. You change what your team optimizes for. Engineers are smart people who will respond to incentives. If the metric is PRs merged, you get more PRs. If the metric is story points, you get point inflation. If the metric is deployment frequency, you get small, frequent deployments whether or not that's the right approach for the problem.

The worst outcome isn't a bad metric. It's a bad metric that gets integrated into performance reviews, because then you've coupled individual careers to the wrong signal. Engineers who do genuinely high-leverage work (improving test infrastructure, reducing system complexity, mentoring junior engineers) become invisible in the productivity ledger. Engineers who generate activity become visible. Over time, the team's mix of work shifts toward the measurable kind and away from the leveraged kind.

Measure activity and you will get activity. Measure outcomes and you might get productivity. The distinction is not subtle, but it requires resisting the organizational pull toward things that are easy to count.
