Measuring Engineering Productivity When AI Writes Half the Code
DORA and SPACE still work — they just need surgical updates for the AI era. A practical four-layer metric stack for engineering leaders
TL;DR
The “ten-times-faster developer with AI” claim does not survive contact with rigorous measurement — and the frameworks built for the pre-AI era need surgical updates, not replacement.
The most striking finding in recent engineering-productivity research is not that AI makes developers faster. It is how spectacularly wrong everyone is about whether it does. In a randomized controlled trial published in mid-2025, the nonprofit research organization METR measured experienced open-source developers working on their own repositories with and without modern AI tools. The developers expected AI to speed them up by twenty-four percent. They believed, after the fact, that AI had sped them up by twenty percent. In reality, AI slowed them down by nineteen percent. The size of the perception gap is what made the study famous. The methodological rigor is what makes it important.
Every engineering leader I talk to is being asked some version of the same question by their CEO: how much faster are we shipping with AI? The honest answer is that almost nobody has good data yet, and the metrics most teams are using to claim productivity gains are exactly the metrics most likely to mislead them. This post is about what the rigorous evidence actually shows, what the existing frameworks (DORA, SPACE, DX Core 4) still measure correctly, what they miss when AI writes a substantial share of the code, and a practical metric stack you can put in place this quarter without buying anything.
What the data actually says
Three sources are worth knowing about before you draw any conclusions.
The first is the METR randomized controlled trial, which tested experienced developers on their own codebases using mainstream tools (Cursor with Claude Sonnet). The headline nineteen-percent slowdown is real, but the more useful finding is the decomposition: developers spent more time prompting, waiting for output, and reviewing or fixing AI-generated code than they would have spent just writing it themselves on familiar territory. The result does not generalize to junior developers on unfamiliar codebases, where AI can genuinely help. It generalizes to anyone confident enough to assume the tools must be helping just because they feel good.
The second is the 2025 DORA Report, Google’s annual survey of nearly five thousand technology professionals, this year subtitled “State of AI-Assisted Software Development.” AI adoption hit roughly ninety percent. Most developers report feeling more productive. And yet — the part that gets quietly buried in summary slides — the report finds that AI does not automatically improve software delivery performance. What matters is the surrounding system: platform quality, workflow design, code-review culture. AI amplifies the engineering organization you already have. If your platform is weak, AI accelerates the production of weak software.
The third is the body of vendor and customer research, including GitHub’s own published case studies with Accenture and others. These typically show modest but real lifts on activity metrics — a few percent more pull requests per developer, slightly higher merge rates, suggestion acceptance rates of roughly a third. The honest summary is that AI moves activity metrics measurably for typical enterprise teams, but the effect on delivery and outcome metrics is far smaller and depends entirely on the system around it.
Three different methodologies, three different angles, one coherent picture: AI moves activity metrics, sometimes moves delivery metrics, almost never moves outcomes by the headline numbers the marketing implies. Anyone telling you their team is twice as fast with Copilot is reporting feelings, not evidence.
What DORA and SPACE still get right
The temptation in the face of new technology is to throw out the old frameworks. Resist it. The foundational work on engineering productivity was built carefully enough that most of it still works.
DORA’s original four metrics — deployment frequency, lead time for changes, change failure rate, and mean time to restore — remain the cleanest available signal of software delivery performance. They are outcome metrics, not activity metrics, and that distinction matters more in the AI era, not less. If AI helped your team ship more code but lead time and change failure rate stayed flat, you have not improved delivery. You have improved typing. The fifth metric DORA later added, a measure of operational reliability, closes a gap that mattered before AI and matters more after.
SPACE, introduced in 2021 by Forsgren, Storey, and colleagues at Microsoft Research and GitHub, did something different: it gave us a framework rather than a fixed metric set. Its five dimensions — Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow — come with an explicit warning that activity metrics in isolation mislead. The framework is still right. If anything, AI-era measurement makes SPACE’s central insight more relevant: throughput without satisfaction, communication, and quality is a dashboard, not productivity.
The DX Core 4, published in late 2024 by DX in collaboration with the SPACE authors, is a useful synthesis: four dimensions — Speed, Effectiveness, Quality, Impact — that combine DORA’s delivery focus with SPACE’s developer-experience focus into something prescriptive enough to deploy. Its core prescription — counterbalanced metrics, so that gaming one becomes visible in another — is the discipline that keeps Goodhart’s law from eating your dashboards.
You do not need to replace any of these. You need to extend them.
What the frameworks miss when AI writes the code
Three specific things DORA, SPACE, and Core 4 did not have to measure before AI, and now have to:
First, where the cognitive work actually went. When a developer spends three hours prompting, reviewing, and fixing AI-generated code instead of writing it themselves, the pull-request count is the same but the work is different. SPACE’s Efficiency dimension hints at this, but no standard metric captures the shift from authoring to supervising. This is the category most likely to be missed, and the one most likely to cause silent burnout.
Second, AI-specific defect patterns. Code that compiles and passes review can still carry characteristic AI failure modes — confabulated APIs, near-correct but subtly wrong logic, missing edge cases the model never saw in training. Tracking change failure rate alone hides this. You also need rework rate, bugs introduced per PR, and defects caught post-merge, tagged by code origin. Telemetry studies through 2025 consistently flagged a pattern in which AI-assisted teams ship more code but introduce more incidents per change. If your measurement does not separate those two, you cannot see it.
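To make that concrete: a minimal sketch of origin-tagged defect tracking, assuming you capture an origin label on every merged PR (a PR-template checkbox or a commit trailer is enough). The record fields and labels here are illustrative, not taken from any particular tool.

```python
from collections import Counter

# Illustrative records: one per merged PR, with an origin label captured at
# review time and a flag set later if a post-merge defect or incident was
# traced back to the change. Field names are assumptions for this sketch.
merged_prs = [
    {"origin": "ai_assisted", "post_merge_defect": True},
    {"origin": "ai_assisted", "post_merge_defect": False},
    {"origin": "human_authored", "post_merge_defect": False},
    # ... one record per merged PR in the measurement window
]

def defect_rate_by_origin(prs: list[dict]) -> dict[str, float]:
    """Post-merge defects normalized per change, split by code origin.

    Raw incident counts rise with any throughput increase; dividing by the
    number of changes shows whether the code itself got riskier.
    """
    changes, defects = Counter(), Counter()
    for pr in prs:
        changes[pr["origin"]] += 1
        defects[pr["origin"]] += pr["post_merge_defect"]
    return {origin: defects[origin] / changes[origin] for origin in changes}

print(defect_rate_by_origin(merged_prs))
# {'ai_assisted': 0.5, 'human_authored': 0.0}
```

Capture the origin label at review time; reconstructing it months later is guesswork.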
Third, trust and reliance calibration. The DORA report’s finding that roughly thirty percent of developers do not trust AI-generated code is not a sentiment metric — it is a leading indicator. Teams that over-trust produce more defects; teams that under-trust waste the tool. The right measurement is how often AI suggestions are accepted, then later reverted or substantially modified — a “regret rate” that complements raw acceptance rates. Vendors will not give you this metric. You have to instrument it yourself.
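Here is one way to build that instrumentation, offered as a sketch rather than a recipe. It assumes you can log the commit SHA that lands each accepted suggestion (how you capture that depends on your assistant’s telemetry), and it treats a suggestion as regretted when most of its added lines no longer survive at HEAD. The git-blame survival heuristic and the one-half threshold are illustrative choices.

```python
import subprocess

# A rough, vendor-independent regret-rate sketch. Assumes each accepted AI
# suggestion is logged with the commit SHA that landed it (the logging side
# is not shown). A suggestion counts as "regretted" if fewer than half its
# added lines are still attributed to that commit by `git blame` at HEAD.
# Thresholds are illustrative, and file renames will undercount survival.

def survival_fraction(repo: str, commit: str) -> float:
    """Fraction of lines added by `commit` still blamed to it at HEAD."""
    numstat = subprocess.run(
        ["git", "-C", repo, "show", "--numstat", "--format=", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    added, paths = 0, []
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit():  # skips binary files ("-")
            added += int(parts[0])
            paths.append(parts[2])
    if added == 0:
        return 1.0
    surviving = 0
    for path in paths:
        blame = subprocess.run(  # no check=True: deleted paths count as zero
            ["git", "-C", repo, "blame", "-l", "--", path],
            capture_output=True, text=True,
        ).stdout
        surviving += sum(1 for l in blame.splitlines() if l.startswith(commit))
    return min(surviving / added, 1.0)

def regret_rate(repo: str, suggestion_commits: list[str],
                survival_threshold: float = 0.5) -> float:
    """Share of accepted suggestions since mostly rewritten or reverted."""
    regretted = sum(
        survival_fraction(repo, sha) < survival_threshold
        for sha in suggestion_commits
    )
    return regretted / len(suggestion_commits)
```

Run it only against suggestions older than your settling window (say, two weeks), or freshly accepted code will look deceptively permanent.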
These gaps do not invalidate the frameworks. They define what to add.
A practical metric stack for AI-era engineering
The metric stack I find most useful — built on top of DORA and SPACE, extended for AI — has four layers. Each measures something the layer above cannot.
Layer 1: Delivery (DORA, unchanged). Deployment frequency, lead time for changes, change failure rate, mean time to restore, plus reliability. These tell you whether your engineering organization actually ships and stays up. They are the outcome floor. If AI adoption is not moving these, it is not moving production results — regardless of what the activity dashboards say.
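If you want Layer 1 without buying anything, the four classic metrics fall out of two event streams you almost certainly already have: deploy records and incident records. A minimal sketch; the record fields are illustrative and map onto whatever your CI/CD and incident tooling actually exports.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# Hypothetical event records; field names are illustrative and should be
# mapped onto your own CI/CD and incident systems.
@dataclass
class Deploy:
    deployed_at: datetime
    first_commit_at: datetime  # earliest commit included in the release
    caused_failure: bool       # later linked to a post-deploy incident

@dataclass
class Incident:
    opened_at: datetime
    restored_at: datetime

def dora_metrics(deploys: list[Deploy], incidents: list[Incident],
                 window_days: int = 30) -> dict[str, float]:
    """The four classic DORA metrics over a trailing window."""
    def hours(delta):
        return delta.total_seconds() / 3600

    return {
        "deploys_per_week": round(len(deploys) / (window_days / 7), 1),
        "median_lead_time_hours": round(
            median(hours(d.deployed_at - d.first_commit_at) for d in deploys), 1),
        "change_failure_rate": round(
            sum(d.caused_failure for d in deploys) / len(deploys), 3),
        "median_time_to_restore_hours": round(
            median(hours(i.restored_at - i.opened_at) for i in incidents), 1),
    }
```

Medians rather than means keep one pathological release or outage from swamping the quarter.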
Layer 2: Quality (extended). Standard quality metrics — escaped defects, post-deploy incidents, P0/P1 rates — plus AI-specific additions: rework rate within a defined window, suggestion acceptance and regret rates, code-review effort per PR (does AI-generated code increase the review burden?), and the share of incidents traced to AI-authored changes. This is the layer that surfaces the “shipping faster but breaking more” pattern.
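Of those additions, rework rate is the one you can compute from git alone. A coarse, file-level sketch: of all lines changed in the trailing window, what share landed in files already touched earlier in the same window? The 21-day window is an illustrative default, and file-level granularity undercounts true line-level rework, but the trend is the signal.

```python
import subprocess

# Coarse rework-rate sketch over a trailing window. True line-level rework
# needs blame analysis; re-touched files are a cheap proxy that trends well.

def rework_rate(repo: str, window_days: int = 21) -> float:
    log = subprocess.run(
        ["git", "-C", repo, "log", f"--since={window_days} days ago",
         "--numstat", "--format=", "--no-merges"],
        capture_output=True, text=True, check=True,
    ).stdout
    # git log emits newest-first; reverse so "seen" means "touched earlier"
    seen: set[str] = set()
    total = rework = 0
    for line in reversed(log.splitlines()):
        parts = line.split("\t")
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skips blank separators and binary files
        changed = int(parts[0]) + int(parts[1])
        path = parts[2]
        total += changed
        if path in seen:
            rework += changed
        seen.add(path)
    return rework / total if total else 0.0
```

Split the same computation by the origin label from the earlier sketch and you can see whether AI-authored changes get reworked more often than human-authored ones.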
Layer 3: Leverage (new). What share of a developer’s working time goes to high-leverage work — design, architecture, debugging, code review, mentoring — versus mechanical work AI now handles. This is where SPACE’s Efficiency dimension gets concrete. Measure it with quarterly time-use surveys, sampled diary studies, or analysis of calendar and PR data. The goal is not to micro-manage time allocation; it is to verify that AI is letting senior engineers do more senior work, not less.
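The arithmetic is trivial once you have categories; the hard part is collecting honest samples. A sketch over diary-study entries, where the category taxonomy and the sample week are invented for illustration:

```python
from collections import Counter

# Which categories count as "high-leverage" is a judgment call your org
# should make once and keep stable; this taxonomy is illustrative.
HIGH_LEVERAGE = {"design", "architecture", "debugging", "code_review", "mentoring"}

def leverage_share(entries: list[tuple[str, float]]) -> float:
    """entries: (category, hours) pairs from a time-use survey or diary study."""
    hours_by_category: Counter = Counter()
    for category, hours in entries:
        hours_by_category[category] += hours
    total = sum(hours_by_category.values())
    high = sum(h for c, h in hours_by_category.items() if c in HIGH_LEVERAGE)
    return high / total if total else 0.0

week = [("design", 6.0), ("prompting_and_fixing_ai_output", 9.5),
        ("debugging", 4.0), ("meetings", 5.5), ("code_review", 3.0)]
print(f"high-leverage share: {leverage_share(week):.0%}")  # -> 46%
```

Watch the quarterly trend at the team level; the absolute number means little, the direction means a lot.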
Layer 4: Sentiment (essential, not optional). A quarterly developer-experience survey covering perceived productivity, perceived quality of AI tools, friction points, and burnout signals. A Developer Experience Index-style survey (or any well-constructed instrument) is fine. What matters is running it consistently, treating the trend line rather than the absolute number as the signal, and acting on results visibly enough that developers keep responding honestly.
A practical principle ties the layers together: any time a metric in one layer improves, check that its counterpart in the other layers did not get worse. PRs per engineer up, change failure rate up too? You have made things worse. Throughput up, sentiment down? You are burning out the team. The discipline of counterbalanced metrics is what keeps productivity measurement from quietly becoming productivity theater.
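That discipline is mechanizable. A sketch that compares two metric snapshots, say consecutive quarters, against explicitly declared counterweights; the pairs and the five-percent tolerance are illustrative, and it assumes each counterweight is a higher-is-worse metric.

```python
# Each pair names a metric you want to rise and the counterweight that must
# not rise with it. The point is that counterweights are checked
# automatically, not remembered heroically.
COUNTERBALANCED_PAIRS = [
    ("prs_per_engineer", "change_failure_rate"),
    ("deploys_per_week", "post_deploy_incidents"),
    ("ai_suggestion_acceptance", "regret_rate"),
]

def counterbalance_flags(prev: dict[str, float], curr: dict[str, float],
                         tolerance: float = 0.05) -> list[str]:
    """Flag any apparent gain whose counterweight degraded alongside it."""
    flags = []
    for gain, guard in COUNTERBALANCED_PAIRS:
        if not all(k in prev and k in curr for k in (gain, guard)):
            continue
        gain_up = curr[gain] > prev[gain] * (1 + tolerance)
        guard_worse = curr[guard] > prev[guard] * (1 + tolerance)
        if gain_up and guard_worse:
            flags.append(f"{gain} rose, but so did {guard}: not a real win")
    return flags

q1 = {"prs_per_engineer": 9.0, "change_failure_rate": 0.040}
q2 = {"prs_per_engineer": 11.5, "change_failure_rate": 0.070}
print(counterbalance_flags(q1, q2))
# ['prs_per_engineer rose, but so did change_failure_rate: not a real win']
```

Wire it into the same job that refreshes the dashboard, so a flagged pair surfaces the day it appears, not at the quarterly review.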
How to avoid the measurement traps
A short list of the failure modes I see most often when engineering leaders try to quantify AI’s impact:
- Per-individual metrics. Diffs per engineer, suggestions accepted per developer, lines per day: any individual-level count gets gamed within weeks. Every published framework — DORA, SPACE, Core 4 — explicitly warns against this. Measure at the team or organization level. Always.
- Vanity adoption metrics. “Ninety percent of developers use Copilot” tells you nothing about whether it is working. Adoption is a necessary precondition; impact is the actual question. The 2025 DORA report’s central warning was exactly this.
- Believing self-reported speedup. Developers in the METR study, after experiencing a measured nineteen-percent slowdown, still believed they had been sped up by twenty percent. Self-reported productivity is a wellbeing signal, not a productivity measurement.
- Ignoring the system. If lead time is not moving, the problem is almost never the AI tool. It is the deploy pipeline, the code-review queue, the test suite, the on-call burden, the staging environment. AI accelerates writing code; everything else still has to keep up. The DORA finding here is unambiguous.
- Single-metric thinking. Goodhart’s law applies harder than usual in this domain. Whatever metric you reward, your organization will optimize, including by routing around its intent. Counterbalanced metric pairs are the single most reliable defense.
The engineering leaders who come out of the next two years with credible answers to “how is AI changing our productivity?” will not be the ones with the most sophisticated dashboard. They will be the ones who kept the old frameworks honest, extended them carefully where AI genuinely changed the work, and never confused throughput with outcomes or feelings with facts. That is the same discipline that distinguished good engineering organizations before AI. It distinguishes them more now.