The Agent Code Explosion Is Breaking Your CI. Here's How to Adapt.
AI agents commit 30-60 times per session. Your CI was built for 15 per day. A diagnostic for the CI bottlenecks specific to AI-generated changes.
TL;DR
Your CI pipeline was designed for the cadence of human developers. It is now serving the cadence of AI agents, which is roughly five to twenty times higher. Most of the friction you are feeling is at that interface, and most of it is fixable — if you know where to look.
Walk into any engineering organization that has adopted AI coding agents seriously, and the same pattern shows up. The agents are working. Engineers report shipping faster, building features they could not have built alone, automating tasks that used to take days. Then the CI/CD platform engineer pulls up the bill and the conversation changes. Build minutes are up by a factor of three to ten. Cache hit rates have collapsed. The CI queue is permanently backed up. Flaky tests, which the team had been managing, have become unmanageable.
This is not a vendor problem. This is not a configuration problem. This is structural: the CI/CD pipeline was sized, tuned, and operationally tolerated for the cadence of human developers, and a single AI coding agent generates changes at a cadence that is qualitatively different. Google’s 2025 DORA report called this the “mirror effect”: AI adoption amplifies whatever was already true about your delivery system. If the system was strong, AI multiplies output. If the system was already strained, AI exposes the strain and makes it untenable. CI is the layer where most teams discover which side of that line they are on.
The good news: the bottlenecks are predictable, the patterns to adapt are well-understood, and the investment cost is bounded. This post is the diagnostic and the playbook.
What broke: the scale of the change
The most important thing to internalize is how much the cadence shifted. A coding agent working on a complex feature can produce thirty to sixty commits in a few hours; the historical baseline for a human developer was closer to five to fifteen builds per day. That’s not a small delta. It is a step change in volume that hits every layer of the CI pipeline simultaneously.
| Dimension | Pre-AI baseline (human developer) | Post-AI reality (agent-augmented) | Where it hits CI |
|---|---|---|---|
| Commits per developer-day | 5-15 | 30-60 in an agent session | Build queue depth |
| PR size | Historical norm | Faros AI 2025: +154%; their 2026 dataset: +51% | Test runtime, review burden |
| PR throughput | Steady | Faros: +98% more PRs merged | Pipeline concurrency limits |
| Code surface touched | Localized | Broader, often cross-cutting refactors | Cache invalidation, test selection |
| Time of submission | Working hours | Continuous, including agent overnight runs | Capacity planning |
| Reviewer load | Predictable | Discontinuous, bursty | Merge train stalls |
Two things are doing work in this table. The volume is higher across the board — more commits, larger PRs, more concurrent PRs, more around-the-clock activity. More importantly, the shape of the changes is different. Human developers tend to make localized changes; AI agents are more comfortable making broader, cross-cutting refactors that touch many files, invalidate larger swaths of build cache, and trigger more downstream rebuilds. The pipeline did not just see more traffic; it saw traffic of a different character, which is why scaling up CI capacity does not by itself solve the problem.
A specific data point: in April 2026, GitHub paused new Copilot signups, saying in its own statement that “agentic workflows have fundamentally changed Copilot’s compute demands” and that “it’s now common for a handful of requests to incur costs that exceed the plan price.” If the company that ships Copilot found that agentic workloads broke its own pricing model, the pricing your CI vendor offered you eighteen months ago is probably also under stress.
The DORA 2025 finding that completes the picture: AI adoption correlates with both higher throughput and higher instability. Individual output rises sharply. Organizational delivery flattens, because the gains get absorbed by the delivery system the new output is being force-fed into. CI is the most visible layer of that system.
The CI bottlenecks specific to AI-generated changes
A standard CI pipeline has roughly four layers where work happens: source-change detection, build, test, and merge orchestration. AI agents stress each one differently. It is worth being specific about which bottleneck you are hitting, because the fixes for each are different.
Test execution. This is where most teams feel the pain first. Larger PRs touching more files mean more tests need to run, and the naive test runner runs everything. If your test suite takes thirty minutes for a normal PR, an agent’s typical PR — fifty percent to one hundred fifty percent larger by the Faros data — easily pushes that to forty-five minutes or an hour. Multiply by five to ten times more PRs and you have a queue problem. Most teams discover their test-selection logic was tuned for the human-scale workload and falls over at agent scale.
Build cache. AI agents are more comfortable touching shared dependencies, common utility modules, and cross-cutting interfaces than humans are. Each such change invalidates more of the build cache than a typical human PR would. The cache hit rate, which had been sitting at eighty or ninety percent, can drop into the fifties or sixties when agent activity ramps up. Build times go up proportionally.
Pipeline concurrency. Most CI platforms cap concurrent pipelines per repository or organization. An agent submitting six PRs in an hour will saturate that cap and force the queue to back up. Subsequent human PRs sit behind agent PRs, latency for everything degrades, and the team feels slower despite the agent’s individual speed.
Merge train integrity. Tools like GitHub’s merge queue, Bors, and similar serializers were designed assuming a manageable number of PRs to land per day. At agent throughput, the serializer itself becomes a bottleneck. PRs in the queue go stale, get rebased against a constantly moving HEAD, fail their post-rebase tests, and have to be re-tested — burning more CI capacity without delivering more merges.
Diagnostic: which bottleneck do you have?
The bottlenecks compound, which makes them hard to disentangle. The cheapest way to triage is to look at which metric is moving the most relative to its pre-AI baseline.
| Symptom you’re seeing | Likely bottleneck | First fix to try |
|---|---|---|
| CI bill spiked, queue depth normal | Test runtime per PR | Affected-test selection (Bazel, Nx, Turborepo, or custom) |
| Queue depth high, individual builds normal | Pipeline concurrency cap | Raise concurrency limits; tier agent vs human queues |
| Build times long, test times normal | Cache invalidation | Tighter cache scoping; content-hash keys not branch keys |
| PRs stale before they merge | Merge train at saturation | Batch-merge agent PRs; restrict agent submissions to windows |
| Flaky tests dominate | Test reliability under load | Quarantine flakies aggressively; retry deterministically |
| Reviewers can’t keep up | Review-stage bottleneck | Agent PR size limits; structured PR descriptions |
| All of the above | Cadence mismatch | Step back and design for agent-scale, not human-scale |
The honest read of this table is that most teams have several of these bottlenecks at once and try to fix them in the wrong order. The order that tends to work: start with cache scoping (cheapest, highest leverage), then test selection (significant work, significant payoff), then concurrency tiering (operational, not engineering), then merge-train redesign (high-effort, only worth doing when the first three are not enough). The expensive mistake is to simply scale up CI capacity first: it helps in the short term but masks the underlying inefficiency, which gets worse as agent adoption grows.
Patterns for adapting
The five patterns that consistently work, in roughly the order most teams should tackle them.
Affected-test selection. The single highest-leverage change. Stop running every test on every PR; figure out which tests are affected by the actual code change and run only those. Bazel does this natively. Nx and Turborepo do it for JavaScript monorepos. Custom solutions built on top of language-specific tooling work as well. Done right, this reduces test runtime per PR by sixty to ninety percent on average, with disproportionate impact on agent PRs because agent PRs tend to touch the same set of files repeatedly across iterations of the same feature.
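The Bazel/Nx/Turborepo route derives affected targets from the build graph; below is a minimal sketch of the custom route instead, assuming a convention-based layout where `src/foo/bar.py` is covered by `tests/foo/test_bar.py`. The layout, fallback rules, and base branch are illustrative, not a real tool.

```python
import subprocess
from pathlib import Path

# Illustrative sketch: map changed files to the test files that cover them.
# Assumes src/foo/bar.py is covered by tests/foo/test_bar.py; real tools
# (Bazel, Nx, Turborepo) derive this from the dependency graph instead.

def changed_files(base: str = "origin/main") -> list[Path]:
    """Files touched by the current branch relative to the merge base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [Path(line) for line in out.splitlines() if line]

def affected_tests(files: list[Path]) -> set[Path]:
    """Naive file-to-test mapping with a conservative fallback."""
    tests: set[Path] = set()
    for f in files:
        if f.parts and f.parts[0] == "src" and f.suffix == ".py":
            candidate = Path("tests", *f.parts[1:-1], f"test_{f.name}")
            if candidate.exists():
                tests.add(candidate)
            else:
                return {Path("tests")}   # unknown coverage: run the full suite
        elif f.suffix in {".md", ".rst"}:
            continue                     # docs-only changes skip tests
        else:
            return {Path("tests")}       # build files, configs, etc.: run everything
    return tests

if __name__ == "__main__":
    targets = affected_tests(changed_files())
    print(" ".join(str(t) for t in sorted(targets)) or "nothing to test")
```

In CI the printed list would be handed to the test runner; the important property is the conservative fallback: anything the mapping cannot account for runs the full suite.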
Cache scoping. A surprising amount of pain comes from build caches keyed on branch name or PR number rather than content hash. Content-hashed caches survive across branches, PRs, and merges; branch-keyed caches do not. The fix is operationally simple but requires careful design: figure out what actually goes into each cache key, hash the relevant inputs, and use that as the key. The cache hit rate that comes back is usually dramatic — the difference between fifty and ninety percent on the same workload.
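A sketch of what content-hashed keying can look like, assuming the inputs that matter are the toolchain version, the lockfile, and the package's own sources. The input set and file names here are illustrative; the right key is whatever actually determines the build output.

```python
import hashlib
from pathlib import Path

# Illustrative sketch: derive a cache key from the content that determines
# the build output, rather than from the branch or PR number. The specific
# input files below are assumptions; use whatever your build actually reads.

def hash_file(h: "hashlib._Hash", path: Path) -> None:
    h.update(path.as_posix().encode())        # include the path so renames miss
    h.update(path.read_bytes())

def cache_key(package_dir: str, toolchain_version: str) -> str:
    h = hashlib.sha256()
    h.update(toolchain_version.encode())      # compiler / node / JDK version
    for lockfile in ("package-lock.json", "poetry.lock", "Cargo.lock"):
        p = Path(lockfile)
        if p.exists():
            hash_file(h, p)                   # dependency set
    for src in sorted(Path(package_dir).rglob("*")):
        if src.is_file() and ".git" not in src.parts:
            hash_file(h, src)                 # the package's own sources
    return h.hexdigest()[:16]

print(cache_key("src/payments", toolchain_version="python-3.12"))
```

Two branches with identical inputs resolve to the same key, so an agent's PR can reuse the cache a human's PR populated an hour earlier.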
Tiered concurrency. Stop treating agent-submitted PRs and human-submitted PRs as the same priority class. Agent PRs can wait an extra few minutes; a human waiting on their PR cannot. Most CI platforms support concurrency tiers; use them. The platforms that don’t can be wrapped with a thin queue that does the routing. The cultural benefit of this — humans feel faster even when the system is under load — is often larger than the technical benefit.
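For platforms without native priority classes, the thin queue can be as simple as a two-tier heap in front of whatever call actually starts a pipeline. A sketch of the routing logic; the tiers and PR numbers are illustrative.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Illustrative sketch of a two-tier dispatcher: human-authored PRs always
# dequeue ahead of agent-authored ones, and nothing is starved, because the
# tie-breaker within a tier is submission order.

HUMAN, AGENT = 0, 1   # lower number = higher priority

@dataclass(order=True)
class QueuedBuild:
    priority: int
    sequence: int
    pr_number: int = field(compare=False)

class TieredQueue:
    def __init__(self) -> None:
        self._heap: list[QueuedBuild] = []
        self._counter = itertools.count()

    def submit(self, pr_number: int, author_is_agent: bool) -> None:
        tier = AGENT if author_is_agent else HUMAN
        heapq.heappush(self._heap, QueuedBuild(tier, next(self._counter), pr_number))

    def drain(self, free_runners: int) -> list[int]:
        """Hand out at most `free_runners` builds, humans first."""
        started = []
        while self._heap and len(started) < free_runners:
            started.append(heapq.heappop(self._heap).pr_number)
        return started

q = TieredQueue()
q.submit(101, author_is_agent=True)
q.submit(102, author_is_agent=True)
q.submit(103, author_is_agent=False)
print(q.drain(free_runners=2))   # [103, 101]: the human PR jumps the agent backlog
```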
Batched and windowed agent submissions. A team running multiple agents in parallel does not need every agent’s PR to merge immediately. Batching agent submissions into windows (every fifteen minutes, every hour, end of day) reduces concurrency contention and makes the merge queue’s job tractable. Humans get the synchronous merge experience; agents get the batched experience. Both are appropriate for their respective workflows.
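A sketch of the windowing policy itself, assuming agent PRs are marked ready and picked up by a scheduled job; the fifteen-minute window, timestamps, and PR numbers are all illustrative.

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch: group agent PRs into fixed submission windows.
# The window length is a policy choice; the data below is example data.

WINDOW = timedelta(minutes=15)

def batch_by_window(prs: list[tuple[int, datetime]], start: datetime) -> list[list[int]]:
    """Group (pr_number, ready_at) pairs into consecutive windows."""
    batches: dict[int, list[int]] = {}
    for pr, ready_at in prs:
        slot = int((ready_at - start) // WINDOW)
        batches.setdefault(slot, []).append(pr)
    return [batches[k] for k in sorted(batches)]

start = datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc)
agent_prs = [
    (201, start + timedelta(minutes=3)),
    (202, start + timedelta(minutes=9)),
    (203, start + timedelta(minutes=22)),
]
# Two batches: [201, 202] lands in the 09:15 window, [203] in the 09:30 window.
print(batch_by_window(agent_prs, start))
```

If each batch is then landed as a single merge-queue entry, the queue sees one entry per window instead of one per agent PR.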
Agent restraint, configured upstream. The least technical but most underused pattern. Most coding agents can be configured to commit less often, to combine related changes into single commits, to run tests locally before pushing, and to skip CI for documentation-only changes. A few minutes of configuration in AGENTS.md or the equivalent often reduces CI load by twenty to forty percent without any infrastructure change. The agents are not malicious; they are doing what they were configured to do, and the default configuration was tuned for visibility, not efficiency.
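What that configuration looks like is plain instructions. A sketch of the kind of entries that target CI load specifically; the `make test-affected` target and the size threshold are illustrative, and `[skip ci]` is the common skip convention on most CI platforms, but check yours.

```markdown
## CI etiquette

- Combine related changes into a single commit before pushing; do not push after every file edit.
- Run the affected tests locally (e.g. `make test-affected`) and fix failures before opening or updating a PR.
- For documentation-only changes, add `[skip ci]` to the commit message.
- Keep PRs under roughly 400 changed lines; split larger work into separate PRs.
```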
The investment-versus-restraint decision is the strategic question. Affected-test selection and cache scoping are durable engineering investments that pay back regardless of how AI adoption evolves; spend the time. Agent restraint is configuration work that pays back immediately but needs maintenance. Tiered concurrency is operational and cheap. Batched submissions are policy.
Metrics that catch the problem early
A short list of metrics worth tracking specifically because they will catch the agent-code-explosion problem before it becomes a crisis.
The simplest leading indicator is CI minutes per merged PR. If this is climbing month over month, your pipeline is becoming less efficient. A healthy AI-augmented team should see CI minutes per PR holding roughly flat or declining as test selection and cache improvements compound. If it’s climbing, agent volume is outrunning the optimizations.
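The computation is simple once the data is exported; a sketch, with the run durations, months, and merged-PR counts all made up for illustration.

```python
from collections import defaultdict

# Illustrative sketch: CI minutes per merged PR, month over month.
# The records are assumed to come from your CI provider's usage export;
# the numbers below are example data.

pipeline_runs = [  # (month, minutes) per pipeline run
    ("2026-03", 22), ("2026-03", 35),
    ("2026-04", 41), ("2026-04", 58), ("2026-04", 63),
]
merged_prs = {"2026-03": 2, "2026-04": 2}

minutes = defaultdict(int)
for month, mins in pipeline_runs:
    minutes[month] += mins

for month in sorted(minutes):
    per_pr = minutes[month] / merged_prs[month]
    print(f"{month}: {per_pr:.1f} CI minutes per merged PR")
# 28.5 in March to 81.0 in April is the kind of month-over-month climb to alarm on.
```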
Cache hit rate, segmented by PR type or author, will tell you whether your cache strategy still works under agent load. The drop is usually the first sign that the cache keys are not scoped right for how agents touch code.
Merge queue latency — time from “PR is ready to merge” to “PR is actually merged” — captures the merge-train saturation problem. It is also the metric engineers feel most directly in their day-to-day work.
Test flakiness rate under agent load. Flaky tests that the team was managing at human throughput often become unmanageable at agent throughput because the absolute number of flaky failures scales with PR volume. Track flakiness separately for agent PRs and human PRs; the difference will tell you whether your test suite is robust to the volume change.
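A sketch of the segmentation, treating a failure that passes on plain retry as a flake (a common operational definition) and assuming each run record carries the author type; the records and numbers are made up.

```python
from collections import Counter

# Illustrative sketch: flakiness rate segmented by PR author type.
# A failure that passes on retry is counted as a flake; the records below
# are example data, not a real export format.

test_runs = [
    # (author_type, outcome) where outcome is "pass", "fail", or "flake"
    ("human", "pass"), ("human", "flake"), ("human", "pass"),
    ("agent", "pass"), ("agent", "flake"), ("agent", "flake"), ("agent", "pass"),
]

totals, flakes = Counter(), Counter()
for author, outcome in test_runs:
    totals[author] += 1
    if outcome == "flake":
        flakes[author] += 1

for author in sorted(totals):
    rate = flakes[author] / totals[author]
    print(f"{author}: {rate:.0%} of runs flaked ({flakes[author]}/{totals[author]})")
# If the agent rate is meaningfully higher, the suite is sensitive to the kinds
# of changes agents make, not just to the raw volume.
```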
Finally, CI cost per developer-month. The bill is the most direct expression of the problem. If this is climbing and developer count is flat, you have a CI scaling problem that is consuming the productivity gains AI was supposed to deliver.
The deeper lesson behind all of this is that the CI pipeline is the most visible place where the rest of the engineering system has to catch up with what AI agents can now do. The teams that treat CI investment as catch-up work — necessary, deliberate, ongoing — will absorb the agent productivity gains and turn them into actual shipped value. The teams that treat CI as a fixed cost center to be defended at the existing budget will find their agent productivity gains absorbed by infrastructure friction, which is the DORA mirror effect playing out at the pipeline layer. Neither outcome is the agents’ fault. The agents are doing the work. The question is whether the rest of the system was ready for them to do that much of it, that fast, and the answer for most teams in mid-2026 is honestly no — but it could be yes within a quarter of deliberate work. The diagnostic above is where that quarter starts.