The growing problem with letting AI write production code

AI-generated code is shipping faster than ever, and for many engineering shops, that’s exactly the point: speed, scale and the seductive promise of a copilot that can scaffold features and fix bugs while humans focus on higher-value work. But a wave of fresh research suggests that the tradeoff is real, measurable and growing. A new analysis from CodeRabbit—working from 470 real-world open-source pull requests—found AI-authored changes carried roughly 1.7 times more issues than human-written ones, with disproportionate increases in logic defects, readability problems and security findings. Those numbers aren’t small noise on the margins; they point to systemic patterns in where current models fail when asked to touch real, production code.

The picture that emerges from CodeRabbit’s dataset is not merely about quantity. The platform reports higher severity across AI-generated changes: critical and major problems show up more often, and certain failure modes cluster in predictable places. Logic and correctness issues were among the most common, while readability and maintainability problems spiked—meaning the code often looks plausible at a glance but hides fragility. In practical terms, that translates into more business-logic bugs, insecure object references, poor error handling and performance regressions that evade superficial checks. CodeRabbit’s breakdown also shows especially sharp increases in naming inconsistencies, concurrency and dependency correctness failures—areas where context, domain knowledge and long-range reasoning matter.


Security researchers and enterprise security teams are already sounding alarms. Apiiro, which examined large enterprise repositories, reported that the same AI assistants that accelerate commits can multiply security problems by an order of magnitude—arguing that developers using AI produced far more vulnerabilities than those who did not. That’s not just an academic metric: when a team’s AppSec workflows and scanning tools aren’t equipped to handle a tenfold surge in findings, triage queues balloon, false negatives proliferate and real risks can slip into production. For security teams that were already running lean, AI-driven velocity becomes an attack-surface multiplier.

If these numbers clash with the rosy marketing for copilots, they also fit into a broader, uncomfortable industry story: tooling alone doesn’t equal value. Bain & Company’s 2025 technology report observes that early deployments of generative AI in coding produced “unremarkable” savings unless organizations reengineer processes around the toolchain. Teams that see real gains, Bain argues, don’t merely plug a copilot into existing workflows—they redesign the lifecycle (tests, CI/CD, integration, product planning) so that faster code generation doesn’t simply create new bottlenecks elsewhere. The bottom line: raw throughput without changed processes often leaves organizations paying for speed with quality.

There’s also evidence that developer experience matters in surprising ways. A randomized trial from Model Evaluation & Threat Research (METR) found experienced open-source contributors were, on average, slower when allowed to use early-2025 AI tools—about 19 percent slower on certain tasks—largely because time was diverted into reviewing, debugging and reconciling machine suggestions with domain knowledge. Developers in the study consistently reported feeling faster when using AI, even as measured performance declined; that mismatch—perceived speed versus actual throughput—helps explain why teams that adopt copilots without new guardrails can suddenly spend more of their time triaging the machine’s work.

Put together, the findings sketch a chain reaction. AI generates more code; more code means more reviews; more reviews mean more reviewer burnout and more cognitive load on engineers; and that extra burden shifts human effort away from architecture and maintenance toward supervising noisy generators. Code review, historically a final safety net, is morphing into the primary defense (an expensive and imperfect one) against a rising tide of model-introduced defects. The consequence is a subtle inversion of labor: fewer people writing original, well-considered code and more people policing machine drafts.

Where do the failures come from? Engineers who study model outputs point to a few recurring causes. First, models are probabilistic—they produce the most likely continuation, not the provably correct one—so they confidently emit code that is superficially plausible but wrong in edge cases. Second, hallucination and context collapse mean an assistant trained broadly on public code can suggest patterns that don’t match a project’s invariants or security posture. Third, prompt and context hygiene matter: short, generic prompts produce generic, brittle outputs; models need lineage and domain signals (tests, project-specific docs, policy-as-code) to behave reliably. Those are solvable problems, but they require investment in process, tooling and guardrails—not just a license key.

The practical implications for engineering leaders are concrete. CodeRabbit and others recommend layered mitigations: stricter CI checks, AI-aware PR templates, security policies enforced as code, model prompts tuned to repository context, and automated testing that prioritizes areas where models historically fail (error handling, edge cases, concurrency). Some firms are experimenting with “AI staging,” where machine-generated patches run through an additional automated gate that focuses on known model weaknesses before human eyes even open the PR. These process interventions act less like Band-Aids and more like a re-architecting of workflows that makes AI an assistive, not an autonomous, actor.
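To make the idea of an automated gate concrete, here is a minimal sketch in Python. It is not CodeRabbit's method or any vendor's product: the rule names, the `Finding` record and the `stage_patch` and `gate_passes` helpers are all hypothetical, and the three checks (bare `except`, handlers that silently swallow errors, and calls to `eval` or `exec`) are simply illustrative stand-ins for the error-handling and security weaknesses described above. A real gate would likely combine many more rules with tests and scanners.

```python
import ast
from dataclasses import dataclass


@dataclass
class Finding:
    """One flagged location in a patch (hypothetical record type)."""
    line: int
    rule: str
    message: str


class WeaknessVisitor(ast.NodeVisitor):
    """Walks a syntax tree and records a few illustrative weak patterns."""

    def __init__(self) -> None:
        self.findings: list[Finding] = []

    def visit_ExceptHandler(self, node: ast.ExceptHandler) -> None:
        # A bare `except:` catches everything, including KeyboardInterrupt.
        if node.type is None:
            self.findings.append(
                Finding(node.lineno, "bare-except", "catch a specific exception"))
        # A handler whose whole body is `pass` hides the failure entirely.
        if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
            self.findings.append(
                Finding(node.lineno, "swallowed-error", "handle or log the error"))
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        # Direct calls to eval/exec are a common source of injection bugs.
        if isinstance(node.func, ast.Name) and node.func.id in {"eval", "exec"}:
            self.findings.append(
                Finding(node.lineno, "dynamic-exec", "avoid eval/exec on input"))
        self.generic_visit(node)


def stage_patch(source: str) -> list[Finding]:
    """Return findings for one Python source string, ordered by line."""
    visitor = WeaknessVisitor()
    visitor.visit(ast.parse(source))
    return sorted(visitor.findings, key=lambda f: f.line)


def gate_passes(source: str) -> bool:
    """The gate passes only when no rule fires."""
    return not stage_patch(source)


# A made-up patch of the kind the gate is meant to stop before human review.
SAMPLE_PATCH = '''
def load_user(raw):
    try:
        return eval(raw)
    except:
        pass
'''

if __name__ == "__main__":
    for finding in stage_patch(SAMPLE_PATCH):
        print(f"line {finding.line}: [{finding.rule}] {finding.message}")
```

Run as a CI step ahead of human review, a check like this would fail the sample patch on all three rules, so a reviewer never has to spend attention on defects a machine can catch cheaply. The design choice worth noting is that the gate inspects the parsed syntax tree rather than matching text, which avoids false hits inside comments and strings.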

There’s a cultural wrinkle as well. Several engineers describe a risk that matters less to executives chasing KPIs than to long-term maintainers: the possibility of a cohort of practitioners who are expert at orchestrating tools but weaker at deep reasoning about systems. If teams come to rely on AI to scaffold business logic and error handling, the tacit knowledge needed to maintain and evolve those systems may atrophy, compounding technical debt. The remedy isn’t to ban models but to refocus human roles: deeper design, threat modeling, and cross-checking—tasks where people still outperform current models.

That doesn’t mean AI has no role. Where it helps (catching trivial typos, producing boilerplate, suggesting documentation or making surface-level refactors), it can shave drudgery and speed small wins. The challenge is aligning expectations: understanding the domains where models excel and, crucially, where they do not. Leaders who treat copilots as noisy collaborators that require explicit constraints, provenance and ongoing auditing are the ones likely to get a net benefit; those who treat AI as a drop-in productivity fix risk trading shipping speed for a growing maintenance tax.

In short, the early verdict is mixed. Generative AI for code is real, powerful and increasingly embedded. But a new generation of empirical studies suggests that without process change, safety rails and focused human oversight, the rush to scale AI in engineering workflows will translate directly into more defects, greater security exposure and higher long-term costs. The sensible strategy for teams is not to reject AI, but to redesign work so that developers spend less time fixing machine mistakes and more time on the reasoning that still distinguishes humans from the models they build.