Depending on whom you ask, AI-assisted coding is either giving software developers an unprecedented boost in productivity or churning out vast amounts of shoddy code that distracts them and stores up serious long-term maintenance problems for software projects.
The trouble is that, right now, it’s hard to tell which view is correct.
As major technology firms invest billions into large language models (LLMs), coding has been heralded as the leading application of the technology. Both Microsoft CEO Satya Nadella and Google CEO Sundar Pichai assert that nearly a quarter of the code developed by their companies is now generated by AI. Furthermore, in March, Dario Amodei, CEO of Anthropic, projected that in the next six months, AI would be responsible for writing 90% of all code. This presents a compelling and clear use case: code is a language, we have a high demand for it, and producing it manually is costly. It’s also straightforward to assess whether it functions correctly—simply run the program to see if it works.
This article is part of MIT Technology Review’s Hype Correction initiative, a series aiming to reassess expectations regarding the capabilities of AI and its future trajectory.
Business leaders intrigued by the potential to eliminate human limitations are urging engineers to embrace an AI-augmented future. However, after interviewing over 30 developers, technology leaders, analysts, and researchers, MIT Technology Review discovered that the reality is not as clear-cut as it may appear.
For many developers working directly with the technology, the initial excitement is fading as they run up against the tools’ limits. And a growing body of research suggests the claimed productivity gains may be illusory.
The rapid pace of advancements adds complexity to the situation. Regular releases of new models mean the capabilities and peculiarities of these tools are in constant flux. Moreover, their effectiveness often varies depending on the tasks assigned and the organizational frameworks surrounding them. This creates a situation where developers must navigate confusing discrepancies between their expectations and reality.
Is it the best of times or the worst of times (to echo Dickens) for AI coding? Perhaps it’s both.
A rapidly evolving domain
AI coding tools are hard to miss these days. There’s a bewildering selection of products available, both from model creators like Anthropic, OpenAI, and Google and from companies such as Cursor and Windsurf, which provide these models within refined code-editing applications. According to Stack Overflow’s 2025 Developer Survey, these tools are rapidly being adopted, with 65% of developers utilizing them at least weekly.
AI coding tools first appeared around 2016, but they gained significant momentum with the advent of LLMs. Early versions merely served as autocomplete aids for programmers, predicting the next lines of code. Nowadays, they can analyze entire codebases, edit files, debug issues, and even generate documentation to clarify how the code functions. All of this is guided by natural language queries through a chat interface.
“Agents”—self-sufficient LLM-driven coding tools that can execute a high-level plan and autonomously create entire programs—mark the latest advancement in AI coding. This innovation was made possible by the newest reasoning models, capable of addressing complex challenges stepwise and, crucially, accessing external tools to execute tasks. “This is what enables the model to code, rather than just discuss coding,” states Boris Cherny, leader of Claude Code, Anthropic’s coding agent.
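To make the idea concrete, here is a minimal, illustrative sketch of the loop such an agent runs. The `ScriptedModel` below stands in for a real reasoning model, and nothing here reflects any particular vendor’s implementation.

```python
from dataclasses import dataclass
import subprocess

@dataclass
class Step:
    tool: str | None = None   # tool to call next, or None when finished
    args: tuple = ()
    text: str = ""            # final answer once no tool is needed

# Stand-in for a real reasoning model: it simply replays a fixed script.
class ScriptedModel:
    def __init__(self, steps):
        self.steps = iter(steps)
    def complete(self, history):
        return next(self.steps)

# External tools are what let the model act rather than just talk.
TOOLS = {
    "run_shell": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def run_agent(task, model, max_steps=20):
    history = [("user", task)]
    for _ in range(max_steps):
        step = model.complete(history)          # model plans the next action
        if step.tool is None:                   # no tool call: task is done
            return step.text
        result = TOOLS[step.tool](*step.args)   # act, then feed results back
        history.append(("tool", result))
    return "step budget exhausted"

print(run_agent("list the files here", ScriptedModel([
    Step(tool="run_shell", args=("ls",)),
    Step(text="Done: listed the files."),
])))
```

The essential ingredients are the ones Cherny describes: a model that plans stepwise, plus tools that let it act on the world and see the results.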
These agents have achieved remarkable results on software engineering benchmarks—standard tests that evaluate model performance. When OpenAI launched the SWE-bench Verified benchmark in August 2024, providing a method to measure agents’ success in fixing actual bugs in open-source repositories, the leading model addressed only 33% of issues. A year later, top models consistently perform above 70%.
In February, Andrej Karpathy, a founding member of OpenAI and former AI director at Tesla, popularized the term “vibe coding” to describe an approach in which people describe software in everyday language and let AI write, refine, and debug the code. Social media is full of developers who have embraced the idea, claiming substantial productivity gains.
However, while some developers and organizations assert such productivity enhancements, tangible evidence remains mixed. Preliminary studies from GitHub, Google, and Microsoft—all providers of AI tools—showed developers completing tasks 20% to 55% faster. Yet a September report from Bain & Company characterized real-world savings as “unremarkable.”
Data from the developer analytics firm GitClear indicates that most engineers have been producing approximately 10% more durable code—code that isn’t deleted or rewritten within weeks—since 2022, likely owing to AI. However, this improvement has coincided with sharp declines in various indicators of code quality. Stack Overflow’s survey also noted a significant drop in trust and positive sentiment toward AI tools for the first time. In a particularly concerning revelation, a July study from the nonprofit Model Evaluation & Threat Research (METR) found that while seasoned developers perceived AI to make them 20% more efficient, objective tests indicated they were actually 19% slower.
Increasing disillusionment
For Mike Judge, principal developer at Substantial, the METR study hit home. An early adopter of AI tools, he had grown disheartened by their limitations and the meager productivity boost they delivered. “I kept complaining to colleagues, saying, ‘It’s helping me, but I can’t figure out how to make it genuinely useful,’” he says. “I always felt the AI was falling short on intelligence, but that maybe I could outsmart it with the right magic phrase.”
When a friend asked him, Judge estimated that the tools gave him about a 25% speed boost. So when similar estimates surfaced in the METR study, he decided to run his own test. Over six weeks, he predicted how long each task would take, flipped a coin to decide whether to use AI or code by hand, and timed himself. To his astonishment, AI slowed him down by a median of 21%, closely matching the METR findings.
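Judge’s protocol is simple enough to replicate. Below is a minimal sketch of the bookkeeping; the trial data is invented for illustration.

```python
import random
import statistics

# A coin flip assigns each task to an arm before work begins,
# as in Judge's experiment.
random.seed(1)
next_task_uses_ai = random.random() < 0.5

# Hypothetical log of (estimated_minutes, actual_minutes, used_ai).
trials = [
    (60, 55, False), (60, 82, True), (90, 85, False),
    (45, 66, True), (120, 112, False), (30, 41, True),
]

def median_ratio(rows):
    """Median of actual/estimated time; above 1.0 means slower than predicted."""
    return statistics.median(actual / est for est, actual, _ in rows)

with_ai = [t for t in trials if t[2]]
by_hand = [t for t in trials if not t[2]]
print(f"AI:     median actual/estimate = {median_ratio(with_ai):.2f}")
print(f"Manual: median actual/estimate = {median_ratio(by_hand):.2f}")
```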
That result prompted Judge to dig into the broader numbers. If these tools were genuinely making developers more productive, he reasoned, there should be a surge in new applications, website sign-ups, video games, and GitHub projects. He spent hours and hundreds of dollars examining publicly available data and found stagnation across the board.
“Shouldn’t this be trending upwards?” Judge questions. “Where’s the upward trend on any of these graphs? I thought everyone was extraordinarily productive.” His clear conclusion is that AI tools don’t provide much of a productivity boost for most developers.
Developers consulted by MIT Technology Review largely agree on where AI tools excel: generating “boilerplate code” (reusable snippets that recur across projects with minimal alteration), writing tests, fixing bugs, and explaining unfamiliar code to new developers. Many pointed out that AI helps overcome the “blank page problem” by offering an imperfect first draft that can kick-start a developer’s thinking. It also lets non-technical colleagues quickly prototype software features, relieving the burden on already stretched engineers.
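As a concrete (and hypothetical) example, the serialization glue below is the kind of boilerplate that is near-identical across countless projects, which is exactly why models reproduce it so reliably.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class User:
    id: int
    name: str
    email: str

    def to_json(self) -> str:
        """Serialize this record to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "User":
        """Rebuild a record from its JSON form."""
        return cls(**json.loads(raw))

# A round-trip check: the kind of test AI tools also write readily.
assert User.from_json(User(1, "Ada", "ada@example.com").to_json()).name == "Ada"
```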
These tasks can be monotonous, and developers typically appreciate delegating them. However, they only constitute a small fraction of an experienced engineer’s responsibilities. Regarding the more intricate challenges where engineers truly showcase their expertise, many developers indicated to MIT Technology Review that these tools encounter considerable obstacles.
Perhaps the most significant issue is that LLMs can hold only a limited amount of information in their “context window,” essentially their short-term memory. That makes it hard for them to analyze large codebases, and it causes them to lose track during longer tasks. “It becomes really shortsighted—it will only focus on what’s directly in front of it,” notes Judge. “And if you instruct it to complete multiple tasks, it might finish 11 of them but completely overlook the last one.”
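A toy sketch shows the mechanic, assuming the naive strategy of keeping only the most recent turns of a conversation (real systems count tokens rather than characters, but the effect is the same).

```python
def fit_into_window(messages, budget_chars=60):
    """Naive sliding window: keep only the newest messages that fit."""
    kept, used = [], 0
    for msg in reversed(messages):              # walk from newest to oldest
        if used + len(msg) > budget_chars:
            break                               # budget exhausted: stop keeping
        kept.insert(0, msg)
        used += len(msg)
    return kept

tasks = [f"task {i}: do thing {i}" for i in range(1, 13)]
print(fit_into_window(tasks))   # the early tasks have silently vanished
```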
The short attention span of LLMs creates difficulties for human developers too. A response generated by an LLM may appear functional in isolation, but software consists of many interlinked modules. If these are not built with the rest of the system in mind, the result can quickly become a convoluted, inconsistent codebase that’s hard for humans to understand and, more important, to maintain.
Developers have typically tackled this issue by adhering to conventions—informally defined coding guidelines that vary widely among projects and teams. “AI tends to not grasp the existing conventions within a repository,” states Bill Harding, the CEO of GitClear. “As a result, it’s very likely to propose its own slightly altered solution to a problem.”
The models are also prone to errors. Like all LLMs, coding models suffer from “hallucinations,” an inherent flaw in their design. But because the code they produce looks so polished, the mistakes can be hard to spot, explains James Liu, director of software engineering at Mediaocean, a tech company. Taken together, these shortcomings can make using the tools feel like pulling the lever on a one-armed bandit. “In some projects, you achieve a 20x improvement in speed or efficiency,” says Liu. “With others, it just fails miserably, and you spend an inordinate amount of time trying to coax it into producing what you want, and it simply won’t.”
Judge suspects this tendency is behind the common overestimation of productivity boosts among engineers. “You remember the big wins. You don’t recall spending hours just feeding tokens into the slot machine,” he comments.
This can be particularly costly when the developer isn’t familiar with the task at hand. Judge recalls trying to use AI to set up a Microsoft cloud service called Azure Functions, which he had never used before. He expected the job to take about two hours, but after nine hours he conceded defeat. “It kept leading me down these rabbit holes, and I didn’t know enough about the subject to tell it, ‘Hey, this is nonsensical,’” he recounts.
The debt starts to accumulate
Developers consistently navigate compromises between the speed of development and the maintainability of their code—creating what is termed “technical debt,” according to Geoffrey G. Parker, a professor at Dartmouth College. Each shortcut adds complexity, making the codebase more challenging to manage and accruing “interest” that must eventually be settled by restructuring the code. As this debt increases, adding new features and maintaining the software becomes progressively slower and more complicated.
Accumulating technical debt is unavoidable in most projects, but AI tools make it considerably easier for time-pressed engineers to take shortcuts, says GitClear’s Harding. And GitClear’s data indicates this is happening broadly: since 2022 the company has recorded a significant uptick in copy-pasted code, a sign that developers are increasingly reusing snippets, likely from AI suggestions, along with a marked decline in moved code, the reshuffling that happens when developers refactor and tidy up their codebase.
As models advance, the code they generate is becoming more verbose and complex, notes Tariq Shaukat, CEO of Sonar, which makes code-quality tools. While this reduces the number of obvious bugs and vulnerabilities, it increases the prevalence of “code smells”: subtler flaws that lead to maintenance problems and technical debt, he says.
Research by Sonar indicates that these issues represent over 90% of the problems detected in code produced by leading AI models. “Easy-to-spot issues are disappearing, leaving behind much more complex problems that take time to identify,” notes Shaukat. “That’s currently a major concern in this domain. You might find yourself lulled into a false sense of security.”
If AI tools complicate code maintenance, significant security repercussions could arise, warns Jessica Ji, a security researcher at Georgetown University. “As it becomes tougher to update and rectify issues, the likelihood of a codebase or specific code segments becoming insecure over time increases,” Ji cautions.
There are more specific security issues too, she adds. Researchers have uncovered a troubling category of hallucination in which models reference fictitious software packages in their code. Malicious actors can exploit this by publishing real packages under those invented names, seeded with malicious code that a model or developer may then unwittingly pull into their software.
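One cheap defense, sketched here for a hypothetical Python project, is to check that an AI-suggested dependency even exists before installing it. The JSON endpoint below is PyPI’s public API; existence alone proves nothing about safety, since attackers deliberately register commonly hallucinated names, but a 404 is an immediate red flag.

```python
import urllib.error
import urllib.request

def exists_on_pypi(package: str) -> bool:
    """Query PyPI's public JSON API; a 404 means no such package."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

for pkg in ["requests", "totally-made-up-package-xyz"]:
    verdict = "exists" if exists_on_pypi(pkg) else "NOT FOUND (possible hallucination)"
    print(f"{pkg}: {verdict}")
```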
LLMs are also susceptible to “data-poisoning attacks,” in which hackers lace the public datasets that models are trained on with data that alters a model’s behavior in undesirable ways, such as generating insecure code when prompted with specific phrases. A study published by Anthropic in October showed that as few as 250 malicious documents could plant this kind of vulnerability in an LLM, regardless of its size.
The converted
Despite these challenges, a reversal seems unlikely. “It’s probable that the days of manually writing every line of code are rapidly fading,” says Kyle Daigle, COO of GitHub, the Microsoft-owned code-hosting platform known for its AI-driven Copilot tool (not to be confused with Microsoft’s similarly named assistant).
The Stack Overflow report indicates that despite an escalating distrust of the technology, its usage has been consistently on the rise for the past three years. Erin Yepis, a senior analyst at Stack Overflow, notes that this demonstrates engineers are leveraging the tools with an awareness of the risks. The report also revealed that frequent users exhibit greater enthusiasm, and more than half of developers are not utilizing the latest coding agents, which may explain why many remain unimpressed with the technology.
The latest tools can be transformative. Trevor Dilley, CTO at Twenty20 Ideas, a software development agency, says he found some benefit in the autocomplete features of AI editors, but anything more complicated would “fail dramatically.” In March, however, during a family vacation, he assigned the newly released Claude Code to assist with one of his hobby projects. It completed a four-hour task in two minutes, producing code of higher quality than he could have written.
“I was astonished,” he says. “For me, that was the defining moment. There’s no returning from this.” Dilley has since co-founded a startup named DevSwarm, which is developing software capable of coordinating multiple agents to work simultaneously on a software project.
The challenge, according to Armin Ronacher, an acclaimed open-source developer, is that the learning curve for these tools is shallow but long: progress comes slowly and takes sustained effort. Until March he had been unimpressed by AI tools, but after leaving his position at Sentry in April to launch a startup, he began experimenting with agents. “I spent a significant number of months just on this,” he says. “Now 90% of the code I produce is written by AI.”
Getting there took extensive trial and error to figure out which problems tend to trip the tools up and which tasks they can handle efficiently. Today’s models can manage most coding jobs given appropriate guardrails, says Ronacher, but those guardrails can be very specific to the task and project.
To maximize the use of these tools, developers must relinquish control over single lines of code and concentrate on the broader software architecture, advises Nico Westerdale, CTO of the veterinary staffing firm IndeVets. He recently constructed a 100,000-line data science platform predominantly by prompting models instead of coding manually.
Westerdale’s method begins with an extensive dialogue with the model to draw up a detailed blueprint of what to build and how to build it. He then steers the model through each phase. It rarely gets everything right on the first attempt and needs continual correction, but if it sticks to well-defined design principles, the models can produce high-quality, easily maintainable code, he says. He scrutinizes every line and maintains that the result is on par with his best work: “I’ve found it incredibly revolutionary. It’s also frustrating, difficult, and a whole new way of working, and we’re only just starting to adapt to it.”
While individual developers are discovering effective ways to utilize these tools, achieving consistent results across an extensive engineering team introduces substantially greater challenges. AI tools amplify both the strengths and weaknesses of your engineering culture, observes Ryan J. Salva, senior director of product management at Google. When equipped with robust processes, clear coding practices, and well-articulated best practices, these tools excel.
However, in disorganized development processes, they will only exacerbate existing problems. It’s vital to codify institutional knowledge so the models can leverage it successfully. “We need to put in considerable effort to help build context and extract the implicit knowledge residing within our organization,” he states.
Coinbase, the cryptocurrency exchange, has been outspoken about its implementation of AI tools. CEO Brian Armstrong made headlines in August by announcing that he had terminated employees who were reluctant to adopt AI tools. However, the company’s head of platform, Rob Witoff, shared with MIT Technology Review that while they have experienced sizable productivity gains in specific areas, the effects have been inconsistent. In simpler tasks like codebase restructuring and test writing, AI workflows have achieved speed increases of up to 90%. However, the advantages are more modest for other activities, and the disruption caused by revamping established processes often counters the accelerated coding speed, according to Witoff.
One factor affecting this is that AI tools enable junior developers to produce significantly more code. As is common in nearly all engineering teams, this new code must be reviewed by others, generally more experienced developers, to catch errors and ensure quality. However, the excessive volume of code being generated quickly overwhelms midlevel staff’s ability to review changes. “This is the cycle that seems to repeat almost monthly, where we automate a new task lower in the hierarchy, which creates increased pressure higher up,” he explains. “Then we start considering automating that higher-level function.”
Furthermore, developers typically spend only 20% to 40% of their time actually coding, says Jue Wang, a partner at Bain, so even a considerable speedup in coding doesn’t translate into dramatic overall gains. The rest of their time goes to analyzing software issues, addressing customer feedback, product strategy, and administrative work. For significant efficiency improvements, companies may need to apply generative AI to those other processes as well, Wang notes, and that effort is still in its early days.
Swift transformation
Working with agents represents a substantial shift from traditional practices, so it’s unsurprising that companies are facing some initial hurdles. These products are also evolving rapidly. “Every few months, the model advances, resulting in a substantial change in its coding capabilities, necessitating recalibration,” states Cherny from Anthropic.
In June, for instance, Anthropic launched a built-in planning feature for Claude, subsequently adopted by other providers. In October, the company enabled Claude to ask users questions when it requires additional context or faces multiple viable options, an enhancement that Cherny claims reduces its inclination to simply presume which direction is the best way forward.
Most notably, Anthropic has introduced features that improve Claude’s ability to manage its own context. When it approaches the limits of its working memory, it summarizes essential details and uses them to seed a new context window, effectively giving it an “infinite” working memory, explains Cherny. Claude can also invoke sub-agents to tackle smaller tasks, sparing it from holding every aspect of the project in mind at once. The company claims that its latest model, Claude Sonnet 4.5, can now code autonomously for more than 30 hours without significant degradation.
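The compaction idea can be sketched in a few lines. This is an illustration of the general technique, not Anthropic’s implementation; `summarize` stands in for a call to the model itself.

```python
def count_tokens(messages):
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return sum(len(m) for m in messages) // 4

def compact(messages, summarize, limit=100_000, headroom=0.8, keep_recent=5):
    """Near the context limit, condense older turns into a summary
    and start a fresh window seeded with it."""
    if count_tokens(messages) < limit * headroom:
        return messages                             # still fits; do nothing
    summary = summarize(messages[:-keep_recent])    # a model call in reality
    return [f"Summary of work so far: {summary}"] + messages[-keep_recent:]

history = [f"turn {i}: edited file_{i}.py" for i in range(50)]
history = compact(history, summarize=lambda ms: f"{len(ms)} earlier edits", limit=200)
print(history[0])   # Summary of work so far: 45 earlier edits
```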
New approaches to software development could sidestep the flaws of coding agents altogether. MIT professor Max Tegmark has proposed a concept he calls “vericoding,” which might let agents produce entirely bug-free code from a natural-language description. It builds on a technique known as “formal verification,” in which developers write a precise mathematical specification of their software and then prove that the code meets it. The approach is used in critical fields such as flight control systems and cryptographic libraries, but it is expensive and time-consuming, which has limited its wider adoption.
Rapid advancements in LLMs’ mathematical abilities have made it conceivable for models to generate not just software but also the mathematical proof confirming it’s bug-free, asserts Tegmark. “You merely provide the specification, and the AI returns provably correct code,” he explains. “You don’t even have to look at or interact with the code.”
In tests involving approximately 2,000 vericoding problems in Dafny—a language tailored for formal verification—the best LLMs resolved over 60%, according to non-peer-reviewed research by Tegmark’s team. This was accomplished using standard LLMs, and Tegmark anticipates that focused training for vericoding could rapidly boost performance.
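The shape of the idea is easiest to see in a toy example. The following is written in Lean 4 (not Dafny) and is far simpler than the benchmark problems, but it shows the contract: the function ships with a machine-checked theorem, so if the proof compiles, no one needs to trust the code by eye.

```lean
-- A candidate function, as an AI might generate it.
def absVal (n : Int) : Int := if n < 0 then -n else n

-- The specification: the result is non-negative and equals n or -n.
-- If this theorem compiles, the property holds for every integer.
theorem absVal_spec (n : Int) :
    absVal n ≥ 0 ∧ (absVal n = n ∨ absVal n = -n) := by
  unfold absVal
  split
  · exact ⟨by omega, Or.inr rfl⟩   -- case n < 0: result is -n
  · exact ⟨by omega, Or.inl rfl⟩   -- case n ≥ 0: result is n
```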
Counterintuitively, the sheer speed at which AI generates code may actually ease maintenance problems. Alex Worden, principal engineer at Intuit, observes that maintenance often becomes complicated because engineers reuse components across projects, creating a web of dependencies in which altering one component triggers cascading changes throughout the codebase. Reusing code used to save developers time, but in a world where AI can generate hundreds of lines in seconds, that pressure fades, he argues.
Instead, he advocates for “disposable code,” where each module is independently generated by AI without concern for adherence to design patterns or conventions. These modules are connected through APIs—rules that allow components to request information or services from one another. Each component operates independently of the rest of the codebase, facilitating easy removal and replacement without broader repercussions, according to Worden.
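A minimal sketch of the pattern, with hypothetical names: callers depend only on a declared interface, so the implementation behind it can be discarded and regenerated without touching anything else.

```python
from typing import Protocol

class RateSource(Protocol):
    """The only contract callers ever see."""
    def rate(self, currency: str) -> float: ...

# One independently generated, disposable implementation...
class FixedRates:
    def rate(self, currency: str) -> float:
        return {"EUR": 1.1, "GBP": 1.3}[currency]

# ...used by code that never imports its internals.
def to_usd(amount: float, currency: str, source: RateSource) -> float:
    return amount * source.rate(currency)

print(to_usd(100, "EUR", FixedRates()))  # swap FixedRates out at will
```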
“The industry still worries about human maintenance of AI-generated code,” he says. “I question how long humans will care about or examine code.”
A shrinking talent pool
For now, though, humans are still needed to understand and maintain the code that projects are built on. One of the most troubling side effects of AI tools may be a shrinking pool of people able to do that job.
Initial indications suggest that concerns about the job-displacing effects of AI are valid. A recent study from Stanford University revealed that employment among software developers aged 22 to 25 decreased nearly 20% between 2022 and 2025, coinciding with the emergence of AI-driven coding tools.
Experienced developers may face challenges as well. Luciano Nooijen, an engineer with Companion Group, a video game infrastructure developer, heavily utilized AI tools in his primary work, where they were provided for free. However, upon beginning a side project without access to these tools, he found himself grappling with tasks that once came easily. “I felt so foolish because what used to be instinctual became manual, sometimes even burdensome,” says Nooijen.
Just as athletes must regularly practice fundamental drills, he posits that the only way to maintain coding instincts is to consistently engage with the foundational work. That’s why he’s largely moved away from AI tools, although he acknowledges deeper motivations are also involved.
Some developers, including Nooijen, express concerns that AI tools are eroding the aspects of their jobs they cherish. “I entered software engineering because I enjoy working with computers. I relish creating machines that perform tasks at my command,” remarks Nooijen. “It simply isn’t enjoyable to sit back and watch my work unfold without my active involvement.”