
MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.
Every time OpenAI, Google, or Anthropic puts out a new state-of-the-art large language model, the AI world holds its breath. It exhales only when METR, an AI research nonprofit whose acronym stands for “Model Evaluation & Threat Research,” releases an update to a now-famous graph that has shaped the AI conversation since it debuted in March of last year. The graph shows that certain AI capabilities are improving at an exponential rate, with new model releases outpacing even that already remarkable trend.
That was certainly the case with Claude Opus 4.5, the latest version of Anthropic’s most powerful model, released in late November. In December, METR announced that Opus 4.5 appeared able to independently complete a task that would take a human roughly five hours, an improvement over even the exponential trend’s prediction. One Anthropic safety researcher tweeted that he would change his research focus in light of those results; another employee of the company simply wrote, “mom come pick me up i’m scared.”
But the truth is messier than those dramatic reactions suggest. For one thing, METR’s estimates of what individual models can do come with substantial error bars. As METR itself noted on X, Opus 4.5 might typically manage only tasks that take humans around two hours, or it might be able to handle tasks that take humans as long as 20 hours; given the uncertainties inherent in the method, it was impossible to say for sure.
“There are many ways people are interpreting the graph too optimistically,” says Sydney Von Arx, a member of METR’s technical team.
More fundamentally, the METR graph doesn’t measure AI capabilities in general, nor does it claim to. To build the graph, METR tests models primarily on coding tasks and gauges each task’s difficulty by measuring or estimating how long it takes humans to complete, an approach that not everyone accepts. Claude Opus 4.5 may be able to complete certain tasks that take humans five hours, but that doesn’t mean it is anywhere close to replacing human workers.
METR was founded to evaluate the risks posed by advanced AI systems. Though it is best known for the exponential trend graph, it has also partnered with AI companies to evaluate their systems in greater depth and has published plenty of other independent research, including a much-discussed July 2025 study suggesting that AI coding assistants might actually be slowing software engineers down.
Still, the exponential graph has made METR’s reputation, and the organization seems to have a complicated relationship with the often feverish response to it. In January, Thomas Kwa, one of the lead authors of the paper that introduced the graph, wrote a blog post responding to some criticisms and clarifying its limitations, and METR is currently working on a more extensive FAQ document. But Kwa isn’t optimistic that these efforts will change the narrative much. “I think the hype machine will essentially, regardless of our actions, strip away all the caveats,” he says.
Nonetheless, the METR team does believe the graph says something meaningful about the trajectory of AI development. “You should definitely not link your life to this graph,” says Von Arx. “But additionally,” she adds, “I’m confident that this trend will persist.”
Part of the trouble with the METR graph is that it is considerably more complicated than it looks. The x-axis is simple enough: it tracks each model’s release date. The y-axis is where things get tricky. It records each model’s “time horizon,” a novel metric that METR devised and one that, according to Kwa and Von Arx, is frequently misunderstood.
To understand what model time horizons really mean, it helps to understand the substantial work METR did to compute them. First, the METR team assembled a set of tasks ranging from quick multiple-choice questions to elaborate coding challenges, all of them at least loosely related to software engineering. They then had human coders attempt most of those tasks and timed how long they took. In this way, they assigned each task a human baseline time. Some tasks took the experts just seconds; others demanded hours.
When METR tested large language models on the task suite, it found that advanced models could handle the quick tasks with ease, but as the tasks took humans progressively longer to complete, the models’ accuracy fell off. From a model’s performance, the researchers worked out the point on the human-time scale at which the model would successfully complete about 50% of the tasks. That point is the model’s time horizon.
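To make the idea concrete, here is a minimal sketch of how such a 50% point could be estimated from per-task results: fit a curve of success probability against the logarithm of human completion time, then solve for the time at which that curve crosses 50%. The data, the logistic-fit approach, and the variable names below are illustrative assumptions for this example, not METR’s actual code or figures.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data: human baseline time (minutes) for each task,
# and whether the model completed it successfully (1) or not (0).
human_minutes = np.array([0.5, 1, 2, 5, 10, 30, 60, 120, 240, 480])
model_success = np.array([1,   1, 1, 1, 1,  1,  0,  1,   0,   0])

def neg_log_likelihood(params, x, y):
    """Negative log-likelihood of a logistic model of success vs. log(time)."""
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

log_t = np.log(human_minutes)
fit = minimize(neg_log_likelihood, x0=[0.0, -1.0], args=(log_t, model_success))
a, b = fit.x

# The 50% time horizon is where the fitted success probability crosses 0.5,
# i.e. where a + b * log(t) = 0, so t = exp(-a / b).
time_horizon_minutes = np.exp(-a / b)
print(f"Estimated 50% time horizon: {time_horizon_minutes:.0f} human-minutes")
```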
All of that is laid out in the blog post and academic paper METR published alongside the original time-horizon plot. But the graph often circulates on social media without that context, so the real meaning of the time horizon metric can get lost. One common misreading is that the numbers on the plot’s y-axis (about five hours for Claude Opus 4.5, for example) show how long models can operate autonomously. They don’t. They show how long humans take on tasks that a model can complete successfully. Kwa has run into this misconception so often that he addressed it at the very top of his recent blog post, and when asked what he would add to the versions of the plot circulating online, he said he would attach the word “human” wherever a task completion time is mentioned.
Though the time horizon concept is convoluted and widely misunderstood, there is an underlying logic to it: a model with a one-hour time horizon might automate certain small pieces of a software engineer’s job, while a model with a 40-hour horizon could potentially finish days of work on its own. Still, some experts question whether the time humans take on tasks is a good yardstick for AI capability. “I don’t think it’s automatically a fact that just because something requires more time, it’s a tougher task,” says Inioluwa Deborah Raji, a PhD student at UC Berkeley who focuses on model evaluation.
Von Arx says that she, too, was initially skeptical that time horizon was the right metric to use. What won her over was seeing the results of her and her colleagues’ analysis. When they calculated the 50% time horizon for all the major models available in early 2025 and plotted them, they saw that the time horizons of the best models were growing over time, and that the pace of progress was itself accelerating. Every seven months or so, the time horizon doubled, meaning that the most capable models could complete tasks that took humans nine seconds in mid-2020, four minutes in early 2023, and 40 minutes in late 2024. “I can theorize all I want about whether or not it is logical, but the trend exists,” Von Arx says.
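For a sense of the arithmetic behind that doubling, here is a back-of-the-envelope sketch. The roughly seven-month doubling time and the 40-minute late-2024 figure come from the trend described above; the exact anchor date, the projection dates, and the function name are illustrative assumptions, not METR’s forecasts.

```python
from datetime import date

# Back-of-the-envelope projection of an exponential time-horizon trend,
# assuming the horizon doubles roughly every 7 months.
DOUBLING_DAYS = 7 * 30.44          # ~7 months expressed in days
anchor_date = date(2024, 11, 1)    # illustrative "late 2024" anchor
anchor_horizon_minutes = 40.0      # ~40-minute horizon at the anchor

def projected_horizon(on: date) -> float:
    """Time horizon (in human-minutes) implied by the doubling trend on a given date."""
    elapsed_days = (on - anchor_date).days
    return anchor_horizon_minutes * 2 ** (elapsed_days / DOUBLING_DAYS)

for d in [date(2025, 11, 1), date(2026, 11, 1)]:
    print(d, f"{projected_horizon(d) / 60:.1f} human-hours")
```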
That striking pattern is what made the METR graph such a sensation. Many people first encountered it through the viral sci-fi scenario AI 2027, which blends fiction with quantitative forecasts to argue that superintelligent AI could wipe out humanity by 2030. The authors of AI 2027 based some of their predictions on the METR graph and cited it extensively. In Von Arx’s words, “It’s somewhat strange when many people know about your work through this quite biased interpretation.”
Of course, plenty of people cite the METR graph without imagining mass catastrophe. For some AI boosters, the exponential trend suggests that AI will soon usher in an era of enormous economic growth. The venture capital firm Sequoia Capital, for example, recently published a post titled “2026: This is AGI,” drawing on the METR graph to argue that AI capable of working as an employee or contractor is coming soon. “The provocation was essentially, ‘What will you do when your plans are scheduled in centuries?’” says Sonya Huang, a general partner at Sequoia and one of the post’s authors.
But just because a model reaches a one-hour time horizon on the METR plot doesn’t mean it can replace an hour of human labor in the real world. For one thing, the tasks used to evaluate the models don’t capture the complexity and messiness of real jobs. In their original study, Kwa, Von Arx, and their colleagues rated the “messiness” of each task using criteria such as whether the model knows how it is being scored and whether it can easily start over after a mistake (for messy tasks, the answer to both questions is no). They found that models perform markedly worse on messy tasks, though the overall trend of improvement holds for both messy and tidy tasks.
What’s more, even the most complex tasks METR considers say little about AI’s ability to take on most jobs, because the graph is built overwhelmingly from coding-related tasks. “A model may become more skilled at coding, but it won’t magically enhance its performance in other areas,” says Daniel Kang, an assistant professor of computer science at the University of Illinois Urbana-Champaign. In a follow-up study, Kwa and his colleagues found that time horizons for tasks in other fields also appear to follow exponential trends, but that work was far less formal.
Despite these limitations, the group’s work is widely respected. “The METR study stands out as one of the most meticulously designed investigations in the literature for this type of work,” Kang says. Even Gary Marcus, a former NYU professor and well-known critic of LLMs, called much of the research behind the plot “excellent” in a blog post.
Some people will no doubt keep reading the METR graph as a prophecy of AI-driven doom, but it is really something far more ordinary: a carefully constructed scientific tool that puts concrete numbers to people’s intuitive sense of AI progress. As METR’s own staff will readily agree, the graph is far from a perfect instrument. But in a new and fast-moving field, even imperfect tools can be enormously valuable.
“This involves many individuals striving to create a metric under considerable constraints. It is fundamentally flawed in various aspects,” Von Arx says. “Yet, I also believe it is one of the finest of its kind.”