
For many years, the assessment of artificial intelligence has revolved around whether machines surpass human abilities. Whether it’s chess, advanced mathematics, coding, or essay writing, the effectiveness of AI models and applications is measured against that of individual humans performing the same tasks.
This perspective is enticing: Comparing AI and humans on specific problems with definitive right or wrong outcomes is straightforward to standardize, compare, and refine. It yields rankings and generates news headlines.
However, there is a complication: AI is rarely used in the way it is benchmarked. Researchers and industry have begun to improve benchmarking by moving beyond static tests toward more dynamic evaluation methods, but these advances address only part of the problem, because they still assess AI’s performance outside the human teams and organizational workflows where its practical value is ultimately realized.
While AI is evaluated in isolation at the task level, it operates in messy, multifaceted environments where it often interacts with many people, and its effectiveness (or ineffectiveness) becomes apparent only over prolonged use. This mismatch leads us to misjudge AI’s capabilities, overlook systemic risks, and misestimate its economic and social impact.
To address this, we need to move from narrow methods to benchmarks that evaluate how AI systems perform over extended periods within human teams, workflows, and organizations. Since 2022, I have studied real-world AI implementation in small enterprises and in the health, humanitarian, nonprofit, and higher-education sectors across the UK, the United States, and Asia, along with key AI design ecosystems in London and Silicon Valley. Based on that work, I propose a different framework, which I call HAIC benchmarks: Human–AI, Context-Specific Evaluation.
What happens when AI fails
For governments and businesses, AI benchmark results appear more objective than vendor claims. They are vital to deciding whether an AI model or application is “good enough” for real-world deployment. Imagine an AI model that posts remarkable technical scores on the most advanced benchmarks: 98% accuracy, groundbreaking speed, captivating outputs. On the strength of those results, organizations may decide to adopt the model, committing significant financial and technical resources to acquiring and integrating it.
But once it is adopted, the gap between benchmark and real-world performance quickly becomes apparent. Consider, for instance, the array of FDA-approved AI models that can interpret medical scans faster and more accurately than expert radiologists. In hospital radiology departments from central California to the outskirts of London, I watched staff use top-ranked radiology AI applications. Time and again, they needed extra time to reconcile the AI’s outputs with hospital-specific reporting standards and national regulatory requirements. A tool that looked like a productivity booster in isolation introduced delays in practice.
It soon became clear that the benchmark assessments used for medical AI models do not reflect how medical decisions are actually made. Hospitals rely on multidisciplinary teams of radiologists, oncologists, physicists, and nurses who review patients together. Treatment planning is rarely a single fixed decision; it evolves as new information surfaces over days or weeks. Decisions often emerge through discussion and compromise among professional standards, patient preferences, and the shared goal of long-term patient welfare. Little wonder, then, that even highly rated AI models struggle to deliver the expected performance when confronted with the intricate, collaborative processes of real clinical care.
The same pattern shows up across the sectors I study: once embedded in real working environments, even AI models that excel on standardized assessments often fail to deliver on their promise.
When high benchmark scores fail to translate into real-world effectiveness, even the most lauded AI ends up in what I call the “AI graveyard.” The losses are substantial: time, effort, and money are wasted. Repeated often enough, this erodes organizational trust in AI, and in critical domains like healthcare it can undermine broader public confidence in the technology as well.
Because existing benchmarks give only a limited and potentially misleading picture of an AI model’s readiness for real-world use, regulatory gaps open up: oversight ends up anchored to metrics that do not reflect reality. It also leaves organizations and governments to shoulder the risks of trialing AI in sensitive real-world settings, often with inadequate resources and support.
How to build better tests
To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will operate. The key questions: Can AI act as a productive team member? And can it create lasting, shared value?
In my research on AI deployment across sectors, I have seen several organizations deliberately and experimentally move toward the HAIC benchmarks I advocate.
HAIC benchmarks reshape current benchmarking in four significant ways:
1. Transitioning from individual and single-task performance to team and workflow performance (altering the unit of analysis)
2. Moving from one-off assessments with right/wrong outcomes to long-term impacts (broadening the time horizon)
3. Shifting from correctness and speed to organizational results, quality of coordination, and detectability of errors (expanding outcome measures)
4. Shifting from isolated outputs to upstream and downstream effects (capturing system effects)
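To make these four shifts concrete, here is a minimal, purely hypothetical sketch of what a HAIC-style observation record could look like. The field names, scoring choices, and the detectability calculation are my own illustrative assumptions, not a schema proposed by any of the organizations I studied:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical illustration only; the fields below are assumptions for this sketch.
@dataclass
class HAICObservation:
    team_id: str                     # unit of analysis: the team/workflow, not the model
    observed_on: date                # longitudinal: many observations over months
    workflow_stage: str              # e.g., "triage" or "multidisciplinary review"
    coordination_quality: float      # stakeholder-agreed rating, assumed here to be 0-1
    errors_surfaced_by_team: int     # AI errors the team caught itself
    errors_found_later: int          # AI errors discovered only downstream
    downstream_delay_minutes: float  # upstream/downstream system effects

def error_detectability(observations: list[HAICObservation]) -> float:
    """Share of AI errors that the human team itself caught, across all observations."""
    caught = sum(o.errors_surfaced_by_team for o in observations)
    total = caught + sum(o.errors_found_later for o in observations)
    return caught / total if total else 1.0
```

The point of the sketch is simply that each record treats a team working through a workflow over time, not a single model answer scored right or wrong, as the thing being measured.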
In the organizations where this approach has emerged and begun to take hold, the first step is changing the unit of analysis.
For instance, in one UK hospital system between 2021 and 2024, the focus shifted from whether a medical AI application improves diagnostic accuracy to how integrating AI into the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically compared coordination and deliberation among teams working with and without the AI. Multiple stakeholders (inside and outside the hospital) defined metrics such as how the AI affects collective reasoning, whether it surfaces overlooked considerations, whether it helps or hinders coordination, and whether it alters established risk and compliance practices.
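As a rough illustration of that kind of comparison (the metric names and numbers below are my own placeholders, not the hospital’s actual measures), one could contrast team-level outcomes for reviews run with and without the AI tool:

```python
from statistics import mean

# Illustrative sketch with invented records; metric names are placeholders only.
def mean_difference(with_ai: list[dict], without_ai: list[dict], metric: str) -> float:
    """Mean metric value for AI-assisted reviews minus reviews without AI."""
    return mean(r[metric] for r in with_ai) - mean(r[metric] for r in without_ai)

with_ai = [{"coordination": 0.72, "deliberation_minutes": 38},
           {"coordination": 0.65, "deliberation_minutes": 44}]
without_ai = [{"coordination": 0.70, "deliberation_minutes": 35},
              {"coordination": 0.68, "deliberation_minutes": 33}]

print(mean_difference(with_ai, without_ai, "coordination"))          # about -0.005
print(mean_difference(with_ai, without_ai, "deliberation_minutes"))  # 7.0
```

Even this toy comparison shows how a tool could leave coordination roughly unchanged while lengthening deliberation, a trade-off that task-level accuracy scores never register.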
This shift is essential. It matters most in high-stakes settings where system-level effects outweigh task-level accuracy. It also has economic implications: it may help recalibrate inflated expectations of sweeping productivity gains, which have so far rested largely on the promise of improving individual task performance.
Once this groundwork is laid, HAIC benchmarking can begin to incorporate the dimension of time.
Contemporary benchmarks resemble school exams: one-off, standardized tests of accuracy. Real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously within real workflows, under supervision, with feedback loops and accountability measures. Performance is judged over time and in context, because competence is relational. If AI systems are meant to work alongside professionals, their effects should be assessed longitudinally, reflecting how performance unfolds across repeated interactions.
I observed this facet of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within actual workflows, with particular attention to how detectable its errors were, that is, how easily human teams could recognize and correct them. This long-term “record of error detectability” let the participating organizations design and test context-specific safeguards that sustained trust in the system despite the inevitability of occasional AI errors.
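A minimal sketch of how such a longitudinal record might be tallied (the field names and data here are invented for illustration; they are not the case study’s actual instruments):

```python
from collections import defaultdict

# Hypothetical monthly tally of error detectability: for each month, the share of
# AI errors that the human team caught itself. Data and field names are invented.
def detectability_by_month(incidents: list[dict]) -> dict[str, float]:
    caught, total = defaultdict(int), defaultdict(int)
    for incident in incidents:
        month = incident["date"][:7]               # "YYYY-MM"
        total[month] += 1
        caught[month] += incident["caught_by_team"]
    return {m: caught[m] / total[m] for m in sorted(total)}

incidents = [
    {"date": "2023-01-14", "caught_by_team": True},
    {"date": "2023-01-29", "caught_by_team": False},
    {"date": "2023-02-07", "caught_by_team": True},
]
print(detectability_by_month(incidents))  # {'2023-01': 0.5, '2023-02': 1.0}
```

A detectability curve that rises month over month would be one sign that safeguards are working; a flat or falling curve would argue for redesigning them.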
A longer time horizon also reveals the system-level repercussions that short-term benchmarks miss. An AI tool may outperform a single physician on a narrow diagnostic task yet fail to improve collaborative decision-making. It might even introduce systemic distortions: prematurely anchoring teams on plausible but incomplete answers, increasing individuals’ cognitive workloads, or creating downstream inefficiencies that cancel out any gains in speed at the point where the AI is applied. These ripple effects, largely invisible to current benchmarks, are crucial to understanding actual impact.
To be sure, the HAIC approach will make benchmarking more complicated, more resource-intensive, and harder to standardize. But continuing to assess AI in sanitized settings, detached from the workplace, will leave us with a distorted picture of what it can and cannot do for us. To deploy AI responsibly in real-world contexts, we must measure what truly matters: not merely what a model can accomplish on its own, but what it enables, or undermines, when woven into human interactions and teamwork in real-world settings.
Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.