AI benchmarks are broken. Here’s what we need instead.

For decades, artificial intelligence has been evaluated by asking whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks.

This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines. 

But there’s a problem: AI is almost never used in the way it is benchmarked. Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That’s because they still evaluate AI’s performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds.

While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AI’s capabilities, overlooking systemic risks, and misjudging its economic and social consequences.

To mitigate this, it’s time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. I have studied real-world AI deployment since 2022 in small businesses and health, humanitarian, nonprofit, and higher-education organizations in the UK, the United States, and Asia, as well as within leading AI design ecosystems in London and Silicon Valley. I propose a different approach, which I call HAIC benchmarks: Human–AI, Context-Specific Evaluation.

What happens when AI fails 

For governments and businesses, AI benchmark scores appear more objective than vendor claims. They’re a critical part of determining whether an AI model or application is “good enough” for real-world deployment. Imagine an AI model that achieves impressive technical scores on the most cutting-edge benchmarks—98% accuracy, groundbreaking speed, compelling outputs. On the strength of these results, organizations may decide to adopt the model, committing sizable financial and technical resources to purchasing and integrating it. 

But then, once it’s adopted, the gap between benchmark and real-world performance quickly becomes visible. For example, take the swathe of FDA-approved AI models that can read medical scans faster and more accurately than an expert radiologist. In the radiology units of hospitals from the heart of California to the outskirts of London, I witnessed staff using highly ranked radiology AI applications. Repeatedly, it took them extra time to interpret AI’s outputs alongside hospital-specific reporting standards and nation-specific regulatory requirements. What appeared as a productivity-enhancing AI tool when tested in a vacuum introduced delays in practice. 

It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams—radiologists, oncologists, physicists, nurses—who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care.

The same pattern emerges in my research across other sectors: When embedded within real-world work environments, even AI models that perform brilliantly on standardized tests don’t perform as promised. 

When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the “AI graveyard.” The costs are significant: Time, effort, and money end up being wasted. And over time, repeated experiences like this erode organizational confidence in AI and—in critical settings such as health—may erode broader public trust in the technology as well.

Current benchmarks provide only a partial and potentially misleading signal of an AI model’s readiness for real-world use. This creates regulatory blind spots: Oversight is shaped by metrics that do not reflect reality. It also leaves organizations and governments to shoulder the risks of testing AI in sensitive real-world settings, often with limited resources and support.

How to build better tests 

To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will be used. The critical questions: Can AI function as a productive participant within human teams? And can it generate sustained, collective value? 

Through my research on AI deployment across multiple sectors, I have seen a number of organizations already moving—deliberately and experimentally—toward the HAIC benchmarks I favor. 

HAIC benchmarks reframe current benchmarking in four ways (illustrated in the sketch after this list):

1. From individual and single-task performance to team and workflow performance (shifting the unit of analysis)

2. From one-off testing with right/wrong answers to long-term impacts (expanding the time horizon)

3. From correctness and speed to organizational outcomes, coordination quality, and error detectability (expanding outcome measures)

4. From isolated outputs to upstream and downstream consequences (system effects)
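
To make the four shifts concrete, here is a minimal Python sketch of what a HAIC-style evaluation record might capture. Every field and metric name below is a hypothetical illustration of my own, not a standard drawn from my case studies or from any existing benchmark suite.

from __future__ import annotations
from dataclasses import dataclass, field
from datetime import date

@dataclass
class HAICObservation:
    """One longitudinal observation of an AI system working inside a human team."""
    observed_on: date               # expanded time horizon: repeated observations, not one exam
    team_id: str                    # shifted unit of analysis: the team and workflow, not one user
    task_accuracy: float            # the traditional measure, kept for comparison
    coordination_score: float       # expanded outcomes: e.g., rated quality of team handoffs
    errors_introduced: int          # AI mistakes that entered the workflow
    errors_caught_by_team: int      # error detectability: how many the team noticed and corrected
    downstream_rework_hours: float  # system effects: extra work created elsewhere

@dataclass
class HAICRecord:
    """An evaluation record accumulated over months of use, not a one-off test score."""
    deployment_id: str
    observations: list[HAICObservation] = field(default_factory=list)

The point is not these particular fields but the shape of the record: It is indexed by team and by date, and it carries outcome measures and system effects that no single-task score can express.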

Across the organizations where this approach has emerged and started to be applied, the first step is shifting the unit of analysis. 

For example, in one UK hospital system between 2021 and 2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital compared coordination and deliberation in teams that used AI with teams that did not. Multiple stakeholders (within and outside the hospital) agreed on metrics such as how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices.

This shift is fundamental. It is especially important in high-stakes contexts, where system-level effects matter more than task-level accuracy. It also matters for the economy: It may help recalibrate inflated expectations of sweeping productivity gains, which so far rest largely on the promise of improving individual task performance.

Once that foundation is set, HAIC benchmarking can begin to take on the element of time. 

Today’s benchmarks resemble school exams—one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions. 

I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were—that is, how easily human teams could identify and correct them. This long-term “record of error detectability” meant the organizations involved could design and test context-specific guardrails to promote trust in the system, despite the inevitability of occasional AI mistakes.
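
As a hedged illustration of how such a record might be summarized, the sketch below computes a per-month error-detectability rate from a simple log and flags the months where extra guardrails might be worth piloting. The log format, the 0.8 threshold, and the function name are my own hypothetical choices, not the organization’s actual tooling.

from collections import defaultdict

def monthly_detectability(error_log, threshold=0.8):
    # error_log: iterable of (month, was_caught_by_team) pairs, e.g. ("2023-04", True).
    # Returns the per-month share of AI errors the team caught, plus the months
    # falling below the threshold, where guardrails such as double review or
    # mandatory sign-off might be tested.
    caught = defaultdict(int)
    total = defaultdict(int)
    for month, detected in error_log:
        total[month] += 1
        caught[month] += int(detected)
    rates = {month: caught[month] / total[month] for month in sorted(total)}
    flagged = [month for month, rate in rates.items() if rate < threshold]
    return rates, flagged

# Example: two logged AI errors in one month, one of which the team missed.
rates, flagged = monthly_detectability([("2023-04", True), ("2023-04", False)])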

A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people’s cognitive workloads, or generating downstream inefficiencies that offset any speed or efficiency gains at the point of the AI’s use. These knock-on effects—often invisible to current benchmarks—are central to understanding real impact.

The HAIC approach admittedly promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables—or undermines—when humans and teams in the real world work with it.

 Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.