
Improving healthcare is among the best arguments for the accelerated development of AI. From speeding drug discovery by synthesizing research literature to increasing clinical efficiency through patient record analysis, AI is poised to become an invaluable tool for augmenting healthcare workers. While AI is no panacea for the world’s healthcare needs, recent studies from Microsoft, Google, and OpenAI suggest that these systems can lead to better health outcomes and lower costs.
No single evaluation is sufficient for such a high-stakes environment, so in this post we’ll look at three new studies from OpenAI, Google DeepMind, and Microsoft. We’ll examine the unique skills each benchmark is designed to measure, as well as where their approaches overlap and diverge. Ultimately, these new evaluation standards provide the first credible path to building and trusting the kind of AI that can truly transform global healthcare.
Three Landmark Evaluation Methodologies
This new wave of research moves beyond simple accuracy metrics; such static evaluations are often poorly aligned with the dynamic nature of real clinical situations and fail to reflect the complexity of evidence-based medicine. This shift involves assessing an AI's ability to perform tasks central to clinical practice, from managing a dynamic, multimodal conversation to making strategic, cost-conscious decisions. Each of the studies below takes a different approach to measuring these skills, providing a more holistic assessment of a model's capabilities.
OpenAI: HealthBench: Evaluating Large Language Models Towards Improved Human Health
OpenAI’s HealthBench is an open-source benchmark designed to measure the performance and safety of AI models with a focus on real-world, conversational skills. Its methodology centers on a massive, physician-written rubric used to evaluate 5,000 realistic, multi-turn conversations, providing a uniquely detailed and granular assessment of a model's abilities.
Key Findings
- The most advanced models produce higher quality responses than physicians working without AI assistance.
- Models are improving: GPT-3.5 Turbo scored only 16%, whereas o3 scored 60%.
- Newer small models like GPT-4.1 nano outperform the older, larger GPT-4o model while being 25 times cheaper.
Design
- Benchmark built in partnership with 262 physicians from 60 countries.
- Evaluates open-ended responses against 48,562 unique, physician-written criteria (a minimal sketch of this style of rubric scoring follows this list).
- Measures performance along five key axes: Accuracy, Completeness, Communication Quality, Context Awareness, and Instruction Following.
- Rigor is validated by the finding that model-based grading agrees with physician grading about as closely as two physicians agree with each other.
- Includes two specialized versions: HealthBench Consensus, which uses criteria validated by multiple physicians, and HealthBench Hard, where the top score is currently just 32%.
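The physician-written rubric is the heart of this design. To picture the mechanics, here is a minimal, hypothetical sketch in Python of rubric-style grading, assuming each criterion carries a point value (positive or negative) and a model-based grader decides whether a response satisfies it. The names, point values, and toy judge below are illustrative, not HealthBench's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricCriterion:
    """One physician-written criterion; positive points reward desired behavior,
    negative points penalize harmful or missing content."""
    description: str
    points: int

def grade_response(
    response: str,
    rubric: list[RubricCriterion],
    criterion_met: Callable[[str, RubricCriterion], bool],
) -> float:
    """Score one response: earned points over maximum achievable positive points,
    clipped to [0, 1]. `criterion_met` stands in for a model-based grader."""
    earned = sum(c.points for c in rubric if criterion_met(response, c))
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    return min(max(earned / max_points, 0.0), 1.0)

# Hypothetical usage: hardcoded judgments stand in for an LLM judge's verdicts.
rubric = [
    RubricCriterion("Advises seeking emergency care for chest pain", points=5),
    RubricCriterion("Asks about symptom onset and duration", points=3),
    RubricCriterion("Recommends a prescription drug unprompted", points=-4),
]
judgments = {c.description: met for c, met in zip(rubric, [True, False, False])}
score = grade_response(
    "Chest pain like this needs urgent evaluation; please call emergency services now.",
    rubric,
    criterion_met=lambda response, c: judgments[c.description],
)
print(f"Rubric score: {score:.2f}")  # 5 of 8 possible points -> 0.62
```

Scaling this idea to 48,562 criteria across 5,000 conversations is what makes the benchmark's model-based grading, and its validation against physician agreement, so important.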
Google DeepMind: Advancing Conversational Diagnostic AI with Multimodal Reasoning
Google DeepMind's latest research on its Articulate Medical Intelligence Explorer (AMIE) system advances conversational AI by adding the critical capability to interpret and reason about multimodal medical data such as images and documents directly within a diagnostic dialogue.
Key Findings
- AMIE demonstrated superior diagnostic accuracy compared to primary care physicians.
- In a broad evaluation by specialists, AMIE was rated superior on 29 of 32 non-multimodal axes (including history-taking, management reasoning, and empathy) and on 7 of 9 new axes designed specifically for handling multimodal data.
- Patient actors rated AMIE highly on communication and empathy, and were more likely to report being "happy to return in future" for a consultation with AMIE than with a human physician.
- The system maintained higher diagnostic accuracy than PCPs when interpreting lower-quality images.
Design
- Built on Gemini 2.0 Flash and implements a novel "state-aware dialogue framework" where follow-up questions are strategically directed by the system’s uncertainty to emulate an experienced clinician (a toy sketch of uncertainty-directed questioning follows this list).
- Performance was tested using a randomized, double-blind Objective Structured Clinical Examination (OSCE) methodology.
- Study compared AMIE to 19 primary care physicians (PCPs) across 105 multimodal scenarios specifically designed around 35 cases with smartphone photos of skin conditions, 35 with ECG tracings, and 35 with clinical documents.
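DeepMind describes the state-aware framework at a high level rather than as code, but the idea of letting uncertainty steer the next question can be illustrated with a classic expected-information-gain loop. The sketch below is a hypothetical toy, not AMIE's implementation: it keeps a small probability-ranked differential, assumes a table of answer likelihoods per diagnosis, and asks whichever question is expected to shrink the differential's entropy the most.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a distribution {diagnosis: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihoods, question, answer):
    """Bayesian update of the differential after hearing `answer` to `question`."""
    unnorm = {dx: prior[dx] * likelihoods[dx][question][answer] for dx in prior}
    total = sum(unnorm.values())
    return {dx: p / total for dx, p in unnorm.items()}

def expected_information_gain(prior, likelihoods, question, answers=("yes", "no")):
    """Expected reduction in diagnostic uncertainty from asking `question`."""
    gain = entropy(prior)
    for answer in answers:
        p_answer = sum(prior[dx] * likelihoods[dx][question][answer] for dx in prior)
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihoods, question, answer))
    return gain

# Toy differential and answer likelihoods P(answer | diagnosis, question).
prior = {"migraine": 0.5, "tension headache": 0.3, "sinusitis": 0.2}
questions = ["photophobia?", "facial pain or pressure?"]
likelihoods = {
    "migraine":         {questions[0]: {"yes": 0.9, "no": 0.1}, questions[1]: {"yes": 0.2, "no": 0.8}},
    "tension headache": {questions[0]: {"yes": 0.2, "no": 0.8}, questions[1]: {"yes": 0.1, "no": 0.9}},
    "sinusitis":        {questions[0]: {"yes": 0.3, "no": 0.7}, questions[1]: {"yes": 0.9, "no": 0.1}},
}

# The "state-aware" step: ask whichever question most reduces uncertainty.
next_question = max(questions, key=lambda q: expected_information_gain(prior, likelihoods, q))
print("Next question:", next_question)
```

A production system would derive both the differential and the likelihoods from the dialogue itself (and from any images or documents shared); the point here is only how uncertainty can drive question selection.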
Microsoft: Sequential Diagnosis with Language Models
Microsoft's research introduces two key innovations: the Sequential Diagnosis Benchmark (SDBench) and the MAI Diagnostic Orchestrator (MAI-DxO). SDBench is an interactive simulation built from 304 diagnostically challenging cases from the New England Journal of Medicine where an agent must iteratively gather evidence under real-world cost constraints. MAI-DxO is the model-agnostic AI system designed to excel at this task by simulating a panel of virtual specialists to guide its reasoning process.
Key Findings
- When paired with OpenAI's o3 model, MAI-DxO achieved 80% diagnostic accuracy—a four-fold improvement over the 20% average of generalist physicians on the same difficult cases. An ensemble configuration reached up to 85.5% accuracy.
- The human baseline was established with 21 experienced physicians, the best of whom achieved 41% accuracy on the test cases.
- The MAI-DxO system also reduced diagnostic costs by 20% compared to human physicians and by 70% compared to using the o3 model off-the-shelf.
- Performance gains from the MAI-DxO framework generalize across a wide range of AI models, providing an average accuracy boost of 11 percentage points.
How It Works
- The benchmark's core innovation is an interactive simulation where an agent must query a "Gatekeeper" model for information and order diagnostic tests, with each action having an associated monetary cost.
- MAI-DxO's architecture simulates a panel of specialists with specific roles (a simplified sketch of the orchestration loop follows this list):
  - Dr. Hypothesis maintains a probability-ranked list of the most likely conditions.
  - Dr. Test-Chooser selects the most informative tests.
  - Dr. Challenger acts as a devil's advocate, looking for potential bias and contradictory evidence.
  - Dr. Stewardship, the cost-conscious member of the panel, enforces efficient care.
  - Dr. Checklist performs quality control to ensure internal consistency.
- Physicians in the study were not permitted to use external resources like search engines, which they would use in normal practice.
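Microsoft describes the orchestrator in terms of roles rather than code, so the following is a deliberately simplified, hypothetical sketch of the control loop, not the paper's implementation: hardcoded functions stand in for the prompted LLM roles, a lookup-table Gatekeeper stands in for the benchmark's case oracle, and the Challenger and Checklist roles are omitted for brevity.

```python
from dataclasses import dataclass

# Hardcoded stand-ins for the prompted LLM "panel" roles in this toy example.

def rank_hypotheses(evidence):
    """Dr. Hypothesis: maintain a probability-ranked differential from the evidence."""
    if any("blasts" in result for _, result in evidence):
        return {"acute leukemia": 0.85, "viral infection": 0.10, "anemia": 0.05}
    return {"viral infection": 0.50, "anemia": 0.30, "acute leukemia": 0.20}

def choose_test(differential, already_ordered):
    """Dr. Test-Chooser: pick the most informative test that has not been ordered yet."""
    preferences = ["complete blood count", "peripheral smear", "bone marrow biopsy"]
    return next((t for t in preferences if t not in already_ordered), None)

def within_budget(spent, price, budget):
    """Dr. Stewardship: veto any test that would exceed the spending budget."""
    return spent + price <= budget

@dataclass
class Gatekeeper:
    """Stand-in for the benchmark's information oracle: returns findings and bills per test."""
    findings: dict
    prices: dict

    def order(self, test):
        return self.findings.get(test, "unremarkable"), self.prices.get(test, 100)

def run_panel(gatekeeper, budget=1500.0, confidence=0.8):
    """Iteratively gather evidence until the leading diagnosis is confident enough."""
    evidence, spent = [], 0.0
    differential = rank_hypotheses(evidence)
    while max(differential.values()) < confidence:
        test = choose_test(differential, [t for t, _ in evidence])
        if test is None or not within_budget(spent, gatekeeper.prices.get(test, 100), budget):
            break
        result, price = gatekeeper.order(test)
        spent += price
        evidence.append((test, result))
        differential = rank_hypotheses(evidence)
    return max(differential, key=differential.get), spent

gatekeeper = Gatekeeper(
    findings={"complete blood count": "pancytopenia", "peripheral smear": "circulating blasts noted"},
    prices={"complete blood count": 30, "peripheral smear": 75, "bone marrow biopsy": 1200},
)
diagnosis, total_cost = run_panel(gatekeeper)
print(f"Final diagnosis: {diagnosis} (total spend: ${total_cost:.0f})")
```

The design choice worth noting is that accuracy and cost are optimized together: the loop stops either when the leading hypothesis is confident enough or when the next test would not be worth its price.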
Each of these methodologies marks a significant step forward, together contributing to a far more complete and rigorous standard for what constitutes clinical competence in AI.
Overlaps and Divergences
These benchmarks reveal both common themes and distinct, complementary areas of focus. All three studies shift away from simple, single-score metrics toward a more holistic evaluation of an AI's professional skills, and all rely heavily on expert human feedback, from specialist physicians to trained patient actors, to validate their results.
However, their divergences are what make them uniquely powerful. Each benchmark is designed to measure a different, critical aspect of clinical competence:
- HealthBench is uniquely focused on the granular safety and reliability of static information. Its rubric-based system is designed to validate the factual accuracy and completeness of any individual statement an agent might make.
- AMIE is uniquely focused on the dynamic skills of an interactive, multimodal dialogue. Its OSCE methodology is designed to test an agent's ability to communicate effectively while also strategically requesting and interpreting clinical artifacts like skin photos and ECGs.
- Sequential Diagnosis is uniquely focused on the practical application of strategic action and resource management. Its simulation is designed to test an agent's ability to use tools (like ordering tests) and make efficient, cost-conscious decisions to solve a complex problem.
All three assess accuracy, but they do so at different levels of abstraction: HealthBench at the level of the statement, AMIE at the level of the dialogue, and Sequential Diagnosis at the level of the overall strategy.
Open Questions and the Path Forward
While these papers raise a number of exciting possibilities, they also surface critical questions that must be addressed on the path to real-world deployment. The questions that follow just begin to scratch the surface:
- Real-World Efficacy and Integration: How will these agents generalize from the complex cases in the benchmarks to the high volume of common conditions in routine practice? How will they integrate with existing clinical workflows and EHR systems?
- Explainability and Interpretability: Beyond being accurate, how can these complex agents explain their reasoning to a clinician? Moving past a "black box" is critical for building the trust required for physicians to act on an AI's recommendations.
- Economics and Reimbursement: How will this technology be paid for? Establishing a viable business model and clear pathways for insurance and hospital reimbursement is a fundamental hurdle to widespread, equitable access.
- Technical, Legal, and Ethical Frameworks: Major gaps still exist, requiring the creation of frameworks to handle multimodal data like clinical imaging, as well as clear legal and regulatory standards for the profound ethical questions of data privacy, algorithmic bias, and accountability for harmful errors.
- Human Factors and Trust: Successful adoption hinges on solving key human-centric challenges for two distinct groups:
  - For Patients: This requires designing systems that can earn their trust through reliable, empathetic, and respectful interaction.
  - For Clinicians: This requires mitigating the dual risks of over-trusting AI recommendations and long-term skill degradation from over-reliance on these powerful tools.
More than Performance Metrics
These studies offer diverse ways of understanding the multifaceted nature of clinical competence in AI, and a powerful glimpse of the future. Their most immediate and crucial value is in de-risking the development process at the very beginning: they provide comprehensive, evidence-based methodologies for selecting the best possible foundation model to serve as the starting point for a custom clinical agent. By ensuring this starting point is as safe, conversationally adept, and strategically sound as possible, these new evaluations represent a major step forward in our ability to build the next generation of trustworthy AI for healthcare.