In a blinded evaluation across 100 three-visit scenarios spanning cardiology, pulmonology, neurology, OBGYN/urology, and GI, each grounded in NICE and BMJ Best Practice guidelines, AMIE's investigation and treatment recommendations were consistently more precise. Specialist graders found AMIE's plans were exact matches significantly more often: 99% vs. 84% for investigations in second visits and 100% vs. 88% in third visits, with explicit citations to guideline sources.
MIRA, a separate AI agent, takes a different approach. Rather than focusing on conversation, it operates within electronic health records themselves. According to a LinkedIn post from the Else Kröner Fresenius Center for Digital Health, a research team led by Jakob Nikolas Kather developed MIRA to autonomously evaluate medical information, order tests, and prepare diagnostic and treatment decisions within a simulated hospital information system using real patient cases.
In retrospective simulations, MIRA achieved an overall diagnostic accuracy of 87.8%, compared with 78.1% for board-certified physicians, with particularly notable improvements in specific diagnoses like pancreatitis (95.2% vs. 78.6%) and appendicitis (100% vs. 88%). MIRA ordered more blood tests than physicians (51% vs. 28%), but partially offset this by ordering substantially fewer imaging scans. The system had access to over 85,000 possible clinical actions and was tested against 880 adversarial prompts designed to trick the AI, with no cases showing data leakage.
Despite these numbers, both systems face the same sobering reality: they have not been tested on real patients in real clinical settings. This gap between simulation and reality is the dominant theme in independent expert reviews of healthcare AI agents.
A comprehensive scoping review in PMC identifies the primary challenge bluntly: "the lack of robust clinical validation in real-world environments." Evaluations frequently rely on idealized, synthetic datasets rather than handling the sparsity, imbalance, and missing values characteristic of real-world patient records. The review concludes that the actual clinical efficacy of these systems remains largely unconfirmed by prospective studies.
Other major limitations include:
Scientists commenting on the two Nature studies acknowledged their significance while emphasizing the distance to clinical use. An expert reaction gathered by the UK Science Media Centre described the work as "a qualitative leap forward" but noted that the results come from controlled settings.
Broader research on AI agents in healthcare identifies several persistent concerns:
Interpretability and trust. When AI agents generate diagnostic or therapeutic recommendations, the underlying reasoning often lacks transparency, making it difficult for clinicians to trace how a conclusion was reached. This undermines trust and limits adoption.
Accountability ambiguity. In the event of an erroneous outcome, it remains unclear who bears legal and ethical responsibility: the AI developer, the deploying institution, or the overseeing clinician. Current regulatory frameworks, including those from the FDA and EU AI Act, do not yet address the unique governance needs of agentic medical AI.
Generalizability. Systems trained and evaluated on data from single medical centers or specific datasets (like MIMIC-IV) may not perform reliably across different demographic populations, administrative protocols, or healthcare systems.
Over-reliance risk. Even researchers developing guardrailed versions of AMIE (called g-AMIE) acknowledge the need for explicit constraints. g-AMIE is designed to perform history taking while abstaining from giving individualized medical advice, an explicit acknowledgment that unrestricted autonomous use is premature and potentially dangerous.
Neither AMIE nor MIRA is ready for routine clinical deployment. AMIE has taken early steps toward real-world validation with a prospective, single-arm feasibility study conducted in partnership with Beth Israel Deaconess Medical Center, which explored how AMIE could gather patient information before ambulatory primary care visits. A separate urgent-care feasibility study with 100 patients evaluated AMIE's pre-visit history-taking and found its differential diagnoses included the final diagnosis in 90% of cases, with 75% top-3 accuracy, although primary care physicians still outperformed AMIE on the practicality and cost-effectiveness of management plans.
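For readers unfamiliar with the two figures quoted above, the following is a minimal sketch of how "final diagnosis included in the differential" and "top-3 accuracy" are typically computed. It is not the study's actual scoring procedure: the function name `top_k_accuracy`, the toy cases, and the use of exact string matching are all assumptions made for illustration (a real evaluation would more likely rely on clinician adjudication of whether two diagnoses match).

```python
from typing import Optional, Sequence


def top_k_accuracy(
    ranked_differentials: Sequence[Sequence[str]],
    final_diagnoses: Sequence[str],
    k: Optional[int] = None,
) -> float:
    """Fraction of cases whose final diagnosis appears in the first k entries
    of the ranked differential. With k=None the whole list is searched.

    Hypothetical helper: exact string matching is an illustrative assumption.
    """
    if len(ranked_differentials) != len(final_diagnoses):
        raise ValueError("need one differential per final diagnosis")
    if not final_diagnoses:
        raise ValueError("no cases to score")

    hits = 0
    for differential, truth in zip(ranked_differentials, final_diagnoses):
        candidates = differential if k is None else differential[:k]
        if truth in candidates:
            hits += 1
    return hits / len(final_diagnoses)


# Toy data (invented, not from the study): ranked differentials, most likely first.
differentials = [
    ["appendicitis", "gastroenteritis", "urinary tract infection"],
    ["migraine", "tension headache", "sinusitis", "cluster headache"],
    ["asthma", "COPD", "bronchitis"],
    ["GERD", "gastritis", "peptic ulcer"],
]
final = ["appendicitis", "cluster headache", "bronchitis", "pancreatitis"]

included_anywhere = top_k_accuracy(differentials, final)   # 3 of 4 cases
top_3 = top_k_accuracy(differentials, final, k=3)          # 2 of 4 cases
print(f"included anywhere: {included_anywhere:.0%}, top-3: {top_3:.0%}")
```

In the toy data the second case lists the correct diagnosis fourth, so it counts toward the "included" figure but not toward top-3, which is why the first number can be higher than the second, as in the reported 90% and 75%.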
These are promising early signals, but they are feasibility studies—not the randomized controlled trials that would be needed to demonstrate improved patient outcomes. The healthcare AI agent field broadly faces a set of unresolved challenges before deployment becomes realistic:
The June 2026 studies represent genuine progress. AMIE has evolved from single-encounter diagnostic conversations to multi-visit disease management with guideline-grounded reasoning. MIRA has demonstrated that an autonomous agent can navigate electronic health records and order tests and treatments at levels comparable to, or exceeding, those of physicians in carefully controlled retrospective simulations.
But the gap between these demonstrations and real-world medicine remains wide. As the scoping review on healthcare AI agents concludes, most evaluations rely on datasets and settings that do not reflect the noise, ambiguity, resource constraints, and accountability demands of actual clinical environments.
The next critical milestones will be prospective studies on real patients, clear regulatory frameworks, demonstrated safety across diverse populations, and evidence that these systems improve outcomes rather than just matching clinician performance in test scenarios. Until then, AMIE and MIRA remain impressive research prototypes—not replacements for human physicians, but potential future tools that require substantial further validation before they belong in any clinic.