On May 4, 2026, The Indian Express reported on a Harvard Medical School and Beth Israel Deaconess study showing that OpenAI’s o1 and GPT‑4o models correctly identified exact or near‑exact diagnoses in 67% of emergency‑room cases at initial triage, versus 50–55% for attending physicians. With more patient data, o1’s accuracy rose to 82%, matching or beating doctors across later diagnostic stages.
This article aggregates reporting from one news source. The TL;DR is AI-generated from the original reporting; Race to AGI's analysis provides editorial context on the implications for AGI development.
This study is a milestone because it evaluates frontier reasoning models in the messiest possible setting: a live emergency department with incomplete, noisy data. Unlike benchmark leaderboards or synthetic vignettes, these cases force models like OpenAI’s o1 and GPT‑4o to reason under real‑world uncertainty, with minimal prompt engineering and no hindsight. Outperforming ER physicians at initial triage doesn’t mean the systems are ready to practice medicine independently – but it does show they’ve crossed a threshold from ‘good at exams’ to ‘plausible second opinion at the bedside.’([indianexpress.com](https://indianexpress.com/article/technology/artificial-intelligence/harvard-study-ai-doctors-emergency-room-trial-findings-10671938/))
For the AGI race, this matters on two fronts. First, it validates that deeper reasoning architectures can generalize from textbooks and guidelines to complex, time‑pressured environments – a core capability we’d expect on the road to AGI. Second, healthcare is one of the few verticals with both rich data and strong regulatory guardrails; if o1‑class models can be made safe enough for triage support there, similar architectures will quickly migrate into other high‑stakes decision pipelines – finance, security operations, even parts of governance.