Launched in January 2026 as OpenAI's consumer health tool, ChatGPT Health has reached millions of users. A study recently published in Nature Medicine, however, raises concerns about its reliability in clinical practice: the system struggled to reliably distinguish emergencies from routine clinical cases.
The researchers conducted a structured stress test of the system's triage recommendations, using 60 clinician-authored vignettes spanning 21 clinical domains, each presented under 16 factorial conditions, for a total of 960 responses.
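As a back-of-the-envelope check of the study arithmetic, here is a minimal sketch of how 60 vignettes under 16 factorial conditions yield 960 responses. The factor names are hypothetical placeholders; the paper's actual conditions are not listed here.

```python
from itertools import product

# Hypothetical sketch of the factorial design: 60 vignettes, each
# presented under 16 conditions. Four binary factors give 2**4 = 16;
# the factor names below are illustrative, not the study's variables.
vignettes = range(60)
factors = [
    ("formal", "colloquial"),
    ("first_person", "third_person"),
    ("symptoms_only", "with_history"),
    ("calm", "anxious"),
]

conditions = list(product(*factors))                 # 16 conditions
responses = [(v, c) for v in vignettes for c in conditions]

print(len(conditions), len(responses))               # 16 960
```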
Although ChatGPT performed reasonably well in moderately severe cases, its accuracy declined at the clinical extremes. Among true emergencies, only 48% were correctly classified with a recommendation for immediate emergency care. Classic presentations, such as stroke and anaphylaxis, were appropriately identified, but in the remaining 52% of emergency cases the system underestimated the severity of the condition: patients with diabetic ketoacidosis or impending respiratory failure were advised to seek evaluation within 24-48 hours rather than go directly to the emergency department.
The model also overestimated severity in nonurgent situations: in 65% of cases that physicians considered appropriate for home management, the system recommended an in-person medical evaluation instead.
The researchers found the system's handling of possible suicidal ideation particularly concerning: recommendations to contact crisis intervention services, including the Suicide and Crisis Lifeline, were triggered unpredictably and did not consistently reflect the severity of the situation.
Training Limitations
The authors suggested that clinical extremes, both true emergencies and low-acuity "code white" cases, may be underrepresented in the datasets used to train the AI model; the system appears to be optimized for more "average" clinical presentations.
However, ChatGPT Health may be flawed for a more fundamental reason: like its predecessor, ChatGPT, it lacks true reasoning ability. The training data differ between the two systems. ChatGPT Health draws on scientific articles, educational resources, licensed datasets, databases, and expert-provided data used to reinforce learning, whereas ChatGPT was trained on books and vast amounts of internet content. The underlying mechanism, though, is the same: both systems generate responses by identifying statistical patterns in language rather than through genuine reasoning. In practice, ChatGPT Health produces answers based on the likelihood that certain words and phrases "fit well" together.
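To make that distinction concrete, here is a deliberately tiny toy sketch, not OpenAI's actual architecture, of what generating text from statistical patterns means: the program learns which word tends to follow which in a small corpus, then samples the next word by frequency, with no model of the underlying medicine.

```python
import random
from collections import Counter, defaultdict

# Toy bigram "language model": it learns which word tends to follow
# which, then generates text by sampling from those frequencies.
# Real systems are vastly larger, but the principle is the same:
# next-word likelihood, not clinical reasoning.
corpus = ("chest pain radiating to the arm suggests emergency care "
          "chest pain after heavy meals suggests routine evaluation").split()

follows = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    follows[prev_word][next_word] += 1

def generate(start, max_words=8):
    word, out = start, [start]
    for _ in range(max_words):
        counts = follows.get(word)
        if not counts:
            break
        # Pick the next word in proportion to how often it followed.
        word = random.choices(list(counts), weights=list(counts.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("chest"))  # e.g. "chest pain after heavy meals suggests ..."
```

Whether such a system emits "emergency care" or "routine evaluation" after "suggests" depends only on which phrase was more frequent in its training text, which is exactly the failure mode the triage study describes.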
A recent study published in JAMA Network Open supports this concern. Researchers evaluated 21 generative AI models, including OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok, using 29 clinical cases obtained from widely used medical textbooks. The models were assessed across various stages of the clinical decision-making process, including differential diagnosis, test selection, final diagnosis, and treatment management.
The most significant weaknesses emerged in the early stages of the reasoning process, particularly differential diagnosis, where error rates exceeded 80% across all models. Accuracy improved once the systems were given the full clinical picture, and final diagnoses reached acceptable accuracy.
The findings suggest that these models tend to reach diagnostic conclusions too quickly, without adequately addressing uncertainty or considering a comprehensive differential diagnosis. This differs substantially from traditional clinical methodology, which relies on stepwise hypothesis testing and systematic exclusion of alternative explanations.
The problem is compounded by what some experts describe as an epistemological illusion: Generative AI can produce text that appears fluent, coherent, and authoritative, even when the underlying reasoning is incomplete or incorrect. The smooth flow and narrative consistency of these responses can create an impression of knowledge without guaranteeing factual accuracy.
Taken together, the evidence suggests that despite rapid advances and the emergence of models optimized for reasoning, generative AI systems, including those designed specifically for healthcare, still fall short of the clinical intelligence required for safe implementation: their ability to perform advanced clinical reasoning remains limited.
Therefore, what can be done beyond approaching these tools with caution? First, greater transparency is needed: the datasets used to train these systems should be opened to independent experts so that they can assess what information was included, whether biases were addressed or could persist, and how well the data represent real patient populations. In addition, tools such as ChatGPT Health should undergo rigorous scientific validation and comply with medical device regulations before being adopted in clinical settings.
Most importantly, prospective validation is required before AI-based triage systems are deployed at scale, a standard that should apply to any technology in medicine. These evaluations should more closely reflect real-world clinical practice rather than rely on multiple-choice-style testing, which cannot fully capture the complexity of diagnostic reasoning and clinical decision-making.
Clinicians should avoid relying on these tools whenever possible or restrict their use to the most straightforward clinical scenarios. Even in lower-risk situations, the results generated by these systems should be reviewed carefully and independently verified before being used in clinical decision-making.
Eugenio Santoro is a digital health researcher at the Mario Negri Institute for Pharmacological Research IRCCS in Milan, Italy, where he works in the Laboratory of Medical Informatics and conducts research focused on digital health and digital therapeutics.
https://www.medscape.com/viewarticle/chatgpt-health-ready-emergency-triage-2026a1000ffi