Use of artificial intelligence (AI) platforms to answer typical questions about menopause and hormone therapy revealed low accuracy rates across four different large language models (LLMs), according to research presented at The Menopause Society 2025 Annual Meeting in Orlando, Florida.
Though the well-known ChatGPT 3.5 platform had the highest accuracy for typical patient questions, it still answered barely more than half of them correctly and only a third of typical clinician questions. Google’s Gemini platform performed even worse for questions that both patients and clinicians might ask.
“Generative artificial intelligence has rapidly advanced and is now explored in healthcare as a resource for both patient and clinician education,” Jana Karam, MD, postdoctoral research fellow at the Mayo Clinic, Jacksonville, Florida, told attendees. “As large language models are increasingly used to answer medical queries, evaluating their performance in providing accurate and reliable information is essential.”
Mindy Goldman, MD, the chief clinical officer at Midi Health and clinical professor emeritus at the University of California, San Francisco, was not involved with the study, but she told Medscape Medical News she was not surprised by the findings.
“Although most everyone in medicine now uses AI in some contexts, my understanding has been that one cannot always be sure of the accuracy of responses, and clinicians should always check the references,” Goldman said. “Even when using OpenEvidence, my usual way of assessing responses is to take the references and do a PubMed search for similar articles to confirm any findings.”
Goldman even conducted her own test by asking an LLM about its accuracy and received a response saying that “generative AI’s accuracy is highly inconsistent and varies drastically by domain, task complexity, and the specific model used.” The response she received went on to mention that AI can generate “plausible but potentially inaccurate information” as well as hallucination, which refers to the wholesale creation of false, or “imaginary,” information.
For providers, “this study highlights the need to check the references and additional sources, such as The Menopause Society and ACOG [American College of Obstetricians and Gynecologists], as well as doing their own PubMed searches before assuming something is true,” Goldman said.
It is important for providers to educate their patients about not assuming that whatever answers they are getting from AI are 100% correct, she said. They should also look to information provided by organizations like The Menopause Society and ACOG. “Just don’t accept a simple response from an AI tool!”
Karam and her colleagues input 35 questions — 20 that patients might enter and 15 that clinicians would — into four different AI systems: the free ChatGPT 3.5, the paid ChatGPT 4.0, Gemini from Google, and OpenEvidence. For the test with OpenEvidence, the researchers entered only the clinician-level questions.
Then four expert reviewers, who did not know which LLM was used for each set of questions, evaluated the answers provided by each AI platform and compared how well those responses aligned with clinical guidelines. Responses considered fully correct received 2 points, those that were incomplete or missing key information received 1 point, and those that were incorrect received 0 points.
The researchers also assessed the readability of the patient-level responses using both word count and the Flesch Reading Ease score, which gives a rating from 0 to 100, with lower ratings indicating more complex responses that are less accessible for patients.
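For context, the Flesch Reading Ease score is derived from average sentence length and average syllables per word. The sketch below is a minimal illustrative calculation of that formula — using a deliberately naive syllable counter — and is not the tooling the study authors used.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate easier reading; lower scores indicate more complex text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    word_count = max(1, len(words))

    def count_syllables(word: str) -> int:
        # Rough heuristic: count runs of vowels, with a floor of one syllable per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (word_count / sentences) - 84.6 * (syllables / word_count)

# A plain-language sentence scores higher (easier) than a dense clinical one.
print(round(flesch_reading_ease("Take the pill once a day with food."), 1))
print(round(flesch_reading_ease(
    "Systemic estrogen therapy ameliorates vasomotor symptomatology in perimenopausal patients."), 1))
```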
For patient-level questions, ChatGPT 3.5 showed the highest accuracy but still had only 55% of responses judged as correct. The paid ChatGPT 4.0, meanwhile, had only 40% correct answers, and Gemini performed most poorly, with less than a third of the answers (30%) deemed accurate.
Although the responses from all three LLMs had similar word counts (P = .12), readability differed significantly across the platforms. Though it was least accurate, Gemini scored best in readability, with a score of 38.9. ChatGPT 4.0, meanwhile, had the most complex responses, with a score of 26.5.
Gemini was even less accurate for clinician-level questions, answering only 20% of them correctly. And while ChatGPT 3.5 did best for patient-level questions, it answered only a third (33%) of clinician-level questions accurately. ChatGPT 4.0 did slightly better, with 40% of its responses judged as accurate, the same rate as OpenEvidence. More than half the answers provided by OpenEvidence (53%) were incorrect, and both ChatGPT 3.5 and Gemini gave 40% incorrect answers, while ChatGPT 4.0 fared somewhat better, with a third (33%) of its responses being inaccurate.
The responses that were incomplete ranged from 7% to 40% across the different platforms for both the patient-level and clinician-level questions. For example, Gemini did not have any incomplete responses for patient questions — all of them were either correct or incorrect — but it had just as many incomplete answers (40%) as incorrect ones for the clinician questions.
“Because this is all still relatively new, the models themselves are still learning,” Goldman said. “I would expect if this study were repeated in the near future, the findings might differ somewhat.”
Goldman also noted that the study is fairly small and that it makes the assumption that the expert reviewers would all provide answers similar to one another in their own responses to the questions used in the testing.
The research received no external funding, and Karam reported having no disclosures. Goldman reported having no disclosures beyond her employment at Midi Health.