Credit: Unsplash/CC0 Public domain
If ChatGPT were turned loose in the emergency department, it might suggest unneeded X-rays and antibiotics for some patients and admit others who didn't need hospital treatment, according to a new UC San Francisco study.
The researchers said that although the model can be prompted in ways that make its responses more accurate, it still is no match for the clinical judgment of a human doctor.
“This is a valuable message for clinicians not to blindly trust these models,” said postdoctoral researcher Chris Williams, MB BChir, lead author of the study, which appeared October 8 in Nature Communications. “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”
Recently, Williams showed that ChatGPT, a large language model (LLM) that can be queried to research clinical applications of AI, was slightly better than humans at determining which of two emergency patients was more acutely unwell, a straightforward choice between patient A and patient B.
With the current study, Williams challenged the AI model with a more complex task: providing the recommendations a doctor makes after initially examining a patient in the emergency department. These include deciding whether to admit the patient, order X-rays or other scans, or prescribe antibiotics.
AI model is less accurate than a resident
For each of the three decisions, the team compiled a set of 1,000 emergency department visits to analyze from an archive of more than 251,000 visits. The sets had the same ratio of “yes” to “no” answers for decisions on admission, radiology, and antibiotics as seen in the UCSF Health emergency department.
Using UCSF’s secure generative AI platform, which has broad privacy protections, the researchers entered doctors’ notes on each patient’s symptoms and exam findings into ChatGPT-3.5 and ChatGPT-4. Then, they tested the accuracy of each set of responses with a series of increasingly detailed prompts.
Overall, the AI models tended to recommend services more often than was needed. ChatGPT-4 was 8% less accurate than medical residents, and ChatGPT-3.5 was 24% less accurate.
Williams said the AI’s tendency to overprescribe could stem from the models being trained on the internet, where legitimate medical advice sites aren’t designed to answer emergency medical questions but rather to send readers to a doctor who can.
“These models are almost fine-tuned to say ‘seek medical advice,’ which is absolutely fine from a general public safety perspective,” he said. “But erring on the side of caution isn’t always appropriate in the emergency setting, where unnecessary interventions could harm patients, strain resources, and lead to higher costs for patients.”
He said models like ChatGPT will need better frameworks for evaluating clinical information before they are ready for the emergency department. The people who design those frameworks will need to strike a balance between ensuring the AI doesn’t miss something serious and preventing it from triggering unneeded exams and expenses.
This means that researchers developing medical applications of AI, as well as the broader clinical community and the general public, must consider where to draw those lines and how much caution to exercise.
“There is no perfect solution,” he said, “but knowing that models like ChatGPT have these tendencies, we’re charged with thinking through how we want them to perform in clinical practice.”
More information:
Chris Williams et al., Nature Communications (2024). www.nature.com/articles/s41467-024-52415-1
Provided by University of California, San Francisco
Citation: Study reveals that in emergency care, ChatGPT overprescribes (October 8, 2024) retrieved October 8, 2024 from
This document is subject to copyright. Except for fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for informational purposes only.