Millions of people are now turning to AI chatbots for answers about their health — but a major new study warns this trust may be misplaced. The largest user study to date examining how large language models (LLMs) support real people making medical decisions finds that these systems can provide inaccurate, inconsistent, and potentially dangerous advice when users seek help with their own symptoms.

The largest user study of large language models (LLMs) for assisting the general public in medical decisions has found that they present risks to people seeking medical advice due to their tendency to provide inaccurate and inconsistent information.

A new study from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, carried out in partnership with MLCommons and other institutions, reveals a major gap between the promise of LLMs and their usefulness for people seeking medical advice. While these models now excel at standardised tests of medical knowledge, they pose risks to real users looking for help with their own symptoms.

Key findings:

  • No better than traditional methods

Participants used LLMs to investigate health conditions and decide on an appropriate course of action, such as seeing a GP or going to hospital, based on information provided in a series of detailed medical scenarios developed by doctors. Those using LLMs did not make better decisions than participants who relied on traditional methods like online searches or their own judgment.

  • Communication breakdown

The study revealed a two-way communication breakdown. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.

  • Existing tests fall short

Current evaluation methods for LLMs do not reflect the complexity of interacting with human users. Just as new medications are tested in clinical trials, LLM systems should be tested with real people in real-world conditions before being deployed.

'These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,' said Dr Rebecca Payne, GP, lead medical practitioner on the study, Clarendon-Reuben Doctoral Scholar, Nuffield Department of Primary Care Health Sciences, and Clinical Senior Lecturer, Bangor University. 'Despite all the hype, AI just isn't ready to take on the role of the physician.

'Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.'

Real users, real challenges

In the study, researchers conducted a randomised trial involving nearly 1,300 online participants, who were asked to identify potential health conditions and a recommended course of action based on personal medical scenarios. The detailed scenarios, developed by doctors, ranged from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.

One group used an LLM to assist their decision-making, while a control group used traditional sources of information. The researchers then evaluated how accurately participants identified the likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. They also compared these outcomes with the results of standard LLM testing strategies, which do not involve real human users. The contrast was striking: models that performed well on benchmark tests faltered when interacting with people.

They found evidence of three types of challenge:

  • Users often didn’t know what information they should provide to the LLM
  • LLMs provided very different answers based on slight variations in the questions asked
  • LLMs often provided a mix of good and bad information, which users struggled to distinguish

'Designing robust testing for large language models is key to understanding how we can make use of this new technology,' said lead author Andrew Bean, a doctoral researcher at the Oxford Internet Institute.

'In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.'

Download a copy of the study, ‘Clinical knowledge in LLMs does not translate to human interactions’, published in Nature Medicine.

 
