Objective: Abstract screening is a labor-intensive component of systematic review, involving the repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate the use of large language models (LLMs) to automate abstract screening.

Materials and Methods: LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude 3.5 Sonnet) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best-performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).

Results: On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best-performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity, with a maximal precision of 0.458 on the development dataset, decreasing to 0.145 on the comprehensive dataset, while conferring workload reductions of between 37.55% and 99.11%.

Discussion: Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.

Conclusion: LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
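To make the evaluation concrete, the sketch below (not the authors' code) illustrates how zero-shot binary abstract screening and the reported metrics (sensitivity, precision, and balanced accuracy) could be implemented. The prompt wording, the gpt-4o model choice, the OpenAI client, and the INCLUDE/EXCLUDE label parsing are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch, assuming an OpenAI-style chat API and a simple INCLUDE/EXCLUDE
# prompt. This is not the authors' pipeline; prompt text and model are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def screen_abstract(abstract: str, criteria: str, model: str = "gpt-4o") -> bool:
    """Zero-shot binary classification: True = include, False = exclude."""
    prompt = (
        "You are screening abstracts for a systematic review.\n"
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer with a single word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("INCLUDE")


def screening_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Sensitivity, precision, and balanced accuracy from binary decisions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

Sensitivity is the share of truly relevant records retained and is the primary safety metric in screening; as the abstract notes, precision falls sharply on the comprehensive dataset because relevant records make up only a small fraction of replicated search results.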

Original publication

DOI: 10.1093/jamia/ocaf050
Type: Journal article
Journal: Journal of the American Medical Informatics Association
Publication Date: 01/05/2025
Volume: 32
Pages: 893-904