OBJECTIVES: The 'tumour, node, metastasis' (TNM) classification of colorectal cancer (CRC) predicts prognosis and so is vital to consider in analyses of patterns and outcomes of care when using electronic health records. Unfortunately, it is often only available in free-text reports. This study aimed to develop regex-based text-processing algorithms that identify the reports describing CRC and extract the TNM staging at a low computational cost.

METHODS: The CRC and TNM extraction algorithms were iteratively developed using 58 634 imaging and pathology reports of patients with CRC from the Oxford University Hospitals (OUH) and Royal Marsden (RMH) NHS Foundation Trusts (FT), with additional input from Imperial College Healthcare and Christie NHS FTs. The algorithms were evaluated on a stratified random sample of 400 OUH development data reports and 400 newer 'unseen' OUH reports. The reports were annotated with the help of two clinicians.

RESULTS: The CRC algorithm achieved at least 93.0% positive predictive value (PPV), 72.1% sensitivity, 64.0% negative predictive value (NPV) and 90.1% specificity for primary CRC on pathology reports. On imaging reports, it demonstrated at least 78.0% PPV, 91.8% sensitivity, 93.0% NPV and 80.9% specificity. For the main T/N/M categories, the TNM algorithm achieved PPVs of at least 93.9% (T), 97.7% (N) and 97.2% (M), and sensitivities of 63.6% (T), 89.6% (N) and 64.8% (M). NPVs were at least 45.0% (T), 91.1% (N) and 88.4% (M), and specificities were 95.7% (T), 98.1% (N) and 99.3% (M). Reductions in performance were mostly due to implicit staging. For extracting explicit TNM stages, current or historical, the algorithm made no errors on 400 pathology reports and six errors on 400 imaging reports.

CONCLUSION: The TNM algorithm accurately extracts explicit TNM staging, but other methods are needed for retrieving implicit stages. The CRC algorithm is accurate on non-supplementary reports, but outputs need additional review if higher precision is required.
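To illustrate what regex-based extraction of explicit TNM staging can look like, here is a minimal Python sketch. It is not the published algorithm: the pattern `TNM_PATTERN`, the function `extract_tnm` and the set of accepted prefixes and subcategories are assumptions made for this example, and a real implementation would need many more rules (for instance, to avoid matching MRI terms such as "T1-weighted").

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern for explicit stages such as "pT3 N1b M0" or "ypT2N0".
# Illustration only; it does not reproduce the rules of the published algorithm.
TNM_PATTERN = re.compile(
    r"""
    \b
    (?P<prefix>[ycpra]{0,3})                           # optional prefixes, e.g. y, c, p, r, a
    T(?P<t>is|[0-4][a-d]?|[xX])                        # T category (Tis, T0-T4 with subcategory, TX)
    (?:\s*[ycpra]{0,3}N(?P<n>[0-3][a-c]?|[xX]))?       # optional N category
    (?:\s*[ycpra]{0,3}M(?P<m>[01][a-c]?|[xX]))?        # optional M category
    \b
    """,
    re.VERBOSE,
)


def extract_tnm(report: str) -> List[Dict[str, Optional[str]]]:
    """Return every explicit TNM mention found in a free-text report."""
    results = []
    for match in TNM_PATTERN.finditer(report):
        results.append(
            {
                "prefix": match.group("prefix") or None,
                "T": match.group("t"),
                "N": match.group("n"),  # None when no N category was written
                "M": match.group("m"),  # None when no M category was written
            }
        )
    return results


if __name__ == "__main__":
    print(extract_tnm("Staging: pT3 N1b M0."))
    print(extract_tnm("ypT2N0 after chemoradiotherapy"))
```

A pattern like this only finds stages that are written out explicitly. Implicit staging, where the stage must be inferred from a description of the tumour, is outside its reach, which is consistent with the limitation the abstract reports.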

DOI

10.1136/bmjhci-2025-101521

Type

Journal article

Publication Date

2025-09-21

Volume

32