Electronic health records (EHRs) hold great potential for improving the understanding of cancer care by containing high-resolution real-world data for large numbers of patients. This dissertation explores the application of data science and machine learning (ML) methods to EHRs for the purposes of translational colorectal cancer (CRC) research. I first explore the challenges in using EHRs throughout the data life cycle. I present a lightweight information extraction pipeline that retrieves TNM staging scores (common descriptors of cancer severity) from free-text clinical reports with high sensitivity and precision, and also retrieves information about the presence and recurrence of CRC. These data items are essential to CRC research for identifying cases, studying treatment variation, and comparing treatment outcomes. The pipeline was developed using data from Oxford University Hospitals (OUH) and Royal Marsden (RMH) NHS Foundation Trusts (FT), and supported the establishment of the National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC) CRC database. I then focus on a specific application: combining faecal immunochemical test (FIT) results with routinely collected data to predict CRC in symptomatic patients. The current practice is to refer patients with a FIT result above 10 μg/g for invasive endoscopic investigations, but only one in six of those investigated have CRC, which motivates the development of prediction models. I demonstrate that an externally derived model does not outperform FIT in the Oxford University Hospitals FIT dataset (OUH-FIT), and highlight the importance of clinically relevant performance measures. I then show that employing more predictors, a spectrum of ML models, and novel training methods was not sufficient to outperform FIT on OUH-FIT data.
Finally, I build on an existing sequence analysis method and incorporate it into an interactive app that allows users to explore and cluster thousands of medical event sequences, for example to visualise the treatment patterns of CRC patients. The principal contributions are: a holistic discussion of EHR data quality; a staging extraction algorithm that facilitates further research and audits; a comprehensive pipeline for developing and evaluating FIT-based CRC prediction models; and a fast medical sequence exploration app that can help check data quality and identify treatment variations. There is considerable potential to use these tools on larger datasets, both to understand whether FIT-based models are bound to fail (or whether they may work on subgroups with more severe disease) and to contrast the treatment patterns employed for subgroups of CRC patients with complex disease, such as those with liver metastases.

Type

Thesis / Dissertation

Publication Date

14/04/2025

Keywords

TNM staging, faecal immunochemical test, natural language processing, clustering, data quality, colorectal cancer, electronic patient records, classification, machine learning