Prediction modelling in big data: exploring methodological challenges and optimising approaches
Over the past ten years, efforts have been made to establish gold standard methods for the development of clinical prediction models for use in routine clinical practice. Led by the PROGnosis RESearch Strategy (PROGRESS) partnership, guidelines have been developed that clearly define methods for investigating and reporting prognostic risk factors and for developing, validating and implementing new prognostic models. These methodological approaches can be easily implemented in a variety of statistical software packages.
The availability of large datasets of routine electronic health records such as the CPRD and QResearch has led to an increase in the number of researchers wishing to develop clinical prediction models using large-scale data. Such data offer a broad population with detailed phenotyping and large statistical power to explore associations. However, these benefits are accompanied by problems that arise when implementing complex and statistically intensive procedures such as imputation, fractional polynomials, bootstrapping and modelling with competing risks. When these procedures are used in conjunction, which is usually the case, they can become computationally infeasible. In the context of big data, it is currently unclear what the most efficient way of implementing these methods is, how much bias is introduced by not using them, and which methods should be prioritised over others.
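To illustrate why combining these procedures becomes costly, the following is a rough sketch in Python. It assumes the steps are fully nested, so that every candidate model is refitted within every bootstrap sample of every imputed dataset. The function names and all numeric settings are hypothetical and chosen only for illustration; they are not taken from any specific study or guideline.

```python
def total_model_fits(n_imputations: int, n_bootstraps: int, n_candidate_fits: int) -> int:
    """Count model fits when imputation, bootstrapping and model selection are fully nested."""
    return n_imputations * n_bootstraps * n_candidate_fits


def estimated_hours(total_fits: int, seconds_per_fit: float) -> float:
    """Convert a number of model fits into wall-clock hours, assuming no parallelism."""
    return total_fits * seconds_per_fit / 3600


# Hypothetical settings (assumptions for illustration only):
# 10 imputed datasets, 200 bootstrap samples per dataset,
# 50 candidate fits per sample (e.g. a fractional polynomial search),
# and 60 seconds per fit on a large database.
fits = total_model_fits(n_imputations=10, n_bootstraps=200, n_candidate_fits=50)
hours = estimated_hours(fits, seconds_per_fit=60.0)

print(f"Total model fits: {fits}")
print(f"Estimated run time: {hours:.0f} hours")
```

Under these assumed settings the cost grows multiplicatively with each added procedure, so reducing any one factor (for example, fewer bootstrap samples) cuts the total proportionally.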
This ARC-funded DPhil studentship will be based within the Medical Statistics Group and will work with the CPRD and Stratified Treatments Research groups to explore in detail the challenges associated with using big data for the development of clinical prediction models. Specifically, the student will seek to answer the following questions:
- According to previous literature, to what extent have statistical methods for prognostic research been optimised for use with large databases of electronic health records?
- What are the benefits of using the largest sample size possible in prognostic research and is it necessary?
- What is the added value of implementing complex statistical methods to develop clinical prediction models in large databases of routine electronic health data? To what extent is model performance affected by not using them?
- Which methods should be prioritised above others in order to maximise model performance without compromising the feasibility of the whole modelling procedure? Is it possible to rank approaches according to importance when working with large databases?
This work will extend that of the PROGRESS partnership and provide applied health researchers with a framework for implementing gold standard methods for prognostic research in large scale databases of routine electronic health records.