Completed PREVENTION AND POPULATION RESEARCH COMMITTEE - PROJECT Europe PMC

Sample size considerations for oncology studies developing and validating clinical prediction models using machine learning for binary outcomes in low dimensional data settings

Funder	Cancer Research UK
Recipient Organization	University of Oxford
Country	United Kingdom
Start Date	Aug 01, 2022
End Date	Jul 31, 2025
Duration	1,095 days
Data Source	Europe PMC
Grant ID	`PRCPJT-Nov21\100021`

Grant Description

Background: Clinical prediction models (CPMs) are used to predict the probability of cancer patients’ current and future health status.

They are used widely in healthcare and have potential to reduce disease burden, save money for the NHS, and improve patient care.

Advances in healthcare research, such as the availability of ‘big’ data and increasing computational power, have led to a proliferation of CPMs developed using machine-learning (ML) methods. ML methods offer greater modelling flexibility and the ability to model non-linear relationships and interactions.

Though ML has shown promise in high dimensional settings (e.g., imaging), several systematic reviews have shown ML-based CPMs to be at high risk of bias in low dimensional oncology settings: they are often developed with sample sizes that are too small and not formally justified, and with inefficient data-splitting approaches for developing and internally testing the CPM.

Evidence-based guidance is needed for developing ML-based CPMs.

Aims: The research aims of this study are to define and validate sample size criteria for studies developing and validating ML-based CPMs for binary outcomes in low dimensional oncology settings and to develop an interactive graphical user interface with worked examples.

Methods: We will conduct a simulation study to define sample size requirements for five common ML techniques used to develop oncology CPMs: decision trees, random forests, gradient-boosting machines, support vector machines and neural networks. We will simulate data using realistic clinical prediction modelling scenarios.
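As a minimal sketch of what simulating one such dataset could look like, the Python snippet below draws a cohort with a binary outcome from a logistic model. The function name `simulate_cohort`, the standard-normal predictors, and the intercept and coefficient values are assumptions made for illustration; the grant does not describe the actual data-generating mechanism.

```python
import math
import random


def simulate_cohort(n, intercept, betas, seed=0):
    """Simulate n patients with len(betas) standard-normal predictors
    and a binary outcome drawn from a logistic model (illustrative only)."""
    rng = random.Random(seed)
    predictors, outcomes = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in betas]
        # Linear predictor and its inverse-logit give the true event risk.
        lp = intercept + sum(b * xi for b, xi in zip(betas, x))
        risk = 1.0 / (1.0 + math.exp(-lp))
        predictors.append(x)
        outcomes.append(1 if rng.random() < risk else 0)
    return predictors, outcomes


# Hypothetical scenario: 1,000 patients, 3 predictors, fairly rare outcome.
X, y = simulate_cohort(n=1000, intercept=-2.0, betas=[0.5, -0.3, 0.8], seed=42)
```

A more negative intercept lowers the outcome rate, which is one way a design factor such as outcome prevalence might be varied across scenarios.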

To validate findings from the simulation study, we will conduct resampling studies on lung, breast, and colon cancer using a real-world clinical cancer cohort, the Surveillance, Epidemiology, and End Results (SEER) Program. For both studies we will vary design factors, such as sample size and outcome rate, in a fully factorial design.
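A fully factorial design crosses every level of each design factor with every level of the others. The sketch below shows one way to enumerate such scenarios in Python. The factor names follow the text (sample size, outcome rate, ML technique), but the specific levels are hypothetical placeholders, not values from the study.

```python
from itertools import product

# Hypothetical factor levels, chosen only to illustrate the crossing.
DESIGN_FACTORS = {
    "sample_size": [250, 500, 1000, 5000],
    "outcome_rate": [0.05, 0.10, 0.30],
    "method": ["decision_tree", "random_forest", "gbm", "svm", "neural_net"],
}


def full_factorial(factors):
    """Return one scenario dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


scenarios = full_factorial(DESIGN_FACTORS)
print(len(scenarios))  # 4 * 3 * 5 combinations
```

Because the number of scenarios is the product of the number of levels per factor, adding a factor or a level multiplies the simulation workload rather than adding to it.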

A CPM will be developed for each scenario and its performance assessed using discrimination and calibration performance measures.
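The grant does not name the specific performance measures. As an illustration, the Python sketch below computes two commonly used ones for binary outcomes: the c-statistic (discrimination) and the observed/expected ratio (a simple calibration summary). The function names and the toy data are assumptions for this example.

```python
def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event has the
    higher predicted risk; tied predictions count as half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event in events:
        for p_non in non_events:
            if p_event > p_non:
                concordant += 1.0
            elif p_event == p_non:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def observed_expected_ratio(y_true, y_prob):
    """Observed number of events divided by the sum of predicted risks;
    a value of 1.0 means predictions agree with outcomes on average."""
    expected = sum(y_prob)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(y_true) / expected


# Toy data, for illustration only.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(y, p))  # 3 of 4 pairs are concordant -> 0.75
print(observed_expected_ratio(y, p))
```

An O/E ratio above 1 suggests the model under-predicts risk on average, and below 1 that it over-predicts. It does not capture miscalibration across the risk range, for which measures such as the calibration slope are typically used.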

We will assess three approaches to hyperparameter tuning, each evaluated with cross-validation: default values, grid search, and random search. We will use the findings to develop an R Shiny app that provides an interactive graphical user interface.
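To make the difference between grid search and random search concrete, the sketch below generates candidate configurations both ways and builds k-fold cross-validation splits, using only the Python standard library. The hyperparameter names (`max_depth`, `learning_rate`), their values, and the `cv_score` callable are hypothetical; in practice the scoring function would fit the model on each training fold and evaluate it on the held-out fold.

```python
import random
from itertools import product


def grid_candidates(space):
    """Grid search: every combination of the listed values."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


def random_candidates(space, n_draws, seed=0):
    """Random search: n_draws configurations sampled independently."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_draws)
    ]


def k_fold_indices(n, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def tune(candidates, cv_score):
    """Pick the candidate with the highest cross-validated score.
    cv_score is a user-supplied callable: config -> float."""
    return max(candidates, key=cv_score)


# Hypothetical search space for a gradient-boosting machine.
space = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1]}
grid = grid_candidates(space)            # 6 configurations
sampled = random_candidates(space, 4)    # 4 sampled configurations
splits = k_fold_indices(n=10, k=5)       # 5 train/test splits
```

Grid search cost grows multiplicatively with each added hyperparameter, whereas random search keeps the number of evaluated configurations fixed at `n_draws`. Using default values corresponds to evaluating a single configuration with no search.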

How the results of this research will be used: We will develop formal sample size guidance, complemented with an R Shiny web app, for researchers developing and validating ML-based CPMs in oncology.

The results will give model developers evidence-based guidance to better design their prediction modelling development and validation studies in oncology. Funders will benefit from better use of their resources and clear guidance for evaluating proposed studies.

All Grantees

No grantees listed
