Protecting Health Information. Ensuring Data Integrity.

Webinar: Sample Size Calculation for Training Ensemble ML Models on Health Data

Date: April 22, 2026
Time: 1:00 pm - 2:00 pm EDT
Registration: Eventbrite

Health research studies often have small sample sizes, yet training machine learning (ML) models requires sufficiently large datasets. There is, however, little published guidance on how to determine an adequate sample size for ML models.

Dr. Nicholas Mitsakakis and Dr. Dan Liu will introduce a novel sample size calculator for ensemble ML models: random forests and two gradient boosted models. The calculator is built upon the certainty curve, which is analogous to statistical power but designed specifically for prognostic ML models. Using a few data characteristics that are easily computed a priori from the given dataset, it predicts the minimum sample size required to achieve a pre-defined level of prognostic performance with a specified probability. The estimation method was trained extensively on 13 large-scale real health datasets covering a wide range of heterogeneous domains. When compared against three common heuristics and a popular statistical method, the calculator was significantly more accurate, achieving much smaller prediction errors. This work provides an estimation method for determining the minimum sample size required for ensemble ML models.
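
To make the certainty curve idea concrete, here is a minimal Python sketch of one way to read it: for each candidate sample size, take the fraction of repeated training runs whose performance meets the target, then pick the smallest size where that fraction reaches the desired probability. This is an illustration of the concept only, not the speakers' calculator, which predicts the required size from data characteristics rather than from repeated runs. The function names and the scores in the example are hypothetical.

```python
from typing import Dict, List, Optional


def certainty_curve(scores_by_n: Dict[int, List[float]], target: float) -> Dict[int, float]:
    """For each sample size, return the fraction of runs whose score meets the target.

    scores_by_n maps a sample size to performance scores (e.g. AUC) from
    repeated training runs at that size. Assumed input format, for illustration.
    """
    return {
        n: sum(score >= target for score in scores) / len(scores)
        for n, scores in sorted(scores_by_n.items())
    }


def minimum_sample_size(curve: Dict[int, float], certainty: float) -> Optional[int]:
    """Return the smallest sample size whose certainty reaches the requested level."""
    for n in sorted(curve):
        if curve[n] >= certainty:
            return n
    return None  # no candidate size reached the requested certainty


# Hypothetical scores from four repeated runs at each sample size.
example_scores = {
    100: [0.70, 0.74, 0.77, 0.72],
    200: [0.76, 0.78, 0.74, 0.79],
    400: [0.78, 0.80, 0.77, 0.81],
}

curve = certainty_curve(example_scores, target=0.75)
required_n = minimum_sample_size(curve, certainty=0.9)
print(curve)
print(required_n)
```

With these made-up numbers, the target of 0.75 is met in 1 of 4 runs at n = 100, 3 of 4 at n = 200, and 4 of 4 at n = 400, so 400 is the smallest size that reaches 90% certainty.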

SPEAKER BIOS:

Dr. Nicholas Mitsakakis is a Senior Biostatistician and Associate Scientist at the Clinical Research Unit, CHEO Research Institute. He is also an Associate Professor in Biostatistics at the Dalla Lana School of Public Health, University of Toronto. Dr. Mitsakakis holds a PhD in Biostatistics from the University of Toronto and Master's degrees in Artificial Intelligence (University of Edinburgh, UK) and Mathematics (University of Athens, Greece). He is accredited as a professional statistician by the Statistical Society of Canada. His interests and expertise include applied machine learning, clinical biostatistics, health data science, and statistical methods for health economics and health-related quality of life.

Dr. Dan Liu is a Postdoctoral Fellow at EHIL, where she is dedicated to solving practical problems in clinical research using advanced AI models, with a primary focus on tabular synthetic data generation. Her research centers on the use of generative AI and machine learning models to enhance data privacy and sharing, improve clinical prediction models, and drive innovative clinical trials and broader health research. Prior to her postdoctoral work, Dan contributed to methodology development in precision medicine and earned her Ph.D. in Statistics from Western University.

To register free of charge, visit our Eventbrite page.