Data synthesis is a privacy-enhancing technology that aims to produce realistic and timely data when real data is hard to obtain. Multiple synthetic data generators have been introduced in the last decade, fueled by advances in machine learning and by increased demand for fast and inclusive data sharing. However, empirical evidence of their utility has not been fully explored. Investigators assess the utility of synthetic data by calculating a statistical distance between the original and synthesized datasets or, more commonly, by measuring the differences between specific models built from the original and synthetic data. The choice of measures and models is guided by the application of interest, and the resulting conclusions apply to that specific context. None of these assessments offers guidelines or criteria that synthetic data should satisfy in general when released for public use.
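As a rough illustration of the first kind of assessment mentioned above, the sketch below computes one possible statistical distance, the Hellinger distance, between the empirical distributions of a single categorical attribute in an original and a synthetic sample. The choice of metric, the function name hellinger_distance, and the toy data are assumptions made for this example; they are not taken from the research presented in the webinar.

```python
import math
from collections import Counter


def hellinger_distance(real, synthetic):
    """Hellinger distance between the empirical distributions of two
    categorical samples. Returns a value in [0, 1]; 0 means the two
    samples have identical category frequencies."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    p = Counter(real)
    q = Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    # Compare relative frequencies over every category seen in either sample.
    categories = set(p) | set(q)
    total = sum(
        (math.sqrt(p[c] / n_p) - math.sqrt(q[c] / n_q)) ** 2
        for c in categories
    )
    return math.sqrt(total / 2)


# Hypothetical toy data: one attribute from an original and a synthetic dataset.
original = ["yes", "yes", "no", "no", "no"]
synthesized = ["yes", "no", "no", "no", "no"]
distance = hellinger_distance(original, synthesized)
```

A value near 0 suggests the synthetic attribute preserves the original frequencies, while a value near 1 suggests it does not. A single-attribute distance like this says nothing about relationships between attributes, which is one reason a conclusion drawn from any one measure is tied to its context.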
In this webinar, Fida Dankar presents her research on synthetic data utility from two perspectives:
First, to inform on the best strategies to follow when generating synthetic data, an analysis of the effect of various data generation settings on the utility of the generated data is presented.
Second, in relation to the lack of consensus on synthetic data assessment, Fida presents her work on the classification of existing utility metrics into different categories based on the properties they try to preserve, and how this categorisation was used to construct a new utility measure that combines all dimensions of utility.
Dr. Fida Dankar is a Senior Research Associate at EHIL. She received her PhD in Computer Science from the University of Ottawa in 2008. Prior to joining CHEO RI, she was a Visiting Associate Professor at NYU Abu Dhabi. Fida’s research interests focus on developing technologies for the private and secure analysis of personal data.