A recording of the session is available here (at the 12:06 mark).
Opening statement before the House of Commons Standing Committee on Access to Information, Privacy and Ethics (ETHI)
February 7, 2022 by video conference
Ottawa, Ontario
Statement by Dr. Khaled El Emam
(Check against delivery)
Good day, Mister Chair and members of the Committee.
The purpose of my remarks is to offer an overview of de-identification. As someone who has worked in this area for close to 20 years in both academia and industry, perhaps this is where I can be helpful to the Committee’s study. I cannot comment on the specifics of the approach taken by Telus and PHAC, as I do not have that information. My focus is on the state of the field and practice.
It is important to clarify terminology. Terms like anonymization, de-identification, and aggregation are often used interchangeably, but they do not mean the same thing. It is more precise to talk about the risk of re-identification. And the objective when sharing datasets for a secondary purpose, as is the case here, is to ensure that the risk of re-identification is very small.
There are strong precedents on the definition of very small risk, which come from data releases by Health Canada, from guidance by the Ontario privacy commissioner, and from applications by European regulators and health departments in the US. Therefore, accepting a very small risk is typically not controversial, as we rely on these precedents that have worked well in practice.
If we set the standard as zero risk, then all data would be considered identifiable or considered personal information. This would have many negative consequences for health research, public health, drug development, and the data economy in general in Canada. In practice, a very small risk threshold is set, and the objective is to transform data to meet that threshold.
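As an illustration of checking data against a threshold, the following is a minimal sketch and not a method described in this statement. It assumes risk is measured as one divided by the number of records sharing the same combination of indirectly identifying fields, which is only one of several possible measures. The field names, the toy records, and the `THRESHOLD` value are hypothetical and are not recommended values.

```python
from collections import Counter


def max_reidentification_risk(records, quasi_identifiers):
    """Return the highest per-record risk, taken here as 1 / (size of the
    group of records sharing the same quasi-identifier values)."""
    if not records:
        return 0.0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The smallest group contains the records that are easiest to single out.
    return 1.0 / min(groups.values())


# Hypothetical toy dataset: three records in one group, two in another.
records = [
    {"birth_year": 1980, "region": "K1A"},
    {"birth_year": 1980, "region": "K1A"},
    {"birth_year": 1980, "region": "K1A"},
    {"birth_year": 1975, "region": "M5V"},
    {"birth_year": 1975, "region": "M5V"},
]

# Illustrative only: sized for this tiny example, not a recommended threshold.
THRESHOLD = 0.5

risk = max_reidentification_risk(records, ["birth_year", "region"])
meets_threshold = risk <= THRESHOLD
print(risk, meets_threshold)
```

If the measured risk exceeds the threshold, the data would be transformed further and the check repeated.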
There are many kinds of transformations to reduce the risk of re-identification. For example, dates can be generalized. Geographic locations can be reduced in granularity. Noise can be added to data values. We can create synthetic data, which is fake data that retains the patterns and statistical properties in the real data, but for which there is no one-to-one mapping back to the original data. Other approaches that involve cryptographic schemes can also be used to allow secure data analysis.
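The first three transformations mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the choice to keep only year and month, the three-character postal prefix, and the noise scale are all hypothetical, and real projects choose these based on a measured risk. Synthetic data generation and cryptographic schemes are not shown.

```python
import random
from datetime import date


def generalize_date(d):
    """Generalize a full date to year and month only."""
    return d.strftime("%Y-%m")


def coarsen_postal_code(code):
    """Reduce geographic granularity by keeping only the first three
    characters of a postal code."""
    return code.replace(" ", "")[:3].upper()


def add_noise(value, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return value + rng.gauss(0.0, scale)


# Hypothetical record; none of these fields come from the statement.
record = {
    "visit_date": date(2021, 3, 17),
    "postal_code": "K1A 0B1",
    "weight_kg": 72.4,
}

rng = random.Random(0)  # seeded only so the sketch is repeatable
transformed = {
    "visit_month": generalize_date(record["visit_date"]),
    "region": coarsen_postal_code(record["postal_code"]),
    "weight_kg": round(add_noise(record["weight_kg"], 1.0, rng), 1),
}
print(transformed)
```

Each transformation trades some precision for lower risk, which is why the choice depends on what the recipient needs from the data.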
All that to say that there is a toolbox of privacy enhancing technologies for sharing of individual-level data responsibly. Each has strengths and weaknesses.
Instead of sharing individual-level data, it is also possible to share summary statistics only. If done well this has a very small risk of re-identification. Because the amount of information in summary statistics is significantly reduced, it does not always meet an organization’s needs. But if it does, it can be a good option. This is how we tend to define “aggregate data.”
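A minimal sketch of sharing summary statistics only is shown below. It assumes that "done well" includes suppressing counts below a minimum cell size, which is a common safeguard but is an assumption here and not something specified in this statement. The `min_cell_size` value and the records are hypothetical.

```python
from collections import Counter


def aggregate_counts(records, key, min_cell_size):
    """Count records per value of `key`, replacing counts smaller than
    `min_cell_size` with None so that small groups are not disclosed."""
    counts = Counter(r[key] for r in records)
    return {k: (n if n >= min_cell_size else None) for k, n in counts.items()}


# Hypothetical individual-level records that would stay with the data holder.
records = [{"region": "K1A"}] * 6 + [{"region": "M5V"}] * 2

# Only this summary would be shared; the cell size of 5 is illustrative.
summary = aggregate_counts(records, "region", min_cell_size=5)
print(summary)
```

The recipient sees a count for the larger region and a suppressed value for the smaller one, and never receives the individual rows.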
In practice, for datasets that are not released to the public, additional security, privacy, and contractual controls must be in place. The risk is managed by a combination of data transformations and these controls. There are models to provide assurance that the combination of data transformations and controls has a very small risk of re-identification overall.
There are other best practices for responsible reuse and sharing of data, such as transparency and ethics oversight. Transparency means informing individuals about the purposes for which their data are used, and can involve an opt-out. Ethics means having some form of independent review of the data processing purposes to ensure they are not harmful, surprising, discriminatory, or just creepy.
Especially for sensitive data, another approach is a white-hat attack on the data. Someone is commissioned to launch a re-identification attack to test the re-identification risk empirically. This can complement the other methods and provide additional assurance.
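One simple form such a commissioned test could take is a linkage attempt against an outside source that contains names. The sketch below is an assumption about how this might look and is not a description of any specific attack. The datasets, field names, and the rule that only a unique match counts as a candidate re-identification are all hypothetical.

```python
def linkage_attack(released, external, quasi_identifiers):
    """Return (released_record, external_record) pairs where exactly one
    external record shares the released record's quasi-identifier values."""
    matches = []
    for r in released:
        key = tuple(r[q] for q in quasi_identifiers)
        candidates = [
            e for e in external
            if tuple(e[q] for q in quasi_identifiers) == key
        ]
        if len(candidates) == 1:  # a unique match is a candidate re-identification
            matches.append((r, candidates[0]))
    return matches


# Hypothetical released dataset (no names) and external dataset (with names).
released = [
    {"birth_year": 1980, "region": "K1A", "diagnosis": "A"},
    {"birth_year": 1975, "region": "M5V", "diagnosis": "B"},
    {"birth_year": 1990, "region": "V6B", "diagnosis": "C"},
]
external = [
    {"name": "Person 1", "birth_year": 1980, "region": "K1A"},
    {"name": "Person 2", "birth_year": 1975, "region": "M5V"},
    {"name": "Person 3", "birth_year": 1975, "region": "M5V"},
]

matches = linkage_attack(released, external, ["birth_year", "region"])
match_rate = len(matches) / len(released)
print(len(matches), match_rate)
```

In a real exercise, each candidate match would still need to be verified, since a unique match in the outside source is not proof of a correct re-identification.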
All this means that we have good technical and governance models to enable the responsible reuse of datasets, and there are multiple privacy enhancing technologies, mentioned above, available to support data reuse.
Is everyone adopting these practices? No. One challenge is the lack of clear pan-Canadian regulatory guidance or codes of practice for creating non-identifiable information that take into consideration the enormous benefits of using and sharing data and the risks of not doing so. Such guidance, along with more clarity in law, would reduce uncertainty, provide clear direction on what reasonable, acceptable approaches are, and enable organizations to be assessed or audited to demonstrate compliance. While there are some efforts, for example by the Canadian Anonymization Network, it may be some time before they produce results. I have written a white paper with 10 recommendations for regulating non-identifiable data, which the Committee may wish to review.
To conclude, while I have not assessed the measures taken in this situation, I hope my comments assist the Committee’s work.
Thank you and I welcome your questions.