Using synthetic data as stepping stones in health care innovation
Synthetic data enables researchers and health professionals to work with health data that is not otherwise available. It is still a novel concept, but findings from Denmark’s first synthetic health data hackathon now put a spotlight on its potential.
How to bridge the gap between data privacy and utility
Denmark is home to some of the best and most integrated health data in the world. But for that data to be used to develop personalised and innovative healthcare solutions, it has to be available to researchers, health professionals and companies in a format that complies with data privacy regulations and a wide range of legal, ethical and organisational constraints.
A new report by Copenhagen’s main university hospital, Rigshospitalet, Deloitte and Digital Hub Denmark summarises the findings from a virtual synthetic data hackathon held in November 2020. These findings show how synthetic data can be used as a valuable privacy-preserving technique, and not least as an alternative to real health data.
One of the key findings was that combining synthetic data with real data was valuable: it increased the size of the data set and thereby improved the performance of artificial intelligence models.
Lasse Westergaard Folkersen, Genetics Expert at The Danish National Genome Center, was one of the mentors in the hackathon.
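The idea of enlarging a real data set with synthetic rows can be sketched in a few lines of Python. This is a minimal illustration and not the method used at the hackathon: the generator is a deliberately naive per-class Gaussian, the function names (`fit_gaussians`, `sample_synthetic`, `augment`) are hypothetical, and the two features (a glucose-like and a BMI-like value) are made-up numbers, not real patient data.

```python
import random
import statistics


def fit_gaussians(rows):
    """Fit an independent Gaussian per feature for each class label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    params = {}
    for label, feature_rows in by_label.items():
        columns = list(zip(*feature_rows))
        params[label] = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return params


def sample_synthetic(params, n_per_label, rng):
    """Draw synthetic rows from the fitted per-class Gaussians."""
    return [
        ([rng.gauss(mu, sd) for mu, sd in stats], label)
        for label, stats in params.items()
        for _ in range(n_per_label)
    ]


def augment(real_rows, n_per_label, seed=0):
    """Return the real rows followed by newly sampled synthetic rows."""
    rng = random.Random(seed)
    synthetic = sample_synthetic(fit_gaussians(real_rows), n_per_label, rng)
    return real_rows + synthetic


# Made-up toy rows: ([glucose-like value, BMI-like value], label)
real_rows = [
    ([5.1, 24.0], 0), ([5.4, 26.5], 0), ([5.0, 23.1], 0),
    ([7.9, 31.2], 1), ([8.4, 33.0], 1), ([7.6, 30.4], 1),
]
augmented = augment(real_rows, n_per_label=50)
print(len(real_rows), "real rows ->", len(augmented), "rows after augmentation")
```

The combined list can then be passed to any model in place of the real rows alone. Whether this improves performance depends on how faithful the generator is, which has to be measured on held-out real data.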
“Though I have worked a lot with data, I was probably one of those who learned the most about synthetic data. Being able to generate synthetic genomes could break a lot of regulatory barriers for genome centres around the world, so I really see the usability of it and hope to see synthetic genomic data soon”, Lasse says.
Denmark’s first synthetic health data hackathon
The huge unexplored potential in synthetic health data was one of the reasons why Rigshospitalet hosted Denmark’s first synthetic health data hackathon in partnership with Digital Hub Denmark and The Ministry of Industry, Business, and Financial Affairs.
79 researchers and students gathered virtually from all over the world to participate in the Synthetic Health Data Hackathon 2020. In order to create awareness about the new possibilities, the participants had to work on diabetes and Alzheimer’s challenges based on synthetic data sets.
According to Professor Henning Langberg, Chief Innovation Officer at Rigshospitalet and the PI of SHARED, an international project on synthetic health data, it is important to raise awareness of synthetic data’s potential among organisations and governments, not only in Denmark but also on a global scale:
“I was impressed by the level of knowledge and innovation that was present at the hackathon. The findings from the hackathon indicate that synthetic data can supplement real health data and thereby improve AI predictive performance. Providing Danish health data as synthetic data sets has a huge potential and should be showcased to attract more research and industry to Denmark.”
Langberg is not the only one who can see the value of the method. One of the hackathon participants says:
“I definitely see the value in providing synthetic data for privacy preservation, as well as accelerating development and augmenting scarce data sources. I learned a lot about health data and how important domain knowledge is in order to validate the quality of the data”.
Synthetic data is on the rise
Corti.ai, an industry leader in healthcare AI, is one of the Danish healthtech companies that use more complex data types and have experimented with the use of synthetic data.
“The ‘cutting-edge’ within machine learning research is to define models that understand how the data is generated, contrary to the neural network models that are merely able to discriminate between data-points, e.g., an image being of a dog or a cat. These so-called generative models have enormous potential in generating synthetic data that is very similar to the complexity of the data generated in the real world”, says Lars Maaløe, Co-founder and CTO of Corti.ai.
One of Corti’s recently published technologies is BIVA, a novel variational autoencoder that was also tested for generating synthetic text.
But even though synthetic data can enrich real data by imputing missing variables or increasing the amount of data, it cannot mimic real data perfectly. It is therefore essential to be able to identify whether any variable in the synthetic data is less reliable than the others.
“I can see value in the ability to begin the analysis before the real data is available. However, one should be aware of flaws in the data set: the synthetic data did not have the same attributes as real data – and without a higher educational background in biology, you would not be able to spot this difference. That being said, I have become motivated to become better at working with synthetic data”, says one of the participants from the hackathon.
Ultimately, the hackathon findings showed that synthetic data can be useful as a way to allow people to share and work with health data. However, because the method is fairly unexplored and is still new to many, it is also important to underline the limitations when working with synthetic data sets.
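One simple way to spot variables in a synthetic data set that are less reliable than others is to compare each variable’s distribution against the real data. The sketch below is an assumption-laden illustration, not a procedure from the report: the function name `flag_unreliable_variables`, the threshold of 0.5 and the toy values are all hypothetical, and the check only compares means, so it will miss differences in spread or in correlations between variables.

```python
import statistics


def flag_unreliable_variables(real, synthetic, threshold=0.5):
    """Flag variables whose synthetic mean drifts far from the real mean.

    The score is the absolute difference in means divided by the
    standard deviation of the real values for that variable.
    """
    flagged = {}
    for name, real_values in real.items():
        real_sd = statistics.stdev(real_values)
        gap = abs(statistics.mean(synthetic[name]) - statistics.mean(real_values))
        if real_sd > 0:
            score = gap / real_sd
        else:
            # A constant real variable: any drift at all counts as a mismatch.
            score = float("inf") if gap > 0 else 0.0
        if score > threshold:
            flagged[name] = score
    return flagged


# Made-up toy values, one list per variable
real = {"age": [61, 70, 58, 66, 74], "hba1c": [48, 53, 61, 57, 50]}
synthetic = {"age": [63, 68, 60, 65, 72], "hba1c": [70, 74, 69, 77, 72]}

flagged = flag_unreliable_variables(real, synthetic)
print("Variables to review:", sorted(flagged))
```

A statistical check like this only narrows down where to look. As the participants point out, domain knowledge is still needed to judge whether a flagged variable is actually implausible.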
Advancing research and innovation within healthcare is a continuous journey
The findings from the hackathon study are clear: the exploration of synthetic data is just getting started, and the method needs to be explored further with more use cases and more types of data sets. As artificial intelligence in healthcare advances and begins to unlock new potential, there is a need for better access to data.
“I am excited about the increased AI performance we can obtain when augmenting real data with synthetically generated data. But we need to continue to explore the opportunities within the field of developing pattern recognition with machine learning or even predictive purposes with deep learning. I hope that organisations can be inspired by this hackathon and that we will see many more initiatives like this in the future”, says Camilla Rygaard-Hjalsted, CEO of Digital Hub Denmark.
If you want to learn more about synthetic health data, go to the official website of The Synthetic Health And Research Data (SHARED) Project. SHARED is an active research project exploring opportunities in developing methods for and raising awareness of synthetic data in healthcare.
For more information about the study, you may contact:
Rigshospitalet Henning Langberg, Professor & Chief Innovation Officer, email@example.com
Deloitte Thor Hvidbak, Healthcare Client Relationship Executive, firstname.lastname@example.org
Martin Closter Jespersen, Senior Data Scientist, email@example.com
Digital Hub Denmark Louise del Rosario Jensen, Marketing & Communications Specialist, firstname.lastname@example.org