How Synthetic Data is Used in Healthcare
Artificial intelligence and machine learning technologies are revolutionizing healthcare research, particularly in early-indication clinical trial reporting, remote delivery of diagnostics, and analysis of medical imaging data.
To produce groundbreaking insights, artificial intelligence models require massive amounts of unbiased, statistically representative data. In healthcare, that usually means patient data, and using patient data raises privacy concerns. Regulations like the Health Insurance Portability and Accountability Act (HIPAA) prohibit the unauthorized use and disclosure of protected health information, which is any information that could be directly connected to a unique individual.
Information covered under HIPAA includes diagnostic imaging, genetic data, medical histories, Social Security numbers as well as credit card or other financial information. For instance, HIPAA prohibits the release of a cancer diagnosis to an employer without the patient’s consent.
Thus, HIPAA and other data regulations make it difficult to process and utilize patient data, especially across organizational and national boundaries, even though the use of that data could lead to groundbreaking therapies.
One solution is synthetic data: artificially produced data that is designed to avoid any connection to real-life people. Even though a synthetic dataset consists of “fake data,” it is built to resemble a real dataset so that it can be used for artificial intelligence and other applications.
In the healthcare industry, synthetic patient data can be shared among healthcare providers, researchers, and private companies, such as technology companies building AI tools for healthcare. But although this technology can help facilitate data collaboration in healthcare, it is not without its drawbacks.
Synthetic Healthcare Data is Common in the Medical Industry
One of the most common ways to create synthetic data is through the use of neural networks. Real data is fed into a system of neural networks, which learns the statistical patterns in that data and eventually produces a set of synthetic data that very closely resembles the real dataset.
Importantly, the neural network system is designed to produce synthetic data that does not violate data privacy regulations. The system does this by avoiding the passage of any real-life patient data from the training data set to the synthetic data set.
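The fit-then-sample idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: it swaps the neural network for a much simpler multivariate Gaussian model, the two-column "real" table (age, systolic blood pressure) is made up, and the function names are hypothetical. It also counts synthetic rows that exactly duplicate a real row, which is a minimal sanity check and not proof of regulatory compliance.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the column means and covariance matrix of the real rows."""
    return real.mean(axis=0), np.cov(real, rowvar=False)

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw new rows from the fitted model; no real row is copied."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

def count_exact_copies(real, synthetic):
    """Count synthetic rows that are identical to some real row."""
    real_rows = {tuple(row) for row in real.tolist()}
    return sum(tuple(row) in real_rows for row in synthetic.tolist())

# Made-up "real" table: columns are age and systolic blood pressure.
real = np.array([
    [34.0, 118.0],
    [51.0, 131.0],
    [47.0, 126.0],
    [62.0, 140.0],
    [29.0, 112.0],
])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=100)
copies = count_exact_copies(real, synthetic)
print(synthetic.shape, "exact copies of real rows:", copies)
```

A production system would use a far richer generator and stronger privacy tests, but the workflow is the same: learn from the real table, then sample new rows from the learned model.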
Once the synthetic dataset has been created and determined to be fit for purpose, it can be used to train artificial intelligence models. Researchers can also share this synthetic dataset with significantly less concern about compliance violations.
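What "fit for purpose" means depends on the application, but one simple check is to compare summary statistics of the real and synthetic tables. The sketch below assumes numeric columns with non-constant values; the function names, the tolerances, and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def utility_report(real, synthetic):
    """Largest gap in column means and in pairwise correlations."""
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    return {"max_mean_gap": float(mean_gap), "max_corr_gap": float(corr_gap)}

def fit_for_purpose(real, synthetic, mean_tol, corr_tol):
    """True when both gaps stay within the caller's tolerances."""
    report = utility_report(real, synthetic)
    return report["max_mean_gap"] <= mean_tol and report["max_corr_gap"] <= corr_tol

# Made-up data: the second column rises with the first.
real = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.5]])

# A "synthetic" table with the same means but the relationship reversed.
reversed_synth = np.column_stack([real[:, 0], real[::-1, 1]])

print(fit_for_purpose(real, real.copy(), mean_tol=0.5, corr_tol=0.1))
print(fit_for_purpose(real, reversed_synth, mean_tol=0.5, corr_tol=0.1))
```

The second table shows why means alone are not enough: every column average matches, yet the relationship between the columns has flipped, so a model trained on it would learn the wrong pattern.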
One example of synthetic data in healthcare is a mobile app called M-sense. This app is designed to help migraine patients track their condition, gain a deeper understanding of it and reduce migraine symptoms. The app collects data from patients, and that data is used to create synthetic clinical data that migraine researchers can then use for their studies.
Clinical synthetic data has also been applied in research involving recently discovered or rare diseases. Because these diseases have very few patients, data on them is relatively scarce. In these situations, synthetic health data can supplement the real data collected by scientists, who can then create control groups for important clinical trials. This application is similar to using synthetic data for machine learning, but the results are more focused on specific rare diseases.
Another benefit of synthetic data in healthcare is that it is reproducible. Reproducibility is critical when conducting experiments as part of the typical scientific method. However, reproducing patient data can be difficult or impractical, particularly where patient privacy is involved. In these situations, it is beneficial to be able to produce additional datasets.
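One way to see this reproducibility: when the generator and its random seed are fixed, the same synthetic dataset can be regenerated on demand, and a new seed yields an additional dataset from the same model. The sketch below assumes a simple Gaussian generator with made-up parameters; the function name and all values are hypothetical.

```python
import numpy as np

def generate(mean, cov, n_rows, seed):
    """Sample a synthetic table; the seed fully determines the output."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Made-up model parameters for two columns (age, systolic blood pressure).
mean = np.array([45.0, 125.0])
cov = np.array([[150.0, 90.0], [90.0, 110.0]])

run_a = generate(mean, cov, n_rows=50, seed=2024)
run_b = generate(mean, cov, n_rows=50, seed=2024)  # same seed: same dataset
run_c = generate(mean, cov, n_rows=50, seed=7)     # new seed: a fresh dataset

print(np.array_equal(run_a, run_b), np.array_equal(run_a, run_c))
```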
Government agencies have also been using synthetic data. The Office of the National Coordinator for Health Information Technology (ONC) has an open-source project focused on creating superior synthetic data that can facilitate scientific research. The project is focused on producing high-quality synthetic data related to pediatrics, opioid addiction, and other complex healthcare situations.
Problems with Using Synthetic Data
Synthetic data does have limitations when used in the healthcare space.
First and foremost, it may not be as useful as real data. The quality of clinical synthetic data is highly dependent on the quality of the training data and the data synthesis system. A 2017 MIT study on the quality of synthetic data involved two groups of data scientists conducting an analysis: a control group using real data and an experimental group using synthetic data. The study team found that the experimental group was only able to match the control group results with 70 percent accuracy, which may not be acceptable in some situations.
Another problem with synthetic clinical data is the potential to omit outliers that would otherwise appear in a real dataset. Neural networks used to generate data tend to be poor at producing unusual-but-possible data points. This matters because outliers can often be more informative than typical data points.
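A quick way to check whether a synthetic dataset has lost its outliers is to compare how much of each dataset falls beyond a high quantile of the real data. The sketch below is a minimal illustration for a single numeric column; the function names and the lab values are hypothetical.

```python
import numpy as np

def tail_share(values, threshold):
    """Fraction of values strictly above the threshold."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values > threshold))

def tail_coverage(real, synthetic, quantile=0.95):
    """Compare how often real and synthetic values exceed a real-data quantile."""
    threshold = float(np.quantile(real, quantile))
    return threshold, tail_share(real, threshold), tail_share(synthetic, threshold)

# Made-up lab values: the real data contains one extreme reading (100).
real = list(range(1, 20)) + [100]
# A synthetic sample that covers the typical range but has no extreme values.
synthetic = list(range(1, 21))

threshold, real_share, synth_share = tail_coverage(real, synthetic)
print(f"threshold={threshold:.2f} real={real_share:.2f} synthetic={synth_share:.2f}")
```

A synthetic share far below the real share suggests the generator is smoothing away the unusual cases.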
Preserving outliers is desirable for many use cases, but passing them from a “real data” training set to a synthetic dataset can create privacy concerns. If the training dataset of patient information holds outliers that are passed through into synthetic data by a neural network system, these distinct data points could potentially be used to identify individual patients.
Additionally, because neural network systems that produce synthetic data must be trained on real private data, they are an attractive target for cyberattacks. If a hacker can access the data production system, they may be able to reverse engineer private data. While some synthetic data systems use extremely restricted access to prevent this kind of attack, the risk is impossible to eliminate completely.
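One common heuristic for spotting this kind of leakage is to measure, for each synthetic row, the distance to its closest real row and flag rows that sit suspiciously close. The sketch below assumes numeric features on comparable scales (real data would normally be standardized first); the function names, the distance threshold, and the sample rows are hypothetical.

```python
import numpy as np

def distance_to_closest_record(real, synthetic):
    """For each synthetic row, the Euclidean distance to and index of its nearest real row."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1), dists.argmin(axis=1)

def flag_risky_rows(real, synthetic, min_distance):
    """List (synthetic_index, real_index) pairs closer than min_distance."""
    nearest, idx = distance_to_closest_record(real, synthetic)
    return [(int(i), int(idx[i])) for i in np.flatnonzero(nearest < min_distance)]

# Made-up rows (age, systolic blood pressure); the last real row is an outlier.
real = np.array([[50.0, 120.0], [52.0, 122.0], [95.0, 210.0]])
synthetic = np.array([[51.0, 121.0], [94.9, 210.1], [60.0, 130.0]])

risky = flag_risky_rows(real, synthetic, min_distance=0.5)
print(risky)
```

Here the second synthetic row nearly reproduces the real outlier, so it is flagged. Because outliers have few neighbors, a near copy of one is much easier to trace back to a single patient than a near copy of a typical record.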
TripleBlind’s privacy-enhancing solution addresses many of the shortcomings of synthetic data.
- Quality is maintained. Our solution allows for data to be kept in its original form. This means outliers are not lost in translation.
- Better AI/ML modeling and better analysis. In leveraging superior privacy, data partners can alleviate compliance concerns, and this opens up access to even more data than would be otherwise available.
- Avoids unauthorized use. When a data holder uses a third party to generate synthetic data, it must turn over sensitive data to that third party and this opens the door to unauthorized use. With TripleBlind’s privacy solution, data holders never have to turn over their sensitive data.
If your company is currently considering the use of synthetic data, contact us today to find out how our next-generation approach to privacy technology compares.