Privacy Enhancing Technologies for Data Analytics and ML
If your work includes machine learning or analytics, you’re likely facing serious data challenges. When we speak with C-Suite leaders, compliance officers, data scientists, and even with cloud architects, there are three major themes around data that come up most often:
- Data Access
- Data Prep
- Data Bias
On top of this, there’s a host of compliance requirements that they need to meet.
But what if emerging Privacy-Enhancing Technologies (PETs) could reshape and catalyze your organization’s data-based innovations? PETs are designed to leverage the most valuable parts of your data — to unlock its full potential — without creating privacy or security risks. But conventional PETs are unfortunately not always enough.
This article will help you understand the major data challenges facing healthcare and finserv professionals in 2022, and familiarize you with the strengths and drawbacks of how each privacy approach addresses these challenges.
The 3 biggest challenges for healthcare and finserv professionals in 2022
Challenge #1: Data Access
Data access, which we translate to data interoperability in healthcare, is the number one bottleneck keeping health insurance providers, hospital systems, and regulatory agencies from delivering useful analytics.
Typically, if you’re working in something like a hospital system, you’re going to have multiple electronic health record (EHR) systems. Each wing or department could be using a different one. Clinicians might be using something like Epic for their EHRs, whereas a wing focused on behavioral health might be using an entirely different tool to collect behavioral health questionnaires and related information.
This leads to organizations using a ton of different EHR systems that don’t talk to each other. For instance, the State of California needed a system to handle Medicare, Medicaid, and Tricare claims, and then aggregate all that information in a third-party cloud environment. This type of consolidation takes a lot of extra labor, and causes huge bottlenecks within the data access space for healthcare agencies.
Maybe you’re not a giant conglomerate like the California government, or a major hospital insurance system. But you might have a local business or a local hospital where you have different cloud environments set up, or you might have different EHR systems set up, where data’s coming in from multiple sources.
You could try to create a central place to put all this data, but moving data is expensive, and it can take the lion’s share of the budget for a data initiative. But what if you didn’t have to move the data from where it is? More on that later.
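In the meantime, here is a rough sketch of the general idea of leaving data where it lives. Each site computes a small summary locally, and only those summaries are combined. The `Site` class, the field names, and the numbers are hypothetical, and this is a generic federated-aggregation pattern rather than a description of any particular product. Note that summaries can still leak information on their own, so this is a starting point and not a privacy guarantee.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Site:
    """Hypothetical data holder, e.g. one hospital wing with its own EHR system."""
    name: str
    claim_amounts: List[float]  # raw rows stay inside the site

    def local_summary(self) -> Tuple[float, int]:
        # Only the aggregate (sum, count) leaves the site, never the rows.
        return sum(self.claim_amounts), len(self.claim_amounts)


def federated_mean(sites: List[Site]) -> float:
    """Combine per-site summaries into a global mean without pooling raw rows."""
    total, count = 0.0, 0
    for site in sites:
        site_sum, site_count = site.local_summary()
        total += site_sum
        count += site_count
    if count == 0:
        raise ValueError("no records at any site")
    return total / count


sites = [
    Site("clinical", [120.0, 80.0, 100.0]),
    Site("behavioral_health", [60.0, 40.0]),
]
print(federated_mean(sites))  # 80.0
```

The coordinator only ever sees two numbers per site, which is why this pattern avoids a central copy of the data.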
Challenge #2: Data Prep
Data prep is a largely manual, resource-intensive task, so it requires a lot of skilled labor. You need somebody who has knowledge of data: how to wrangle it, how to “munge” it, and how to scale the dataset effectively.
Working with data can be really challenging because it has all kinds of different shapes and structures. As soon as you start adding external entities, like other businesses, it becomes a really complicated endeavor, and this kills the majority of projects before they even get off the ground.
Challenge #3: Data Bias
Data bias is a huge issue for anyone working in analytics. It’s also an issue where it’s incredibly important to address problems early on. Small problems early on can grow much bigger, and the costs of a biased data set can scale with your business.
This is something you’ll see when you’re working with various data sets that haven’t been vetted properly. If you are constrained to using only the data you have in-house, or if you’re using one of many commonly used example data sets for training your models (and the data scientist hasn’t excluded the problem columns), you’ll end up with inherent biases. These biases could cause problems that require increasingly more substantial fixes down the line.
The 3 major use cases for Privacy Enhancing Technology
Privacy Enhancing Technologies are a relatively new category. Many of the underlying tools have been used within the space since the early days, but their business applications are especially new. Gartner has identified 3 major use cases for PETs that the market is finding most important right now.
Use case #1: AI model training and sharing models with third parties
If you’re training an AI model, you want to get access to the best possible data out there, but you may not already have that. You may need to look to third parties for that data. You also want your data to be unbiased, because you don’t want the model to become biased. So you need to source from multiple locations.
Those locations may be subject to different data laws, residency restrictions, and competitive pressures, and business reasons may also keep you from getting access to that data. Privacy Enhancing Technologies aim to address this problem.
Use case #2: Usage of public cloud platforms amid data residency restrictions
As we think about pushing more data and more information to the cloud, lots of resources become shared. That means companies are trying to figure out how to create scenarios where they can use sensitive data to its best utility, without adding the risk and liability that they’re going to expose any sensitive information.
So some Privacy Enhancing Technologies (we’ll go through these all shortly) are addressing this problem: how do we store data in an encrypted state and actually operate on it without decrypting it and without moving it? How do we add trust in a kind of “trustless” or “semi-trusted” environment?
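To make the idea of computing on data that nobody can read a little more concrete, here is a toy sketch of additive secret sharing, one of the building blocks behind secure multiparty computation. It is an illustration only: strictly speaking it is secret sharing rather than encryption, and the modulus, function names, and example values are assumptions chosen for readability, not how any particular product works.

```python
import secrets
from typing import List

PRIME = 2**61 - 1  # modulus for the toy arithmetic; an illustrative choice


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n random shares that sum to `value` modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recombine shares; only someone holding all of them learns the value."""
    return sum(shares) % PRIME


# Two data owners each split a private count among three compute parties.
a_shares = share(1200, 3)
b_shares = share(345, 3)

# Each party adds the two shares it holds. A single share is a uniformly
# random number, so no single party learns 1200 or 345.
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]

print(reconstruct(sum_shares))  # 1545
```

The parties computed a sum together, yet the inputs were never assembled in one place. Real protocols add a lot on top of this (multiplication, protection against misbehaving parties), which is where the performance and complexity trade-offs come from.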
Use case #3: Internal, external, and business intelligence activities
This is a very broad term that covers sharing and using data, running analytics, and getting insights from various data sources. It touches on data access, prep, and bias (all three challenges we mentioned above).
MITRE, which operates federally funded R&D centers (and provides a lot of expertise on data and security), says:
“The most valuable insights come from applying highly valuable analytics to shared data across multiple organizations, which increases the risk of exposing private information or algorithms. This three-way bind – balancing the individual needs of privacy, the analyst’s needs of generating insight and the inventor’s needs of protecting analytics – has been hard to balance…”
So how do we better operationalize data within an organization? When do we identify opportunities to bring in third-party data or to leverage our first-party data with third parties for potential partnerships, additional revenue, or other opportunities?
From an analyst’s perspective, how do we generate insight, and what are the issues around that? This has to be balanced of course with the individual needs of privacy, while also protecting the algorithms. That’s where PETs come into play.
Note: MITRE has independently done an extremely thorough and exhaustive review of TripleBlind’s technology. You can get the public report for that on our website.
The 7 major types of Privacy Enhancing Technology
We have resources about the major forms of PET on our page about Competing Privacy Enhancing Technologies, which includes definitions and links to full explanations of each technology. If you’re unfamiliar with the strengths and drawbacks of each, we suggest reading that.
- Differential privacy
- Federated learning
- Homomorphic encryption
- Secure enclaves
- Secure Multiparty Computation (SMPC)
- Synthetic data
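To give a feel for what one of these looks like in practice, here is a minimal sketch of differential privacy using the Laplace mechanism on a simple count. The function names and the `epsilon` value are illustrative assumptions. It also samples noise with ordinary floating-point arithmetic, which is fine for building intuition but is not what a hardened production library would do.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float,
                  rng: Optional[random.Random] = None) -> float:
    """Release a count with Laplace noise.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Smaller epsilon means more noise: stronger privacy, less accuracy.
print(private_count(100, epsilon=1.0))
print(private_count(100, epsilon=0.1))
```

This is the utility-versus-privacy trade-off in miniature: the answer is deliberately a little wrong so that no individual record can be inferred from it.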
How useful is each type of PET for common use cases?
Each of these PETs offers something a little different, so some of them are better than others at addressing the major use cases Gartner described. In the chart here, you can see how well each technology holds up in a comparison.
The 11 major factors good privacy technology should address
Let’s take a deeper look at how all of these PETs stack up, compared across the 11 most important factors. The chart below measures the extent to which each PET meets key criteria, and below the chart you can get more context about what each criterion refers to.
As you can see in the chart above, TripleBlind has a lot of green checked boxes. That’s because TripleBlind is designed specifically for these needs: to fill the gaps and address the red circles left by other techniques.
This is why we call ourselves the most complete and scalable solution for privacy enhancing technology, because we aim to take a holistic view when solving these problems.
Let’s take a quick look at each of these factors so you understand how well each PET fits with your organization’s use cases:
- Degree of privacy: some of the considerations here are, “is there a decryption key, is the data being moved, and how much raw data is seen by the end user?”
- Ability to operate at scale means “how easy is this to scale horizontally into different business problems, but also vertically to really scale within an organization?”
- Types of data: we want to work on more than just tabular data. This is really important. We want to work with image data, genomics, large files, voice data, and more, and to keep all of that data private while computing on it.
- Speed is really important. Data expires really quickly in a lot of spaces, so the faster we can use it and the less burden we add to the process, the better.
- Supporting training new AI and ML models: not every solution offers this, as it’s pretty unique. Being able to leverage data from multiple locations to train an AI model is something we really wanted to provide.
- Digital rights is also a very unique aspect of our solution, because we’re able to allow our customers to permission how and why their data is used, and how often.
- Algorithm encryption: increasingly, there is intellectual property wrapped up in some of the models and algorithms that people are developing. We want to actually protect algorithms and usage as well.
- Compliance with laws like GDPR and HIPAA is a baseline requirement for any PET.
- Eliminate masking, synthetic data, hashing, and accuracy reduction: the goal is to preserve the full fidelity of data and avoid having to make a trade-off between utility and privacy. We aim to maximize both.
- Hardware dependencies really slow down data usage when everything is being virtualized. Why should we design a solution that requires specific hardware?
- Interoperability with third parties. Like we said, we want this to scale both within an organization and externally. How easy is it for your data partners to get up and running?
The TripleBlind Solution
At TripleBlind, we use the terms complete and scalable. What this means is we’re addressing these problems from multiple angles, and we’re providing the best privacy solution for any given scenario. We do that by leveraging some of the novel advancements that TripleBlind cryptographers and engineers have made on top of existing solutions.
Most of the time, companies will move data and run analytics. But this is costly, labor intensive, and riddled with data access issues. Instead, we one-way encrypt the data and run the AI or analytics using resources within the firewalls of trusted parties only. That way, you don’t have to worry about piping the data anywhere, and you skip all that plumbing and the mess of cobbled systems. And that’s really the core value of what TripleBlind is doing.
In a nutshell, TripleBlind lets you collaborate with data privately behind your firewall, right where it’s generated. One of the advantages of the TripleBlind solution is that you can run studies on this data and train models on this data, without moving it from its source.