Frank LaDonna (00:04):
All right, guys. We’ll get started. Thank you very much for joining TripleBlind Live, our webinar here. My name’s Frank LaDonna. I’m the Sales and Marketing Coordinator for TripleBlind. I’ll introduce some of the panelists that we have today. So we have Chad Lagomarsino, he’s our Partnership Engineer. David Almeida, he’s our Senior Customer Success Manager, and Alex Koszycki, sorry I butchered that, he’s our Product Manager. So I’ll let them kind of give a short bio and introduce themselves.
Chad Lagomarsino (00:38):
Hey, everybody. So my name is Chad. I’m an engineer here at TripleBlind. I’m going to be guiding you today through two different demos, just giving a little overview of the product and what we can do here at TripleBlind. We are going to have a Q and A session at the end of each demo part. So I’ll ask that you wait until the end of that session to ask questions. You can just type them into the chat here, there’s a little Q and A button. So if you have any questions as you’re going through, feel free to type it in Q and A and we’ll come back to that as we finish each portion of the demo.
David Almeida (01:19):
Hello, everyone. I’m David. I’m the Customer Success Manager with TripleBlind. I’ve spent the last 10 years in software development in the health and wealth industries. And so I bring that experience to bear supporting our clients through implementing our solution and through ongoing relationship management.
Alex Koszycki (01:38):
Morning, everybody. My name is Alex Koszycki. I’m the Product Manager here. I figure out what we’ll be building next and work with our engineering team to make that vision come to fruition. So I have some experience with business intelligence tools and analytics software companies. I’m bringing that to TripleBlind and helping us make it ever more user-friendly and delightful. But I’ll hand it back over to Chad.
Chad Lagomarsino (02:18):
Thanks, Alex. My personal background is with healthcare analytics, specifically making data models with insurance payers and providers all throughout the US, both on the enterprise level, as well as with some government agencies. So we will be going over two portions of demos today. The first is going to show our user interface from our web application UI. The second is going to dig into our SDK a little bit, so more of our developers’ interface, and we have two interfaces that interconnect with each other because we want to make it as flexible as possible for different types of users to use the platform effectively. Part one is going to focus more on a user, for example, who is still technically sophisticated but maybe they don’t need to go into depth into making a custom query. Maybe they just need to do something more simple such as look at some assets and then be able to run a join operation, for example.
Chad Lagomarsino (03:22):
So we have some features built into our UI to do that quickly. And the second portion is going to focus a little bit more on getting in-depth with how you can actually run a particular model that you’ve architected and brought to TripleBlind. So this first page here says part one, we’re going to focus on the user interface. So if I’m coming to TripleBlind and I own data: understanding what TripleBlind has access to, understanding how you can present your data to other users of the platform without exposing raw data, understanding how to edit what data appears through mock data, and then also managing Access Requests. That’s going to make up the first part of this demo, then we’ll have a little break for Q and A. Part two: I’m going to dig into the SDK and I’m going to focus more on model building. So I’m going to work as an analytics firm that does not own a data set, but they have built a model and they want to run training and inference on that model, using data that they do not own.
Chad Lagomarsino (04:37):
So we’ll be going over that in part two. And again, following up with more Q and A to wrap things up. We’re going to be using two different browsers for this demonstration. The data owner in this case is a health insurance payer. That means they have claims data on patients. That claims data contains PII, personal identifying information that we do not want exposed to the general public, but there’s also a lot of useful columns and predictors in there for analytics. That user is going to be on our Chrome browser.
Chad Lagomarsino (05:18):
Our data user is a health analytics firm that is being contracted. They’re working with the health insurance company. So they are coming to TripleBlind with a model and they’re using TripleBlind to train and infer their model on data they do not own. An overview of what’s going on here. You can imagine a situation where perhaps you have a health insurance payer, maybe something connected to hospital systems such as the Mayo Clinic, and your analytics firm is connected to a pharmacy such as Walgreens or CVS. The first section, we’re going to be showing Blind Join. The second section, we’re going to be showing linear regression.
Frank LaDonna (06:12):
All right, thank you, Chad. And I’ll hand over the screen share to you to kick us off with our demo here.
Chad Lagomarsino (06:26):
Pop into our web UI. And again, please feel free if you have any questions as we’re going through this demo, put them in the Q and A and we will get back to them at the midway point of our demo. The interface that I have in front of you is the TripleBlind UI. This is a web application being accessed by a user of an organization that is behind the firewall of the user of TripleBlind. So in this case, I’m working as organization one, which is my healthcare payer. This is the insurance company, they’re coming to TripleBlind with a set of data. They have on their local machines or database store, excuse me, a database behind their firewall, a series of claims information on patients. They don’t want to publicly expose all of that claims data. They cannot legally share that claims data because of HIPAA, because of other privacy acts like CCPA, for example.
Chad Lagomarsino (07:37):
So they need to keep that on-site. However, with TripleBlind, they are able to access that data for using models. Key thing I want everyone to recognize with TripleBlind is that with TripleBlind you’re not sharing the data. You’re only sharing the model output. So, you can run models on data using TripleBlind without actually exposing that raw data. You only return the output of whatever model or operation you’re running on that data. The first screen we have here is an example of a data asset. This data asset is a data set that is on a local machine or database behind the firewall of the insurance company. And what you’re seeing in front of you is a display of the index for this data set. What’s going on is a user for the insurance company has positioned this data set on TripleBlind.
Chad Lagomarsino (08:44):
That means that our TripleBlind router is able to point to where this data set is sitting on the local directory of a computer behind the firewall of our insurance company and they’re able to generate some basic reports on that data. We’re able to generate mock data as well as, for example, an EDA (exploratory data analysis) report, which just gives us an idea of the shape of the data. This means that TripleBlind does not access the raw data at any point, it’s just using the raw data as a reference for generating a display view of this data using mock data. So this first tab that you see here, this contains our column names and an example of the shape of the data. One thing to note is that this is fictitious information. This is not the raw data rows itself, but what you are seeing is a representation of how the data within each column is distributed.
Chad Lagomarsino (09:56):
So one column, the values of one column’s data will attempt to mirror the distribution of that data for that particular column. Now, this tab is really just designed to give you an idea of what is in the data set and what you might want to dig into a little bit deeper. If you’re thinking about how a data scientist typically runs a data science pipeline, the first steps are they’re going to visualize the data. They’re going to get columns. They’re going to understand data types. They’re going to try to understand distributions of the data. We do that here through our data profiling tab. This is going to display some metadata about the data table itself as well as some characteristics of the variables in our data set. Here if I scroll down a little bit, I’ll be able to see some more specific information about our variables in aggregate.
Chad Lagomarsino (10:58):
You can understand what the count is. I can understand some basic descriptive statistics such as the mean, maximum, minimum. These are the same types of operations that a data scientist would typically do in something like a Python package, like Scikit-learn. They might run R or pandas. They might run pandas.head, pandas.shape, to try and get a better idea of what those categories and data tables look like. If I hit toggle details here, I can go into a more in-depth view of statistics. I can look at an expanded histogram and critically because this is a health insurance claims data set, I might be interested in things such as extreme values to see if I have outliers.
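For readers who want to see what those first-look steps amount to, here is a minimal sketch in plain pandas. It does not use the TripleBlind SDK; the rows are made-up stand-ins for mock data, and the column names (`age`, `sex`, `smoker`, `charges`) are assumptions chosen to resemble the fields mentioned in the demo.

```python
import pandas as pd

# Synthetic rows standing in for mock data; values and column names are assumptions.
mock = pd.DataFrame({
    "age": [19, 18, 45, 52, 33, 61],
    "sex": ["male", "female", "female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [16884.92, 1725.55, 8240.59, 36950.26, 4449.46, 13228.85],
})

print(mock.head(3))   # first few rows
print(mock.shape)     # (number of rows, number of columns)
print(mock.dtypes)    # data type of each column

# Count, mean, standard deviation, min, quartiles and max for one numeric column.
summary = mock["age"].describe()
print(summary)
```

The `describe()` output covers the same count, mean, minimum and maximum figures that the data profiling tab is described as showing.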
Chad Lagomarsino (11:44):
For example, having a lot of 18, 19-year-olds in a health insurance data set has a pretty good chance of influencing any models that I’m going to use on this dataset. I might skew my dataset quite a bit. On that note, we also have the ability if you scroll down to the bottom here to quickly see a correlation matrix, and that’s going to give you a little bit more insight as you’re considering how you might want to model with the data set as to which variables to include, which ones to omit.
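As a rough illustration of the two checks mentioned here, the sketch below computes the share of very young patients and a correlation matrix with NumPy. The numbers are invented for the example, and the `bmi` column is an assumption that is not named in the demo.

```python
import numpy as np

# Invented values for illustration only; `bmi` is an assumed column.
age = np.array([19, 18, 45, 52, 33, 61], dtype=float)
bmi = np.array([27.9, 33.8, 25.7, 30.2, 22.7, 29.1])
charges = np.array([16884.92, 1725.55, 8240.59, 36950.26, 4449.46, 13228.85])

# Extreme-value check: what fraction of records are under 20 years old?
young_share = float((age < 20).mean())
print(f"share under 20: {young_share:.2f}")

# Correlation matrix: each row of the stacked array is one variable.
corr = np.corrcoef(np.vstack([age, bmi, charges]))
print(np.round(corr, 2))
```

A large `young_share` would be a cue to look at how that age group skews a model, and the off-diagonal entries of `corr` give a first hint about which variables to include or omit.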
Chad Lagomarsino (12:21):
That’s a quick overview of the EDA report in the UI. You’re able to do these operations through our Python or our SDK as well. So the programming interface allows you to do that, but the UI is also a quick way just to get an insight into how a data set may be useful. Of course, as the owner of this data set, you have the ability to choose what is displayed whether this indexed data set is public or not. If you want to show actual values for certain columns and your organization has given you the administrative rights to do so, you are able to do that. Let me go ahead and show a little bit more about how this mock data is edited. So I’m going to go to the top right hand of the screen here. And because I own this data set, I can go to my edit mock data tab.
Chad Lagomarsino (13:20):
The edit mock data tab is going to display all of the columns and their masking behavior. If you look at the right-hand side, the columns that have this little masking field here, this little bandit-looking mask, little raccoon guy here, that is going to be a column that is being displayed through mock data. If it does not have that, it’s actually showing raw data values. So in this case, we’ve chosen to show about half of our data as mock data and half of the data as raw data values. You could think of the justification for that as “I have a data set that contains PII. I want that to be masked and maybe I have other columns in that data set that my organization is comfortable with sharing legally.” So then, they can reveal those data columns. You’re also able to edit the type of masking that you have. For example, here I have random. So it’s just going to present random numbers or strings as my masking.
Chad Lagomarsino (14:29):
Here under the name column, I’m going to look at a database of names and pull names out for the specific values of the name column. Critically with that style of masking, the same name will show up as the same masked name as well using this type of masking. However, if you do the random masking type, it’ll be entirely random. Now, do you have any questions about mock data? It’s definitely something that we usually spend a good amount of time discussing. So if you have any questions feel free to put them in the Q and A chat, we will get back to those questions in just a moment. As the data owner, I’m also able to go in and manage, meaning I can edit a few characteristics about how this data set is being displayed on the new TripleBlind exchange here. You’re able to see, for example, the name of the data set, the description of the data set, what it’s saved as. Critically, here you can do agreements.
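The difference between the two masking styles can be sketched in a few lines of Python. This is not TripleBlind's implementation: the function names, the tiny `FAKE_NAMES` pool and the hashing approach are all assumptions, used only to show "random" masking next to masking where the same input always yields the same masked value.

```python
import hashlib
import random
import string

# Hypothetical pool of replacement names; a real name database would be far larger.
FAKE_NAMES = ["Avery Stone", "Jordan Blake", "Riley Chen", "Morgan Diaz", "Casey Patel"]

def mask_random(value: str, rng: random.Random, length: int = 8) -> str:
    """Random masking: the output has no relationship to the input value."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def mask_name(value: str) -> str:
    """Consistent masking: the same input always maps to the same fake name."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]

rng = random.Random()
print(mask_random("Alice Smith", rng), mask_random("Alice Smith", rng))  # two unrelated strings
print(mask_name("Alice Smith"), mask_name("Alice Smith"))                # the same fake name twice
```

With a pool this small, different real names can collide on the same fake name; the point of the sketch is only the repeatability of the mapping.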
Chad Lagomarsino (15:50):
So every operation in TripleBlind by default, when you kick off an operation, you get a new set of encrypted data. So the data’s encrypted at runtime for every operation and you need to approve each and every operation for every other organization that wants to access data they do not own. So if, for example, an analytics firm wants to use this data set, they need to run through an approval request. If I am interested in working with an organization that I’ve talked to beforehand, I can set up an automatic approval based on the organization name, based on the actual organization operation type, or even based on the users for that organization.
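One way to picture the automatic-approval rules described here is as a small matching function. The sketch below is purely illustrative: `Agreement`, `AccessRequest`, `is_auto_approved` and the operation strings are hypothetical names, not part of the TripleBlind product or SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class Agreement:
    """Hypothetical auto-approval rule; None means 'any'."""
    organization: str
    operation: Optional[str] = None
    user: Optional[str] = None

@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical incoming request from another organization."""
    organization: str
    operation: str
    user: str

def is_auto_approved(request: AccessRequest, agreements: Iterable[Agreement]) -> bool:
    """Return True if any agreement covers the request; otherwise it needs manual approval."""
    for agreement in agreements:
        if agreement.organization != request.organization:
            continue
        if agreement.operation is not None and agreement.operation != request.operation:
            continue
        if agreement.user is not None and agreement.user != request.user:
            continue
        return True
    return False

agreements = [Agreement(organization="Globex", operation="blind_sample")]
print(is_auto_approved(AccessRequest("Globex", "blind_sample", "analyst"), agreements))
print(is_auto_approved(AccessRequest("Globex", "blind_join", "analyst"), agreements))
```

Under these assumed rules, a request that no agreement covers falls through to the manual Access Requests queue.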
Chad Lagomarsino (16:41):
So in this case here, I’ve set up a couple of things that you will be seeing later in this demo, a blind sample with another organization, Globex, that’s going to be our analytics firm which I will be jumping into shortly, but I’ve already set those up. So those particular operations will be automatic. That way you see a little bit of automatic approvals as well as the manual approval process. Speaking of which, let’s say I’m interested in using this data set for something. If I want to augment this data set, I can look for another data set on the TripleBlind exchange that I do not own that contains a common key with this data set. For example, I may be interested in seeing if I have any data sets that match on name. If I look at my explore assets tab, let’s say I’m working with an analytics firm, I know they have information on cigarette purchases. This analytics firm is partnered with the pharmacy.
Chad Lagomarsino (17:50):
I can look for a keyword, for example, cigarettes purchases that is owned by another organization. I can click in on that and I’ll be able to see whatever they’ve chosen to display for this data set. Now here critically, you can see that price, date, and name, these fields are masked. Item is not masked. So they are revealing specific items, so that brand of cigarettes. If I’m interested in running a quick inner join operation and seeing where name is matched between both data sets, which brands are being used for cigarette purchases, I’m able to do that by going to the create new process tab. Here, we have a list of processes that we have built into the UI. Of course, the SDK is a much more expansive feature set, but these are just quick visualization tools that we’re using. So for a Blind Join here, I’m going to do what’s effectively a SQL query, an inner join. Go ahead and continue. And then I’m going to look for the two different data sets on our index.
Chad Lagomarsino (19:15):
First, I will use the insurance claims data we were looking at just a moment ago. Next, I will use this cigarette purchases data. I will hit continue. I’m going to match on name. We have fuzzy matching enabled. So I’m going to go ahead and set that to a certain percentage. Fuzzy matching is a bit challenging to describe quantitatively. We have a little widget here that basically displays how the degree to which you are fuzzy matching would affect things like casing, using hyphens versus commas, et cetera. I’m going to return the item. I want a list of the cigarette brands that are being used and the quantity being used in my data set. Wait in line on this operation. Now critically here, you’re going to see this process is awaiting permissions with Globex. I do not have permission to run a Blind Join on this dataset automatically. So now I’m going to briefly switch into another browser and that browser is going to display the analytics firm that I’m working with, Globex. Here I’m logged in as my analytics firm, Globex, you can see down here.
Chad Lagomarsino (20:54):
I’m going to go to the Access Requests tab and this is going to give me a display of the exact query that the other organization is requesting. So here I can see Blind Join. I’m going to do a join between this data set that they own, and this data set that I own. And it’s going to expose name equals name as a key and it’ll expose item. So if I agree with this, I can hit approve then I’m going to switch back into my insurance payers account, and I will be able to see my completed job here now that I’ve had the approval. Jump into this briefly, you’ll be able to see what it returned. If I hit retrieve we are going to get a local CSV copy of the output of my model. This will return actual data if the user has chosen to make this data row unmasked. If the data row was masked, it will show a mock data representation of the data. You’ll be able to see this is where the two tables intersected. The specific rows that are displayed.
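The logic of the query being approved here (match on name with some tolerance, return only the item) can be sketched locally in Python. This is an ordinary in-memory join, not a Blind Join: both tables sit in one process, which is exactly what the product avoids. The sample rows, the `normalize` rule and the 0.9 threshold are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Invented sample rows; in the demo these tables belong to two different organizations.
claims = [{"name": "Mary-Ann Lee"}, {"name": "John Ortiz"}, {"name": "Priya Nair"}]
purchases = [
    {"name": "mary ann lee", "item": "Brand A"},
    {"name": "JOHN ORTIZ", "item": "Brand B"},
    {"name": "Tom Wu", "item": "Brand C"},
]

def normalize(name: str) -> str:
    """Lowercase, treat hyphens and commas as spaces, and collapse whitespace."""
    cleaned = re.sub(r"[-,]", " ", name.lower())
    return " ".join(cleaned.split())

def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_inner_join(left, right, key, threshold, return_field):
    """Inner join on a fuzzy key match, returning only one field from the right table."""
    return [
        r[return_field]
        for l in left
        for r in right
        if similarity(l[key], r[key]) >= threshold
    ]

items = fuzzy_inner_join(claims, purchases, key="name", threshold=0.9, return_field="item")
print(items)
```

Only the `item` values for matching names come back; the names themselves stay out of the result, mirroring the choice made in the demo.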
Chad Lagomarsino (22:29):
You could also return information from the table that you own, for example, name, to get an individual record. But in this case, we wanted to keep that private. So we are just returning the overall list of brands used, might be able to use that in something like an analytics model to get a better understanding of which brands, which cigarette brands are being commonly used by our patients. All right. And that wraps up the first portion of the demo. So I really want to break down some of our Q and A questions here, see what we can answer, and then we’ll move into the second portion which is going to be about using our SDK to train a model on data you don’t own.
Chad Lagomarsino (23:18):
Let’s pop in here. So we don’t have any open questions through the Q and A, but if anybody has any questions that they want to address, we can definitely take a look. Let me make sure that there’s nothing here in the chat. All right. So it says, “Trying to understand that the main function of TripleBlind is to allow data analytics from different data sets of different organizations.” Yes. So TripleBlind allows a… Let’s say you come to TripleBlind with a model that you’ve created, a linear regression model, and you have this model you’ve already built. You want to train it and run inference on health insurance information. And you do not own that information you are trying to work with, for example, Mayo Clinic or CVS Pharmacies. And they have information that is behind their firewall.
Chad Lagomarsino (24:23):
They legally can’t share it with you unless they have a long, extensive process of a legal agreement, getting lawyers involved, a three to six-month lead time at least. What this allows you to do is you bring your model to TripleBlind and you can run your training and inference steps on that analytics model without actually needing to go through the process of putting that data somewhere else, of storing it somewhere, of going through the legal process of an access request. This allows you to basically build the plumbing to pipe data into your analytics model. If you have any other questions after that, feel free to type a response into the Q and A.
Alex Koszycki (25:14):
There’s another question that came across about masking. And the question is, “If the owner were to unmask some fields, would a user be able to use the unmasked fields to re-identify that individual record?”
Chad Lagomarsino (25:37):
So there’s a couple of safeguards we have in place for that. One of which is something called K-grouping. Effectively, what that means is that for operations in TripleBlind, you will not be allowed to use an operation in TripleBlind on a column that does not have a certain number of return values, meaning that you cannot isolate an individual value as long as you set your K grouping to higher than one, for example. So every operation in TripleBlind has a minimum number of return values which is designed to help protect the anonymity of the individuals that are being returned as a response.
Alex Koszycki (26:33):
Yes, that’s one of our safeguards that we’re enabling in a lot of operations. Another factor about unmasking a specific field, it will sample randomly from that field. So assuming it’s not a sensitive data set, you won’t be able to match it up line-by-line to what that record was identifying on the individual level. I believe that there’s another question here. It says, “Can you elaborate more on the security side of TripleBlind besides the data are secured from different companies?” So I suppose it’s as the question is asking, “How do we securely enable collaboration on analytics and models?”
Chad Lagomarsino (27:25):
Right. So both data and models, they never move from where they were built. The data and the model will always stay behind the firewall of the organization that owns that data or model, meaning that at rest security is being fulfilled because you will have the same security features as the rest of the data and model security that you have in your organization. The critical thing is in a conventional method of sharing data, you typically create some kind of third-party secure enclave or just a cloud space where you pull data together from multiple organizations which is a hotspot for attempted breaches. It makes it very easy for somebody who’s a dedicated adversary to break into and extract information from all of the organizations involved.
Chad Lagomarsino (28:29):
And regardless of how practical that is, that means that there’s a lot of legal barriers that come into play before you’re able to get any kind of project off the ground with that information. So at rest security is achieved because you’re behind the firewall of the data user, excuse me, the data owner, the model owner. The protocol that we use for actually transferring the model output so our network security is based on HTTPS and then our algorithm security, whenever we’re running kind of an operation, let’s say you have something like a SQL query you’re trying to run on a data set you don’t own, effectively what’s going to happen is the TripleBlind router will receive a request to run a job, in this case, a SQL query, that request is going to be translated using a security protocol called SMPC that is secure multi-party compute.
Chad Lagomarsino (29:37):
Effectively what it does is both parties have a little bit of information, but neither have enough information to reconstruct the data that’s being sent over. So it’s going to use some information from both of the machines involved in an operation and get you your aggregated output. There’s definitely a lot we could talk about with SMPC and I highly recommend reading up about that. We have a lot of information on SMPC that we could follow up with and send.
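The idea that each party holds a piece that is useless on its own can be illustrated with additive secret sharing, a textbook building block of secure multi-party computation. This sketch is a generic example and makes no claim about the protocols TripleBlind actually uses; the modulus and the two private totals are arbitrary choices.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary large modulus chosen for the example

def share(value: int, n_parties: int = 2) -> list:
    """Split a value into random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def reconstruct(shares: list) -> int:
    """Recombine shares; any strict subset on its own reveals nothing about the value."""
    return sum(shares) % MODULUS

# Two organizations each hold a private total and share it between two parties.
a_shares = share(1200)
b_shares = share(3400)

# Each party adds the shares it holds, never seeing either original value.
party0 = (a_shares[0] + b_shares[0]) % MODULUS
party1 = (a_shares[1] + b_shares[1]) % MODULUS

total = reconstruct([party0, party1])
print(total)
```

Only the aggregated output (the sum) is recoverable once the two partial results are combined, which matches the "aggregated output" description in the talk.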
Chad Lagomarsino (30:12):
We have a blog post on SMPC. We also have access to videos and resources that explain how this works in depth. We’d love to have more of a conversation about that, a little deep to go into in a call though. So that is for data and models. They all run through our SMPC encryption process. For models specifically, we also have something called Blind Learning, which effectively means that if you take a layered model, something like a neural network, you’re able to split that model into pieces and run part of the model on the data owner’s machine and part of the model on the data user’s machine and that gives you model security so that the organization that owns a data set will never be able to see your full model.
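To make "split the model into pieces" concrete, here is a toy forward pass of a two-layer network cut in half with NumPy. It only shows that running the halves in sequence gives the same result as the whole model; it is a generic split-model illustration, not Blind Learning itself, and the layer sizes and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # weights of the first layer
W2 = rng.normal(size=(4, 1))   # weights of the second layer
x = rng.normal(size=(5, 3))    # five rows of raw features that stay with the data owner

def first_half(features: np.ndarray) -> np.ndarray:
    """Runs where the data lives: raw features in, intermediate activations out."""
    return np.maximum(features @ W1, 0.0)  # linear layer followed by ReLU

def second_half(activations: np.ndarray) -> np.ndarray:
    """Runs on the other machine: sees only activations, never the raw rows."""
    return activations @ W2

split_out = second_half(first_half(x))
full_out = np.maximum(x @ W1, 0.0) @ W2  # the same model evaluated in one place
print(np.allclose(split_out, full_out))
```

In this toy version the machine running `first_half` holds only `W1`, so it never holds the complete model, and the machine running `second_half` never receives `x`.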
Chad Lagomarsino (31:07):
So that protects the IP of an analytics firm running a model on the data. So full lengthy answer, but there’s those three points, right? The at-rest security is based on your organization’s security and how your team chooses to handle data security. All data and all models that are used in TripleBlind go through our SMPC-based security and model IP is preserved using our Blind Learning technique. Those last two are proprietary to TripleBlind.
Alex Koszycki (31:51):
There’s a bit of a related question there that I wanted to chime in on. So it says, “SMPC has different security models, which is TripleBlind using and can it support various models? Which MPC algorithms are supported, is TripleBlind based on open source algorithms?” So to answer this, this is really cutting to the proprietary secret sauce kind of question. TripleBlind uses a variety of techniques to enable our functionality. Some are based on open source inventions and others are internal inventions. And to learn more, I would suggest doing a POV or POC, just because it’s a bit more detailed to answer that question, but yeah, there are a variety of techniques both in multi-party compute and sort of on the more federated aspects of doing model computations that we make use of to enable our analytics.
David Almeida (33:05):
Thanks, Alex. I’ve got one more question here that I see. I’ll take this next question around describing our typical site users. And then we’ll move on to the second part of the demo where Chad’s going to take us into more of the developer side, which this feels like it leads very well into. So we really have two classes of users that we are seeing actively engaging with our product. We’re seeing the machine learning engineers who have been tasked by their corporation to produce models that they can go use to derive greater insights into their own data sets. They need larger data sets against which to train, so we’re seeing that. We’re also seeing kind of researchers. So data scientists who are going out, who are trying to understand within larger environments, trends that are occurring, where that data is being owned by other organizations, they don’t have access to it. They would benefit from that larger view across multiple data sets outside of what they currently own.
Chad Lagomarsino (34:21):
Okay. So now we’re going to move on to the second portion of the demo. This is going to focus on using the SDK, and we are going to work as an analytics firm that is trying to access information from the health insurance payer data set that we were just looking at. So now we are not going to be the owners of that data. We are going to be the users of that data. We are bringing a model to TripleBlind. Let me jump into our SDK. And I’m going to show you a series of operations that mirrors the pipeline of how a data scientist would typically approach a linear regression problem. All right. Let’s see. Here, I have Visual Studio Code. This is my code editor and I have a terminal window that I will be running in the background. Let me get a script running quickly while I’m talking. This is going to be included with the SDK. So the software development kit that comes with every access point at your key into TripleBlind fold [inaudible 00:35:37] container that you can install on your local machine.
Chad Lagomarsino (35:40):
And with that, you’re able to run code through TripleBlind. The first thing I’m going to do, if I’m an analytics firm and I have been charged with coming up with a linear regression model to describe predicted cost values for a patient in a health insurance claims table, I’m going to try and get a better understanding of the table that I’m working with and understand what valuable predictors we might have. Now, remember, I don’t own this table. I don’t have access to the raw data. So the first thing I’m going to do is I’m going to run a script here to get better insight into mock data. I’m going to create a representation of the data, do a little bit of visualization on it. So you can imagine I’ve already gone through the UI. I’ve looked at the EDA report for the data set that I’m interested in. This health insurance payer database. Now I want to dig a little bit deeper, maybe the features on the UI weren’t quite as custom as I wanted. I wanted to integrate this into my modeling pipeline. That’s what the SDK is for.
Chad Lagomarsino (36:55):
The way that TripleBlind works with code, whether it be Python or R, is that we import it as a package into those base languages. So, we effectively work in the code as a wrapper that wraps around existing Python functions. In this case, you’ll see throughout this example that we are going to be creating one layer of abstraction above the modeling pipeline and use that to package information to TripleBlind. Now I’m not going to go on a code-by-code review of the specific scripts here. They are included with our SDK, but I’m going to highlight a few key features. So the first thing I’m doing is I am importing TripleBlind as TB. I am initializing a user session. So this is my analytics firm. And this part of the script is just looking for a run ID that I have locally just to make sure that I’m running the correct example. Here’s the part that I want to highlight.
Chad Lagomarsino (38:04):
You import TripleBlind as TB. So this is our package, TripleBlind. I’m going to make a table asset by finding the insurance forecast asset that we were just looking at in the UI. I’m going to save that locally as table then I’m going to run some methods that are specifically created for TripleBlind. I am going to get mock data and then display a sample of that data which will be here locally on my machine. So this is a representation of the data that’s going to carry over a lot of the characteristics of the distribution of the data itself. I can go ahead in my next script and run my normal Python pipeline using this mock data if I so choose.
Chad Lagomarsino (39:04):
That’s going to enable me to do things like use visualizations in Matplotlib, work with Seaborn, pandas, et cetera. So effectively what I’m doing in this script now that I’ve created a mock representation of the data, you can see another output of that here in my terminal console, is I’m going to run a couple of basic operations for visualization on that mock data. This is going to allow us to do something like understand what variables are categorical in our dataset which will help us with our regression model and let us know, for example, what do we need to encode for, what do we need to bin for in our regression model.
Chad Lagomarsino (39:59):
Here, you can see some standard output for the types of tools that you would normally be using in a Python pipeline. Smoker, region, sex, et cetera, as well as using the standard methods you’d see like head, shape, et cetera, with pandas. Now I highlight that because you could either have gone into the EDA report and gotten a distribution chart from the actual raw data without accessing the raw data, being able to look at that using TripleBlind’s UI, or you can come in here, get a locally stored mock representation of the data and use that for your visualizations. TripleBlind gives you the flexibility to incorporate the tool into your pipeline as you need it.
Chad Lagomarsino (40:50):
Let me try another example here. If I go into a query, let’s say I’m interested in doing a summation of all of the males and all the females in my data set. I’m interested in doing a sum of the total charges. So basically creating a custom SQL query and running that query you’re able to run that query on the raw data without exposing the raw data. You just get the output of the SQL query. So let me go ahead and run that script in my terminal here. I’m going to hit clear just to give myself a little more real estate.
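The kind of aggregate query described here can be tried locally with Python's built-in `sqlite3` module. This sketch runs against a small in-memory table of invented rows; the table name `claims` and its columns are assumptions, and nothing in it goes through TripleBlind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (sex TEXT, smoker TEXT, charges REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("male", "yes", 100.0),
        ("male", "no", 200.0),
        ("female", "no", 150.0),
        ("female", "yes", 50.0),
        ("male", "no", 300.0),
    ],
)

# Count the records and sum the charges for each sex.
rows = conn.execute(
    """
    SELECT sex, COUNT(*) AS n, SUM(charges) AS total_charges
    FROM claims
    GROUP BY sex
    ORDER BY sex
    """
).fetchall()
conn.close()

print(rows)  # [('female', 2, 200.0), ('male', 3, 600.0)]
```

The result contains only the per-group aggregates, not any individual claim row, which is the property the demo relies on when it says you get just the output of the SQL query.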
Chad Lagomarsino (41:34):
So I’m going to create a job. That job is going to go to the TripleBlind router and then it’s going to look for permission. And in this case, we did not preapprove permission for this type of a query. So we’re actually going to jump into our UI and give permission to the UI, or excuse me from the UI. Before I do that, let me briefly show you what that actually looks like in the code. So I created a query here. I am selecting all columns. I’m selecting SQL transform, and this is the point in which you can set SQL syntax to create a specific query request. So it has the flexibility to run various forms of SQL queries. I’m going to take this query A object that I’ve created and pass that as a parameter inside of a job. So I’m creating TB.createjob, and then creating this job object.
Chad Lagomarsino (42:41):
That is the wrapper for the SQL query. And that is what is getting passed to the TripleBlind router. None of the data is getting passed to the TripleBlind router. This job just informs the router what type of operation is going to take place and which machines to connect together. And we’re just waiting for permission at this point. So I’m going to jump back into the owner of this data set, the insurance payer. I’m going to go to Access Requests. I can see the specific SQL query that’s being asked. And at this point, I can go ahead and hit approve, and then my operation will continue here and you’ll be able to see the summation request.
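The flow described here (a job carries only metadata to the router, waits for the data owner, and proceeds once approved) can be sketched with a toy in-memory model. Everything below is hypothetical: `ToyRouter`, `Job`, and their methods are invented names for illustration and do not reflect the TripleBlind SDK or its router.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """Metadata only: what to run and who is involved. No rows, no model weights."""
    job_id: int
    operation: str
    requester: str
    data_owner: str
    params: Dict[str, str] = field(default_factory=dict)
    status: str = "pending"


class ToyRouter:
    """Hypothetical stand-in for a router that tracks jobs and approvals."""

    def __init__(self) -> None:
        self._jobs: Dict[int, Job] = {}
        self._next_id = 1

    def submit(self, operation, requester, data_owner, params=None) -> Job:
        job = Job(self._next_id, operation, requester, data_owner, dict(params or {}))
        self._jobs[job.job_id] = job
        self._next_id += 1
        return job

    def access_requests(self, data_owner) -> List[Job]:
        """Pending jobs the given owner still has to review."""
        return [
            j for j in self._jobs.values()
            if j.data_owner == data_owner and j.status == "pending"
        ]

    def approve(self, job_id, approver) -> None:
        job = self._jobs[job_id]
        if approver != job.data_owner:
            raise PermissionError("only the data owner can approve this job")
        job.status = "approved"

    def run(self, job_id, execute: Callable[[Job], object]):
        """Run the callback only after approval; the callback stands in for
        the work done between the two parties' own machines."""
        job = self._jobs[job_id]
        if job.status != "approved":
            raise PermissionError("job is waiting for the data owner's approval")
        return execute(job)
```

In this sketch the router never receives data, only the description of the operation, so an unapproved job fails before any computation starts.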
Chad Lagomarsino (43:32):
So that gives you the flexibility to run specific SQL queries. And remember, you cannot use this to obtain individual rows if the k-grouping requirement is not met. So if you have a k-grouping set to, for example, five, and your query would return fewer than five rows, it will fail. It will not allow you to run that operation. That is to protect the anonymity of the specific rows in any query. So with this information in hand, now we’re going to train a linear regression model and store a local copy of that linear regression model. Let me go ahead and run this again in the background as we’re talking. So I’m importing TripleBlind as TB.
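One plausible reading of the k-grouping rule is a minimum group size: an aggregate is refused if any group it reports was built from fewer than k records. The sketch below implements that reading in plain Python. The function name, the exception, and the exact rule are assumptions for illustration, not TripleBlind's implementation.

```python
from collections import defaultdict


class KGroupingError(Exception):
    """Raised when a result would describe a group smaller than k."""


def grouped_sum(records, group_key, value_key, k=5):
    """Sum value_key per group, refusing the whole result if any group
    contains fewer than k records."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])

    too_small = sorted(name for name, values in groups.items() if len(values) < k)
    if too_small:
        raise KGroupingError(f"groups below k={k}: {too_small}")

    return {name: sum(values) for name, values in groups.items()}
```

Refusing the whole query, rather than silently dropping the small group, avoids leaking the small group's total by subtraction from an overall sum.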
Chad Lagomarsino (44:31):
This is just local setup, getting the directory in the correct place and looking for local files. I am initializing as my TB user. I’m looking now with this TB asset find for a specific data set that I’m going to be working with. I’m going to perform my pre-processing here, selecting which columns I want from this database. If you recall, when we were doing our visualizations a little bit earlier, I was looking for which variables were categorical. That’s because I’m about to do encoding here, which is necessary for a linear regression model.
Chad Lagomarsino (45:09):
You need to encode categorical variables because the model expects numeric inputs. You recall a linear regression creates an equation which predicts the value of your output. So here, we’re trying to understand, hey, what’s the predicted cost of this particular patient for the health insurance company based on these different predictors: their age, their sex, whether they’re smokers or not. So for the categorical variables (sex, male or female; smoker, yes or no; the region that they’re from), this autoencoding here allows those categorical variables to be changed to numeric variables for a model. So with all of those model parameters defined, I’m going to, again, wrap my specific model in a job object. This job object is going to get passed to the TripleBlind router.
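One common way to turn categorical columns into numbers for a regression is one-hot encoding, sketched below in plain Python. Whether the SDK's encoding step works exactly this way is an assumption; the function name and the sample rows are made up for illustration.

```python
def one_hot_encode(rows, categorical):
    """Replace each categorical column with 0/1 indicator columns.

    rows: list of dicts, one per record.
    categorical: names of the columns to encode.
    """
    # Collect the sorted set of levels per column so the output is stable.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}

    encoded = []
    for row in rows:
        # Keep numeric columns as they are.
        out = {key: value for key, value in row.items() if key not in categorical}
        # Add one indicator column per level of each categorical column.
        for col in categorical:
            for level in levels[col]:
                out[f"{col}_{level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


if __name__ == "__main__":
    sample = [
        {"age": 19, "sex": "female", "smoker": "yes"},
        {"age": 33, "sex": "male", "smoker": "no"},
    ]
    for row in one_hot_encode(sample, ["sex", "smoker"]):
        print(row)
```

When the regression includes an intercept, one level per column is usually dropped so the indicator columns are not perfectly collinear; that step is left out here to keep the sketch short.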
Chad Lagomarsino (46:14):
I’m not passing the model to the router. I’m not passing the data to the router, just the operation type, the specific parameters for the operation, and which machines we’re going to connect together, assuming they have permission. Now, I’m going to be waiting for permission from the owner of this data set. So once more, I will go into my Access Requests. [inaudible 00:46:52]. I’ll see that I am attempting to run an operation on this train data set here. I’m interested in doing linear regression training. I agree with this, so I can hit approve. I’m going to get some standard output, such as the coefficients from that training.
Chad Lagomarsino (47:20):
And here’s the critical thing: I’m saving a local copy of that model here on my local machine. So what I just did is I trained a linear regression model that I can then take outside of TripleBlind and put anywhere that I want. Meaning, if I decide at this point I don’t want to use TripleBlind anymore and I have local test data for model inference, I can pull that model out, use it outside of TripleBlind, and do all of my inference steps that way. However, if, for example, I don’t own any test data and I don’t own the training data, then I will need to do a remote inference. And that’s what this next step is going to show.
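To make the "take the model anywhere" point concrete, here is a minimal sketch of saving a fitted linear regression as its intercept and coefficients, reloading it, and running inference on local data with no platform involved. The JSON file layout, the function names, and the coefficient values are assumptions for illustration; the format the SDK actually writes is not shown in the demo.

```python
import json
import os
import tempfile


def save_model(path, intercept, coefficients):
    """Write the regression equation to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"intercept": intercept, "coefficients": coefficients}, handle)


def load_model(path):
    """Read the regression equation back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(model, features):
    """Apply y = intercept + sum(coefficient * feature) to one record."""
    return model["intercept"] + sum(
        coef * features[name] for name, coef in model["coefficients"].items()
    )


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "linear_model.json")
    # Invented coefficients, not output from the demo.
    save_model(path, 1000.0, {"age": 250.0, "smoker_yes": 20000.0})
    model = load_model(path)
    print(predict(model, {"age": 40, "smoker_yes": 1}))
```

Because a linear regression is fully described by its coefficients, a local copy like this is enough to score any local test data.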
Chad Lagomarsino (48:12):
So my last script here, I’m going to go through a very similar process, but instead of working on the training portion of my data set, now I’m going to work on the testing portion of my data set. So you recall, this linear regression model has been fed a bunch of data to create a regression equation. And now it’s going to be fed a new set of data to test that regression equation and give us a sample of what it predicts the expected cost is going to be for each patient using our linear regression model. I’ll run this guy so that we may see how that’s running. So at this point, it’s going to be clear.
Chad Lagomarsino (49:25):
So what happened here is that I am taking a copy of the regression model that has been positioned in TripleBlind, meaning TripleBlind’s router has access to the location of that model, and then it’s going to call on that model. It’s going to call on data that is owned by another organization. It will combine those two things together and run a linear regression inference. And that allows us to get the predicted cost for each patient in that data set, the model output, without revealing any of the raw data of those patients themselves. So this information tells me the first row of the data set.
Chad Lagomarsino (50:17):
For the insurance payer, we’re expecting it’s going to cost them $32,000 this year based on predictors such as their age, their sex, their region, whether they smoke or not, et cetera, but I never have to see any of that information. I’m able to run this model without actually ever exposing any of that raw data, without moving it anywhere, without having any of those added security risks. So I can take these inference results and use them for population health studies. I can potentially sell them back to the insurance company. There’s a whole multitude of uses for that. So I know we’re coming close to time. So I’m going to go ahead and open the floor. See if we can answer any more of our Q and A questions. Yeah, let me go ahead and just jump back to our screen here and pop open our chats. All right. Let’s see.
Chad Lagomarsino (51:14):
“I keep hearing TripleBlind router. Is it a special networking router?” So the TripleBlind router is hosted in an AWS instance that TripleBlind maintains. It just acts as an intermediary between our users of TripleBlind. So everyone on TripleBlind has an access point, and that access point communicates with our router. Our router just manages the traffic and makes peer-to-peer connections between access points. So organization A and organization B communicate using the router as the go-between, and it’s just an AWS cloud instance. All right, we have about five minutes left. So any additional questions we can address, or anything anyone wants to see again, we can briefly jump into.
David Almeida (52:21):
I think that I see a question that says, “I saw a lot of Python and some SQL in the second part of the demo. Is it possible to use TripleBlind with R?”
Chad Lagomarsino (52:37):
It is. So TripleBlind is able to run in a variety of different coding languages that are common for analytics purposes, such as Python and R. We’re also able to import different model formats, such as PMML and ONNX. We have quite a lot of variety in how you’re accessing the SDK in TripleBlind.
Alex Koszycki (53:07):
There’s a question here. “Can we go over mock data creation? What is the difference from other synthetic data platforms?”
Chad Lagomarsino (53:16):
Right. So there’s two things here. Mock data is specifically used for displaying in our UI, and then you have sample data. They’re slightly different. The sample data is what you use if you want to download a local copy of the data and run things such as visualizations or some descriptive statistics locally. In both cases, they’re actually generated using a GAN, so a generative adversarial network, which is a modeling technique that’s going to, in a nutshell, look at the distributions and spread of the data and attempt to replicate what it sees. Now that is different than a lot of synthetic data companies. They don’t typically use that particular algorithm. We could definitely go in deeper and talk about some of the merits of using synthetic data versus using the raw data. Critically, what’s important here is that TripleBlind employs both techniques as needed. So we have the flexibility of being able to use mock data when it’s appropriate and use raw actual data when it’s not. Let’s see here. Next question.
David Almeida (54:47):
I can actually take this one, Chad.
Chad Lagomarsino (54:49):
David Almeida (54:51):
So we’ve got a question here, an anonymously submitted question: practically speaking, you might want to run your model against a data set many times in order to get an accurate result. In the demo, Chad did show asking for permission for each discrete run. That goes back to the agreements that can be created between your organizations, where you as the data owner, working with the model owner, can agree that the model owner can run against that data set one time, which is what we demonstrated today, or many times. And that’s all, again, done within the user interface, by setting the agreements between the organizations specifically for the data set and specifically for the model that’s being used. So that gets into access rights, and we have a very robust feature set around granting those access rights.
Alex Koszycki (55:52):
Additionally, say you are working on a project that is time-boxed to a certain period or falls under a certain agreement. The agreements in our system can support expiration dates or specific usage counts. So you can give access exactly to the limit that you’ve set in your legal agreement between organizations, to a high level of specificity.
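The agreement limits described in this answer (scoped to a data set and an operation, with an optional expiration date and usage count) can be sketched as a small permission check. The `Agreement` class, its fields, and the organization and data set names below are all hypothetical and are not the product's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Agreement:
    """Hypothetical access agreement between a data owner and a grantee."""
    dataset: str
    operation: str
    grantee: str
    expires_at: Optional[datetime] = None  # None means no expiry
    max_uses: Optional[int] = None         # None means unlimited runs
    uses: int = 0

    def permits(self, dataset, operation, grantee, now) -> bool:
        """True if this request is in scope, unexpired, and under the usage cap."""
        if (dataset, operation, grantee) != (self.dataset, self.operation, self.grantee):
            return False
        if self.expires_at is not None and now >= self.expires_at:
            return False
        if self.max_uses is not None and self.uses >= self.max_uses:
            return False
        return True

    def consume(self, dataset, operation, grantee, now) -> bool:
        """Record one run if permitted; return whether the run may proceed."""
        if not self.permits(dataset, operation, grantee, now):
            return False
        self.uses += 1
        return True


if __name__ == "__main__":
    agreement = Agreement(
        dataset="payer_train",
        operation="linear_regression_training",
        grantee="analyst_org",
        expires_at=datetime(2022, 8, 1, tzinfo=timezone.utc),
        max_uses=2,
    )
    now = datetime(2022, 7, 27, tzinfo=timezone.utc)
    for _ in range(3):
        print(agreement.consume("payer_train", "linear_regression_training", "analyst_org", now))
```

With an agreement like this in place, repeated runs no longer need a manual approval each time; they proceed until the count or the date runs out.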
David Almeida (56:27):
And I can see that we have a business-minded person on the call who figured out how the money works, yes.
Frank LaDonna (56:41):
All right, guys. If there are no other questions, we’ll start to wrap here. So thank you guys for joining. I will be sending out a follow-up email to everyone here that includes all of our panelist information, as you can see here, some relevant content and links for you guys to interact with, and also a convenient way for you guys to follow up with any one of us to ask any more questions or schedule a one-on-one demo session together. You’ll also get access to this webinar recording and the slide deck, so take a look at those as well. We thank you guys for joining and appreciate your time. Have a good rest of your day.
In our past webinars, we heard professionals ask questions like, “Will you be presenting a demo in this session?” and “Can we interact with the TripleBlind product live?”
Based on these questions and positive feedback we’ve received, we’re excited to announce “TripleBlind: Live!”
Have a specific and data-centric use case in mind? Our solution can help! Join us for a real-time demo session of our complete and scalable solution for privacy-enhancing computation (PEC). Hear TripleBlind’s expert engineers discuss some of the most common issues behind data access, data prep, data bias, and how our solution solves these challenges, 100% live! We will also be taking live suggestions and comments on features, functionalities, and capabilities to enhance your experience while leveraging our solution, so don’t stay mute!
- Permanent & Irreversible One-Way Encryption
- Robust Digital Rights Management System
- Collaborative Peer-to-Peer Environment for Data Providers & Data Users
- Chad Lagomarsino – Partnership Engineer, TripleBlind
- Samir Mohan – Partnership Engineer, TripleBlind (tentatively out)
- Alex Koszycki – Product Manager, TripleBlind
- David Almeida – Senior Customer Success Manager, TripleBlind
Date/Time: Wednesday, July 27th, 2022, 11:00am CT / 12:00pm EST
Moderator: Frank LaDonna