How TripleBlind Can Help Produce Better Models
The Power of TripleBlind Technology as Demonstrated Through the Blind Learning Model
It’s an age-old rule of AI: an algorithm is only as good as the data it is trained on. If the training dataset is small or biased in some way, the model will not produce accurate results when it is presented with new scenarios. To ensure a model’s accuracy and effectiveness, it must be trained on a large quantity of accurate, unbiased data. This is often difficult to accomplish because data is scarce. Even if an organization is able to acquire a large amount of data, that data is likely skewed in some way (location, ethnicity, gender, etc.) because it comes only from the organization’s own customers and users. Models trained only on the organization’s own data are therefore not universally applicable, which poses a serious problem for any organization that wishes to deploy an algorithm for wide-scale use. There is, however, a clear solution: find a way to train the model across multiple datasets provided by other organizations.
Here, we present an example of how the TripleBlind technology can do just that: solve the problem of data scarcity while maintaining privacy.
Example with Publicly Available Data
To demonstrate the power of TripleBlind technology, we developed a test case in which portions of the MNIST* dataset, held by two different organizations, were used to train a convolutional neural network (CNN).
We separated 29,000 MNIST images into two datasets and assigned one to each of two organizations: Organization A received 6,000 images and Organization B received 23,000 images. We first trained a CNN model using only Organization A’s dataset. The resulting model, trained on only Organization A’s 6,000 images, was 91.60% accurate when tested against a separate set of 1,000 MNIST images.
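The partitioning step described above can be sketched as follows. This is only an illustration: the original scripts are not shown, so the function name `split_between_organizations`, the use of a random shuffle, and the seed are all assumptions, and the code works on image indices rather than on the MNIST images themselves.

```python
import numpy as np

def split_between_organizations(n_total=29_000, n_org_a=6_000, seed=0):
    """Shuffle image indices and split them into two disjoint sets.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not details taken from the original experiment.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_total)
    org_a_idx = shuffled[:n_org_a]   # Organization A: 6,000 images
    org_b_idx = shuffled[n_org_a:]   # Organization B: the remaining 23,000
    return org_a_idx, org_b_idx

org_a_idx, org_b_idx = split_between_organizations()
print(len(org_a_idx), len(org_b_idx))
```

Because the two index sets are disjoint, no image is held by both organizations, which matches the setup in which each organization owns its own data.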
Then, using TripleBlind’s privacy-preserving technology, we used the same scripts to train a CNN model over the datasets owned by both Organization A and Organization B. Because we used TripleBlind’s platform, we were able to greatly increase the size of the training dataset, resulting in a CNN model trained on 29,000 images. We tested this model in the same way, against the same set of 1,000 MNIST images as before. The result was an accuracy of 96.10%.
By using TripleBlind’s technology, we were able to both increase the size of the training dataset and increase the accuracy of the model. The training dataset grew from 6,000 images to 29,000 images (an increase of roughly 383%). This larger training dataset resulted in a more accurate model: the new model was 96.10% accurate, compared with 91.60% for the model trained on less data.
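The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the numbers reported in this example; the variable names are illustrative, and the error counts are simply derived from the reported accuracies on the 1,000-image test set.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100

# Training-set growth: 6,000 -> 29,000 images
size_increase = percent_increase(6_000, 29_000)

# Reported accuracies on the same 1,000-image test set
acc_a, acc_combined = 91.60, 96.10
test_images = 1_000

# Misclassified test images implied by each accuracy
errors_a = round((100 - acc_a) / 100 * test_images)
errors_combined = round((100 - acc_combined) / 100 * test_images)

point_gain = acc_combined - acc_a
relative_error_reduction = (errors_a - errors_combined) / errors_a * 100

print(f"Training set increase: {size_increase:.1f}%")            # 383.3%
print(f"Accuracy gain: {point_gain:.1f} percentage points")      # 4.5
print(f"Test errors: {errors_a} -> {errors_combined}")           # 84 -> 39
print(f"Relative error reduction: {relative_error_reduction:.1f}%")  # 53.6%
```

Seen this way, the 4.5-point accuracy gain corresponds to roughly halving the number of misclassified test images.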
Privacy was preserved throughout the entire process. Organization A was NOT able to see the data of Organization B. Likewise, Organization B was NOT able to see the data of Organization A. Neither organization could see the algorithm that was created. The training dataset grew nearly fivefold, from 6,000 to 29,000 images, improving the model’s accuracy without compromising the privacy of either organization’s data.
One should note that this example was conducted with evenly split data: the datasets were not biased in any way. In the real world, Organization A’s 6,000 images would likely be biased in some way. For example, perhaps 90% of its 6,000 images would show the digits 1-5, while 90% of Organization B’s images would show the digits 6-9. With such biased datasets, the accuracy difference between the first model and the model trained using TripleBlind’s platform would likely be much more pronounced.
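To make the hypothetical biased scenario concrete, the sketch below generates synthetic label sets with the 90% skew described above. It is an assumption-laden illustration, not part of the original experiment: `skewed_labels` is a hypothetical helper, the labels are randomly generated rather than drawn from MNIST, and the remaining 10% of each set is spread over the other digits (including 0).

```python
import numpy as np

def skewed_labels(n, favored, favored_fraction=0.9, seed=0):
    """Generate n synthetic digit labels, mostly drawn from `favored`.

    Hypothetical helper for illustration only; no real images are used.
    """
    rng = np.random.default_rng(seed)
    favored = np.asarray(favored)
    others = np.setdiff1d(np.arange(10), favored)  # all remaining digits 0-9
    n_favored = int(round(n * favored_fraction))
    labels = np.concatenate([
        rng.choice(favored, size=n_favored),
        rng.choice(others, size=n - n_favored),
    ])
    rng.shuffle(labels)
    return labels

org_a_labels = skewed_labels(6_000, favored=[1, 2, 3, 4, 5], seed=1)
org_b_labels = skewed_labels(23_000, favored=[6, 7, 8, 9], seed=2)

# Share of each organization's labels that fall in its favored digits
share_a = np.isin(org_a_labels, [1, 2, 3, 4, 5]).mean()
share_b = np.isin(org_b_labels, [6, 7, 8, 9]).mean()
print(share_a, share_b)
```

Under this kind of skew, a model trained only on Organization A’s data would see only a few hundred examples of the digits 6-9, which is why the benefit of training across both datasets would be expected to grow.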
*Modified National Institute of Standards and Technology