Ep 44: Fun with Differential Privacy (w/ Ian Coe of Tonic.ai + Abhishek Bhowmick of Samooha)
It's 2023, and privacy is now fun! Ian + Abhishek talk synthetic data, differential privacy at the edge, and more.
Abhishek Bhowmick, Co-Founder of Samooha, was formerly Apple’s Head of ML Privacy and Cryptography, and deployed key technologies like secure multi-party computation (MPC), differential privacy, and federated intelligence.
Ian Coe, as the CEO and Co-Founder of Tonic.ai, has been leading the company toward synthetic data generation to maximize data utility, protect customer privacy, and drive developer efficiency.
In this conversation with Tristan and Julia, they explore concepts like synthetic data generation, differential privacy, and how you can leverage privacy-oriented tactics to do better data work.
Key points from Ian and Abhishek in this episode:
Why don't we have a conversation around privacy? Who's thinking about it today? Who's pushing the conversation forward? Do all companies make this their top level priority? Who does it concern today?
I really like what you said about seeing some of the trade-offs. That's actually what Tonic was founded on: in all the founders' prior jobs, what we kept feeling was that data could either be secure and not that useful, or insecure and really powerful for the whole enterprise. And we had a ton of experiences like that. You know, I would be onsite at a big bank and not be able to debug something that I was trying to implement because developers back in Palo Alto couldn't see the data. So, this was definitely a top priority for us as we were thinking and brainstorming ideas.
I think today, in terms of why people care about privacy, we see it as both privacy and data risk to a large degree, and that broadens the conversation a little bit. Obviously, you have the classic things like regulatory pressures: GDPR, HIPAA, and PCI.
But you also have data breach risk, which is coming up more and more, especially as people use data for generative AI and for developing data-driven software products. As people need to start aggregating data into places where it can be accessed by many people, it just gets more and more complicated, and it gets harder and harder to protect privacy.
The classic personas that you would think of are often the first ones to sound the alarm: CISOs and compliance folks. But we're seeing more and more DevOps and data infrastructure teams treating this as a first-class responsibility for themselves.
And I couldn't agree more. Privacy is increasingly becoming mainstream, as you can see, and privacy is about empowering end users with more control over their data and more transparency. How is the data being protected and secured while at the same time driving customer value? The two main threads along which these conversations are pushed are, one, privacy by product and by design, from the get-go, and second, privacy by regulatory compliance, as Ian also alluded to. But they're both geared towards a common goal: more privacy, more security, and more governance around PII data. The product-oriented offerings that I talked about are mostly championed by consumer companies like Apple and also enterprise clouds like Snowflake, AWS, GCP, and Azure.
They all have their own privacy-enhancing frameworks that are rolled out for companies to build privacy into their products. Compliance-driven privacy, on the other hand, is mostly led by sovereign regulatory agencies, both in the EU and the US, and leads to regulation like GDPR or CPRA. We'll talk about some of that in more detail.
Our observation has been that the privacy-enhancing frameworks that are being thrown out there, they're still not being made super easy to incorporate and build into products. Practically speaking, it might appear to be a luxury that only big tech companies can afford. So our vision at Samooha is to be that big easy button to bring these technologies into life, into end user facing products at these enterprises.
Just to add to that a little bit, it really does matter what your industry is to some degree. It matters and it doesn't. It's what we'd see in the long term. For most businesses that reach scale, we see them having to take this on.
What we've seen as we've been in the market is that depending on your vertical, you will adopt at different moments. So if you're a healthcare company, you might adopt a product like Tonic with 10 people because HIPAA is so onerous that it really drives your whole data culture from day one.
For a financial services company, maybe 50 people. If you're just an average B2B company, more like 200 people; that's when you hire a CISO. Potentially, that's when you do a SOC 2. Maybe you sign a contract with a data covenant in it with a big customer. Then for a B2C company, more like a thousand.
Obviously, these are broad strokes. I definitely agree with what Abhishek was saying: whether you're B2C or B2B, all these things drive what your data culture and your tolerance for privacy and data risk look like.
What is the use case that you see that's most exciting that people can accomplish using synthetic data versus real?
I think you bring up a really good point. There are a ton of applications out there. And certainly, regulated industries feel a lot of pressure to do this, but we have hundreds of customers at this point, and many of them fall into that B2B or B2C category where they are not in financial services or healthcare.
When we started, we actually assumed "Oh, that's gonna be where we grow". And a lot of our seed stage conversations were about that. But it's really the maturity of the business that determines when that interest peaks. In terms of self-driving cars, obviously, that's an example of a synthetic data use case. We actually focus more on relational data. A use case that we're really excited about, and what almost all of our customers are doing, is getting data to developers or data scientists so that they can do their work and leverage all the power of production data without actually having to bring all the sensitive data into a lower environment where they can be productive.
I'll give you an example. eBay is a customer. They've been a customer for a long time, a really great partner. They use us to create staging data that reflects the real complexities of their highly intricate and intertwined systems. They're a really talented engineering team, and they've built out a lot.
But basically, they have a lot of ontological complexity where the relationships within the data have grown as eBay's application has grown. They've been building for two decades and that has really created a complex data system that's enmeshed in a lot of interesting, intricate ways. To actually synthesize that requires a pretty flexible platform. And the data is across multiple databases, obviously, with lots of tables. So, we create realistic staging environments that are all of different sizes and for different use cases.
All of that is automated and updated, and that's really where the power of a DevOps-focused synthetic data platform comes in, and that is very scalable. It's easy to update, we can protect the data, and we can report on all of that. There we're actually serving about 4,000 developers. We take eight petabytes of data and we downsample; some things can fit on laptops, some things can't. And at this point, as I understand it, 90% of their automated testing happens on Tonic-generated data.
And it has resulted in an increase in integration tests passing by more than 50%. It's been huge savings for them, and a really awesome achievement on the privacy and data security side. It's been a nice win-win that we really like to see, where it's both a productivity boost and really helping on that security and privacy posture as well.
How are you able to preserve that utility while ensuring that it's anonymous? Are you keeping some of the real user data in the synthetic data but removing the ability to identify people? I'm curious about how synthetic data is created.
What you mentioned is really accurate. We have two products. One is really focused on developers, one is focused on data scientists. For the data science product, there are privacy applications.
There's also applications that we're seeing that folks are using to rebalance data and remove certain biases from data. And, there are rare events when you can synthesize more data and build a better model.
A lot of this is use-case specific. But you're a hundred percent right. Synthetic data can help in some of those cases. In terms of the privacy-utility trade-off, it's hard. Generally, there is a tension there. The most useful data is production. The pragmatic thing that we do is provide a lot of reporting to our customers. It shows comparisons of the original data to the new data. It's especially important for data scientists, who are really gonna be looking at that, and developers. The developers have a slightly different bar. Typically, it's like "Hey, the application runs, my tests run." "Okay, I'm good." It's a little less important in a lot of cases to do the statistical comparisons. On the theoretical side, we never, as a practice, push production data to a staging environment.
So, you can use our product and do anything you want, you could do that. But, there's nothing that we do that explicitly does that by design. If we're gonna push noise data or something like that, what we'd be looking at is something more like differential privacy, which provides mathematical guarantees around the privacy of data.
That's one of the ways we measure the privacy of the data, looking at epsilon values and things like that. So there are things you can do. But it is complicated. It's never easy to say "Oh, this data is like this percent secure" or something like that. Differential privacy, in our opinion, is one of the better approaches in terms of trying to get you to that conclusion.
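For readers unfamiliar with the epsilon values mentioned here, the standard textbook definition of differential privacy is sketched below. This is general background, not a description of how Tonic computes or reports its metrics; the symbols M, D, D', and S are notation introduced for this note.

```latex
% A randomized mechanism M is \varepsilon-differentially private if,
% for every pair of datasets D and D' that differ in one individual's record,
% and for every set S of possible outputs:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

A smaller epsilon forces the two output distributions closer together, so the result reveals less about any single individual. A larger epsilon allows more accurate results at the cost of a weaker guarantee.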
I know you led all of ML privacy and cryptography at Apple and they were very much leaders in differential privacy. First, what is differential privacy? Can you define that for our audience?
So, differential privacy, or DP, is a mathematical form of privacy that protects row-level user data while still enabling analysts to learn statistics, run analytics, and run ML on the population as a whole. And this is done by adding carefully calibrated noise at various stages of the analysis.
I'm gonna give you a simple example. Say I wanna do a survey and I wanna know what fraction of the users are Democrat. And I picked a sensitive question because that is something my participants in the survey are not gonna be comfortable sharing with me. In a non-private world, what would happen is that everyone would share their response in the clear saying one for "yes" and zero for "not a Democrat".
I collect all the responses, take the average, I'm done. But, in the privacy world that we live in, this is where differential privacy will help and here's how it will work. Every participant, on a piece of paper, writes if they are Democrat leaning. One or zero. Then they pick a random number between minus 10 and plus 10, and add that to their true signal.
Let's say I'm a one and the random number I pick is plus five. I add plus five to one and generate a response. Six is what I send back to the owner of the survey. From a privacy perspective, you can see how the participant is happy because all they sent was the number six. As the owner of the survey, I cannot tell which way they're leaning, right?
From the utility perspective, it is not that much of a problem. The moment I add up everyone's responses across the population by the law of large numbers, the additive noise, which was equally balanced between minus 10 and 10, will eventually cancel out. And then the true response surfaces. And that's precisely what I need. I need an overall percentage.
I don't need to know individually who is leaning towards what. That's, at a foundational level, a very simple explanation of the underpinnings of differential privacy. I'm happy to talk about more details on some use cases that we did back at Apple as well, and also how we are using it in Samooha.
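The survey example above can be simulated in a few lines of Python. This is a minimal sketch of the simplified illustration as described, not Apple's or Samooha's implementation; the function names, the population size, and the 40% true fraction are assumptions made up for the demo. Note also that uniform noise on a fixed range is a teaching device: a response above 10 could only have come from a "one", so deployed systems typically use mechanisms such as Laplace noise or randomized response to obtain a formal guarantee.

```python
import random


def noisy_response(is_democrat: bool, rng: random.Random, spread: float = 10.0) -> float:
    """Return the true answer (1 or 0) plus zero-mean uniform noise in [-spread, spread]."""
    return float(is_democrat) + rng.uniform(-spread, spread)


def estimate_fraction(responses: list[float]) -> float:
    """Average the noisy responses; the zero-mean noise averages out as the population grows."""
    return sum(responses) / len(responses)


def simulate(n_participants: int = 200_000, true_fraction: float = 0.4, seed: int = 0):
    """Simulate the survey and return (actual fraction, estimate from noisy responses)."""
    rng = random.Random(seed)
    # Hypothetical population: each participant is a "one" with probability true_fraction.
    truths = [rng.random() < true_fraction for _ in range(n_participants)]
    responses = [noisy_response(t, rng) for t in truths]
    actual = sum(truths) / n_participants
    return actual, estimate_fraction(responses)


if __name__ == "__main__":
    actual, estimate = simulate()
    print(f"actual fraction:   {actual:.3f}")
    print(f"noisy estimate:    {estimate:.3f}")
```

Any single response tells the survey owner very little, while the average over a large population lands close to the true fraction. With fewer participants the estimate gets noticeably worse, which is the privacy-utility tension discussed earlier in the episode.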
Should we be concerned about generative AI having user data? Should they have synthetic data? What is the latest perspective from two privacy experts on generative AI and privacy?
I feel like we should have had a contest to see who could do the most hilarious image with Stable Diffusion or something. Maybe we can do a podcast. We could submit the podcast audio only. The promise of generative AI is large. It would be a very significant understanding. But it also presents some really significant risks. And we think that there are some really big risk vectors that enterprises are gonna have to start thinking about. I think one of the things that's interesting is that data is going to be very important to make progress on future models. Everyone's excited because the models have gotten really good really fast.
Some of that is increased parameter size and complexity, some is increased data. Where we're at right now is that most likely future advances are gonna come from data, and that parameters themselves, just adding more and more, start having really big diminishing marginal returns.
This means that your machine learning teams, when they are trying to compete with other companies or just trying to make progress, are gonna come to the key holders of data and say "We need it." We are happy to get into more details. We actually have a webinar on some of the theory behind all of this that you can find on our YouTube channel. But the three big vectors for risk that we see are: one, if you're not doing a federated approach, which, there's a lot of complexity there. Obviously, Apple's talent can do it, but that's gonna be a big lift for a lot of companies. And a lot of companies are gonna be aggregating all that data for training in a central place, in an environment that doesn't have the same security that a production environment would.
That's classic. Moving data to lower environments: we saw that with developers, and there's a big opportunity there to help reduce the risk of breaches. But we expect there to be some increased risk of breach as folks try to get this data all together.
Another thing that we expect to start happening, and we're already seeing some examples of this, as the models get more complicated, they get better at memorizing data. Meaning that there's risk that your training data gets actually revealed by the model. And, I'll talk about how this has already caused some lawsuits.