Ep 40: What can generative AI do for data people? (w/ Sarah Nagy & Chris Aberger)
Precise truth is critical for business questions in a way that it's not for a consumer search query. How will new types of applications solve for that?
Sarah and Chris are both at the forefront of bringing the promise of gen AI to our actual work as data people—which is a unique challenge!
Sarah Nagy is the CEO of Seek AI, a startup that aims to use natural language processing to change how professionals work with data.
Chris Aberger currently leads Numbers Station, a startup focused on data-intensive workflow automation.
In this conversation with Tristan and Julia, they dive into what this future might actually look like, and tangibly what we can expect from gen AI in the short/medium term.
Listen & subscribe from:
Key points from Sarah and Chris in this episode:
In the intros, we've talked about Generative AI, transformers, large language models, and foundation models. Can you give us a snapshot of where we're at today?
Sarah Nagy:
I would say what's interesting is, a few years ago, I remember this tweet, I don't know if you saw it, but it went viral where someone was like, I feel like most executives at big companies think that AI is just doing a GROUP BY in SQL. We had this period of time where I think no one really knew what AI was, and sometimes people were conflating it with just big data analysis.
So I think what's interesting about where we've come today is that when people say AI, they're now talking more and more about generative AI, which I think is a catch-all for large language models, foundation models, and in a way even transformer models.
But at least it's real AI in the sense that it's the most state-of-the-art, and people aren't mistaking it for something else, like big data analysis. What it really means is just very large models with billions and billions - now approaching trillions - of parameters that can oftentimes take natural language as input.
And when people say generative just colloquially, it's because it's generating stuff - language, images, code - whatever you tell it to generate through a natural language interface.
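To make that "generate whatever you ask for" idea concrete, a minimal sketch of prompting a generative language model might look like the following; the library call is standard Hugging Face usage, and the model choice is purely illustrative, not one mentioned in the episode:

```python
# A minimal sketch of prompting a generative language model.
# Model choice ("gpt2") is illustrative; any causal LM would do.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Write a SQL query that counts orders per customer:"
result = generator(prompt, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])
```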
Chris Aberger:
And just to riff off of what Sarah's saying here, it's really interesting to see how we got to this point in generative AI where we're generating these things like images and text. And if you go back to when I was actually doing my Ph.D. at Stanford, the world was quite different, right?
At that time, most AI researchers were actually looking at selecting across a bunch of different models, and they weren't centered on a single model architecture. So Tristan, you brought up transformers…
Chris Aberger:
No, not so much ensembles. It was more like if I wanted to do something in the image space, I was using convolutional neural networks.
Yes. And actually, in the language space, the state-of-the-art at the time was sequence-to-sequence models, so recurrent models like RNNs or LSTMs - the models themselves aren't important. But an important thing that happened around 2017-2018 was that these transformer architectures came out, which are really the basis of everything that these foundation models, or even generative AI models, are built on.
So these just use a mechanism called attention. And the interesting thing here was that it was really efficient to run on GPUs. They took off and took over the landscape, and by around 2018 the ML community had basically converged on this architecture for a lot of different things.
For foundation models, the interesting thing that happened here was this technique called self-supervision. So basically what this enabled was large-scale training across Internet-scale data, where humans didn't have to go in and manually annotate the data that they were training on. It could just train over data sets that you'd feed into it.
And so this enabled the training of really large models that learned basically all the content that's out there today and led to this trend in generative AI. So it's a really fascinating time, and the ML community is more or less all centered around these transformer architectures right now.
And these foundation models are just trained on basically all the data that's out there and don't require any human input in that training process.
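The self-supervision Chris describes typically boils down to an objective like next-token prediction, where the labels come from the data itself rather than from human annotators. A minimal sketch of that idea; the toy model and framework choice are assumptions for illustration, not details from the episode:

```python
# Sketch of a self-supervised (next-token prediction) objective:
# the "labels" are just the input sequence shifted by one token,
# so no human annotation is needed.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4
token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for web text

inputs = token_ids[:, :-1]   # predict each next token...
targets = token_ids[:, 1:]   # ...from the tokens that came before it

# A toy "model": an embedding followed by a linear head.
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)

logits = head(embed(inputs))                      # (batch, seq_len-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```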
Can you take a minute to describe what the terms "transformer" and "foundation model" specifically mean?
Chris Aberger:
I can start here. So transformers are that architectural piece. They use a mechanism called attention underneath, which basically allows the model to attend to certain parts of the data you feed into it.
So it's really the mathematical component or backbone of these architectures.
It's called "Attention Is All You Need." It was a really popular, actually provocative paper at the time. I remember when it came out, a lot of the ML community was split: is that true? No, we're still using LSTMs; this might not actually be the case. It turned out to pretty much be true, and it's what we're seeing right now. But yes, that's exactly the paper that ushered in this transformer architecture. Foundation models: this term was actually popularized by a Stanford lab, and some of our co-founders were part of popularizing it. Basically, foundation models are these models, usually with transformers powering what's running underneath.
But they're fed just a large amount of data on a single backbone and are able to handle a variety of different tasks. That can be text generation, it could be image generation. They can apply to a wide variety of tasks because they're trained on this Internet-scale data that's being fed into them.
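At its core, the attention mechanism Chris keeps referring to is a weighted average, where the weights come from comparing queries against keys. A minimal scaled dot-product sketch, not any production implementation:

```python
# Minimal scaled dot-product attention (the core of a transformer layer).
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Compare every query against every key, scale, then normalize to weights.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the values.
    return weights @ V

# Toy example: 3 tokens, 4-dimensional representations.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(attention(Q, K, V))
```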
At startups like Numbers Station and Seek, do you still start with these large foundation models and then fine-tune them, or do you have to start from scratch?
Sarah Nagy:
We use foundation models that were pre-trained. These models know a lot of stuff that may not necessarily be relevant, like recipes and things like that. But that kind of knowledge is actually really helpful when both of us are applying these models to data applications within enterprises.
Let me think of an example with recipes: say that you're a B2C tech company and you deal with recipes or food, say food delivery, for example. That knowledge of recipes, that broad knowledge of just the food universe, could be relevant.
What we encounter is sometimes our customers have all sorts of different data sets, and there might be one column that has something that's adjacent to food. Maybe the rest of the business and the rest of the data don't contain anything like that. That broad knowledge of the foundation model is what actually helps it perform so well.
And before these models existed, if you were just training from scratch, a small model only on very specific data for a certain domain, it would miss a lot of these fuzzy areas that are outside of the domain.
Chris Aberger:
Yes, I agree with Sarah 100%. And there's not a lot of need to reinvent the wheel.
And think about how expensive it is to pre-train these models from scratch and how much OpenAI and other companies have put into this. So it's a great starting point, and we like to start off with these models. But the way I always think of it is that it gets you 60 to 70% of the way there, and you usually need to bring in some other techniques, usually fine-tuning, to get you to that 90 to 95% where it's acceptable to actually deploy in an enterprise setting.
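The "start from a pre-trained model, then fine-tune" workflow Chris describes often looks roughly like the sketch below with the Hugging Face libraries; the model name, dataset, and hyperparameters are illustrative assumptions, not details from the episode:

```python
# Rough sketch of fine-tuning a pre-trained foundation model on task data.
# Model and dataset choices are illustrative, not from the episode.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("imdb", split="train[:1%]")  # small slice for the sketch
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()  # adapts the general-purpose model to the specific task
```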
There are two, maybe three, solutions out there that can help you with these pre-trained foundation models. Are there more options than that, or if you need a big foundation model, are you going to one of these three vendors?
Chris Aberger:
We've seen more and more options evolve, it seems, each year, right? So I think you're right in terms of the really big popular models right now: Google, OpenAI, Facebook, and the Anthropic model looks really exciting. We're seeing more startups pop up that are doing these model trainings as well, like Adept, Inflection, and AI21 Labs. There's also a community effort if you look at Hugging Face's BigScience, where they're training their own large language models or foundation models, as well as EleutherAI, which is another open-source effort whose models you can go download.
But to your point here, Julia, there is just a handful right now. The way that I see it, a lot of the community is actually moving towards training these things. The main thing is getting all the data to actually pre-train these models. The model architecture itself is pretty well known and commoditized at this point, and my belief is that these models will become more and more commoditized over time as more and more researchers go in and pull in these large swaths of data to train on.
But AGI in particular - the really ambitious, broad, general problem - I still think will be confined to the companies that have the financial resources to embark on these really expensive pre-training endeavors.
What are some of the tasks that you both are focused on at your respective companies that will be a whole lot easier when we introduce Generative AI into data?
Chris Aberger:
So we're working on two main things at Numbers Station right now, although the vision and scope that we want to go after long term are much broader across the modern data stack. And right now - actually, interestingly for this podcast - it's all built on top of dbt. So we're always emitting dbt transformations.
It's really: how can we accelerate those analytics engineers that are out there right now, as well as eliminate some of the bifurcation that occurs with data science teams inside of these organizations? So there are two things that we have right now. The first is these custom text-to-SQL models and techniques that will learn over the dbt scripts you already have inside your organization, as well as your data models and data, to help you write dbt code faster in the future.
And then the second thing is, right now AI is all locked up in the data science teams. These foundation models make it really easy for anyone to do something like a simple classification or sentiment analysis, and sometimes you might want to do that inside of your data transformations. So what we're doing right now is also bringing in these AI techniques.
We're not claiming that you'd want to do something like make a lending decision inside of dbt. But for something simple - like, I have a bunch of text that's customer reviews, and I would like to extract something from it or put the sentiment in another column - we're bringing these really small, shippable models, starting from foundation models, that can then be deployed using dbt and something like Snowflake or Redshift external functions. So the two high-level things are accelerating the writing of dbt code, and then bringing AI capabilities and democratizing them for this group of analytics engineers.
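As a hedged illustration of the "put the sentiment in another column" idea, the kind of small classifier call that could sit behind a Snowflake or Redshift external function might look like this; the model and function shape are assumptions for illustration, not Numbers Station's actual implementation:

```python
# Illustrative sentiment scorer that an external function could call per batch of rows.
# Model choice is an assumption; any small classifier distilled from a
# foundation model would fit the pattern described above.
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def score_reviews(reviews: list[str]) -> list[dict]:
    """Return a sentiment label and confidence score for each customer review."""
    return classifier(reviews)

print(score_reviews(["Great product, fast shipping!",
                     "Arrived broken and support never replied."]))
```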
Sarah Nagy:
Yeah, and for Seek, our focus is also on SQL generation: providing a natural language interface to anybody in an organization, whether they know SQL or not, and allowing them to just ask the questions they'd normally ask the data team and get results more quickly.
And also, on the back end, helping the data team be more productive in overseeing the generation of those answers.
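To make the natural-language-to-SQL idea concrete, a prompt to a general-purpose LLM might look something like the sketch below; the schema, prompt shape, and client usage are assumptions for illustration, not Seek's actual system:

```python
# Illustrative text-to-SQL prompt; not Seek's actual implementation.
# Assumes the OpenAI Python client and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

schema = """
Table orders(order_id INT, customer_id INT, amount NUMERIC, ordered_at DATE)
Table customers(customer_id INT, region TEXT)
"""
question = "What was total order revenue by region last month?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "You translate business questions into SQL. "
                    "Use only the tables provided."},
        {"role": "user", "content": f"Schema:\n{schema}\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)  # generated SQL, for the data team to review
```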
Do you expect that the future will look extremely productive, or will the role of the data team shift more fundamentally?
Sarah Nagy:
Just speaking from my personal experience as a data scientist, I always wished I had more time to focus on things that only I could do for the business. There were a lot of problems that I really wanted to work on, that I was really excited about, because I was like: if I can just figure out this tough problem, I can help my business make a lot of money in a significant way, or save a lot of money, based on the analysis that I wanted to be doing.
And so our vision is that it will actually allow data teams to work on more meaningful tasks. That's what our tagline is: seek what matters, focus on the things that matter. That's the thing that used to frustrate me so much - I just wanted the answer. So Seek's interface is something that anybody can use, including the data team. If you just want an answer - I want this insight so I can get it quickly, make a decision, and then go deeper into my research - that's what we're providing. I don't think people understand how old-fashioned and tedious this is going to look. We're going to look back and be like, "Wow, remember those days when I had to spend three hours writing one SQL query so I could make one very small decision in my research?" So I think this is going to unlock a lot of productivity in that respect, for sure.