Ep 57: AI's impact in the world of structured data analytics
Juan Sequeda joins to discuss his research on Semantic Knowledge and LLMs
We’ve covered Juan Sequeda’s paper on Semantic Knowledge and Large Language Models, along with our own evidence that knowledge graphs and the dbt Semantic Layer provide a substantial boost in our ability to accurately answer natural language questions about enterprise data, which then prompted Jason Ganz’s post on whether we should even care about using LLMs to query enterprise data. It’s all garnered great conversation, so we invited Juan to chat more about where we’re headed.
Juan Sequeda is a principal data scientist and head of the AI Lab at data.world, and is also the co-host of the fantastic data podcast Catalog and Cocktails. Juan's expertise and passion go particularly deep at the intersection of two fields, semantic models and AI.
This episode tackles semantics, semantic web, Juan’s research in how raw text-to-SQL performs versus text-to-semantic layer, and where Juan and Tristan both believe AI will make an impact in the world of structured data analytics.
Key takeaways from this episode.
Tristan Handy: As we're recording this, our company kickoff is next week. And one of the things that I'm doing while I'm on stage is trying to paint a picture of what the life of data practitioners will look like in 2030.
I have this core assumption that there'll be a metadata platform that knows all the things and can supply that context in real time to the places it needs to be supplied.
Juan Sequeda: I think we're definitely on the same page. I just think about it like human nature is to always inventory things. We want to be able to categorize things. We want to be able to connect things. That's just human nature. And why do we do this? Because we want to organize stuff. I want to have an organized closet. Now, whether I achieve that is another thing, but I want to organize things. I want to be able to go find things. I want to be able to go discover things.
I call it a shift from a data-first world to a knowledge-first world. So a data-first world is where we live today. You're telling me you can't solve that problem because you lack data? So if I give you more data, you're going to solve the problem? That's the world that we live in, right? So then a knowledge-first world is one where people are first-class citizens, connections are first-class citizens.
So I start getting more context around that stuff. And why do I care? Why should we talk about data quality? Why does that need to be correct? Oh, because there's business value associated with having the right quality, knowing the semantics. That's the world that we should go to. And we need to start treating that as what I call knowledge work. This is really the knowledge engineering work that people have been doing. Knowledge engineering was a big thing in the 80s and 90s.
I've been pushing the idea that we need a resurgence of that knowledge engineering work, call it knowledge engineering 2.0, using all the technologies that we have today. We're already doing that knowledge work. We just don't realize it, or it annoys us because we think we don't want to go do that. Documentation, in a way, is a type of knowledge work. But I think we need to have that paradigm shift of saying, “Knowledge work is something that we need to treat as a first-class citizen. It's the reason we're not able to fulfill all the promises that we've been trying to go do. And it's the reason why the problems that we're trying to go solve today are the same problems we were trying to go solve 30 years ago.” For me, it's a social paradigm shift and we need to focus on knowledge work.
We've been talking around the edges of AI. Let's fully go there. You and a couple other folks at data.world wrote a paper that ended up going around. Can you summarize the key results there?
The main takeaway is that if you want to do question answering over SQL databases using large language models, you need to invest in knowledge graphs to get higher accuracy.
The folks at Snowflake were the ones who challenged us to go do something like this. We were at the last Snowflake Summit, and they were like, “We get the semantics and this knowledge graph stuff, but how well does it do?”
So the question there is twofold. Number one, everybody starts playing around with large language models and text-to-SQL. Oh, this is easy. We can go do this. The world's going to be much better. But wait a minute. You're all testing on very simple questions and very simple database schemas. This is all very cute. But what does this mean from an enterprise perspective? That's number one.
And second, if you put knowledge graphs and semantics in the middle, how much does that improve things? It's going to improve; I just don't know by how much. So the question was always to what extent. Now, another thing is that text-to-SQL is something the research community has been working on for at least three decades, and the academic benchmarks around this stuff are really disconnected from the enterprise world. So our motivation was to understand to what extent large language models can do this, how much knowledge graphs can improve it, and then make something that relates to the enterprise: a framework that people can reuse to test on their own stuff.
What we did is put questions in four quadrants along two spectrums. One is the complexity of the question, from easy questions to harder questions, harder questions being metrics and KPIs. The other is the complexity of the schema: do I need a couple of tables or a lot more tables? And then we tested that.
And interestingly, the system generally had an easier time dealing with question complexity than with schema complexity. Schema complexity was harder. Is that fair?
Yeah, exactly. By the way, why is this stuff happening? Good question. Nobody knows, because I don't know what's happening inside these large language models. So we can only speculate around this.
Also, what we tested was something super, super simple on purpose, because I wanted a basic baseline that anybody can reproduce. It's just one zero-shot prompt, and the prompt is literally: here's your schema, generate the query. And just by doing that, we saw three times higher accuracy when doing things in terms of a knowledge graph.
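The baseline Juan describes is about as bare as a prompt can get: one zero-shot message containing only the schema and the question. A minimal sketch of what such a prompt builder might look like (the function name and toy schema are hypothetical illustrations, not the paper's actual code):

```python
def build_zero_shot_prompt(schema_ddl: str, question: str) -> str:
    """Assemble a single zero-shot text-to-SQL prompt: just the schema and
    the question, with no examples and no extra instructions, mirroring the
    bare baseline described above."""
    return (
        "Given the database schema below, write a SQL query that answers "
        "the question. Return only the SQL.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )

# Hypothetical toy schema and question, purely for illustration.
schema = "CREATE TABLE claims (claim_id INT, amount DECIMAL, status TEXT);"
prompt = build_zero_shot_prompt(
    schema, "What is the total amount of open claims?"
)
print(prompt)
```

The resulting string would be sent as-is to the model; the knowledge-graph variant in the paper swaps the raw relational schema for an ontology and has the model generate a graph query instead.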
So the whole point is that you have to have invested in that semantics. How much is that investment going to cost? That's the next piece of work we're trying to go do: how do we reduce that investment?
Are you in process of that right now?
Yeah. And that's the knowledge engineering work we want to go do: how do we use all these LLMs, these GPTs, as copilots for knowledge engineering work? I want to be able to define all these mappings. I want to come up with these semantics. I should have that in my system all the time, so I can reduce the time to create that metadata. At the same time, how do I accelerate the management and creation of metadata in my catalog, and do that in a scalable way? If I want to have that accuracy for enterprise question answering, that means I need to invest in data catalogs and metadata, but damn, that's still hard to go do. We need to make that process even easier. And that's a lot of the work we're doing now too.
What if we like skip five years and everything goes exactly as you planned? You do a bunch more studies and everything works well and you continue to productize stuff. Where does this get us?
Okay, so the vision I have is that, first of all, we'll be able to interact not just with our data, but with our organization, our business. You can really ask any question about your business. The first thing I want to do is understand what type of question this is. Is this a question that's related to the body of knowledge I have or not? Are you asking a subjective question, like where should I invest next year? Are you asking a fact-based question that needs data stored in a database, or do you need something that's more on the metadata side?
This is the type of stuff that these AI systems will be doing. And then you start asking these questions and you get a response: “I can't answer that question because I don't know about this particular concept, but I believe it could be here.” So you have this back-and-forth conversation with the system.
Or it says, “Hey, you're asking a question, and guess what? Somebody asked that same question recently and there's an approved, governed answer. So here's the answer you want.” Or: here are related questions that have been asked, and maybe those would be interesting.
And when it doesn't know something, all of that can itself get back into catalogs: here's the list of things that I don't know. And you figure out which of that can get automated and where we need some humans to go in and check it, and so forth. So the vision I see is that at first it'll know very little, but then it'll get to know more and more.
And at the end of the day, all of this comes down to trust. And trust for me is accuracy and explainability, and then the governance part goes in there, because you basically need to have some seal of approval around that.
You don't want to get an executive explanation that involves code.
Agents are interesting to me because there's been some writing about design patterns with LLMs and the idea that in the same way that there are design patterns in software engineering, there could be design patterns in LLMs. If you're designing a software system that can be in conversation with an LLM, you can actually get it to do much, much more interesting things.
Yeah, I think this is the point where computer science comes back in. I actually look at writing code as being commoditized. Coming up with the algorithm to solve the problem, that is still the hard thing. If you break the problem into smaller pieces, and you know that this output here is the input to this thing here, then the LLMs will write that code for you. But you did the work of breaking the problem down. That's true computer science.
What's the thing you're most hopeful for in the data industry over the next five years?
Well, what I hope is that people start treating knowledge work, knowledge engineering work, as a first-class citizen. That data cleanup I have to go do, that data janitorial work? That's critical business knowledge that needs to be managed. And I really hope there's going to be a shakeup; I'm working with folks from industry and academia to make that happen. We need industry to realize that the reason we continue to have these problems is that we haven't been investing in managing that knowledge.
And so that's one shift we need to make: realizing how much money we've left on the table, how much money we're losing because we're not doing that. So that's one thing.
And then, from the university and academic point of view, I want academic courses in bachelor's programs, master's programs, and boot camps on how to become a knowledge engineer. People will start learning these things because they want to get hired for this. Data scientist was the sexiest job in 2012, right? People are doing data engineering now. My hope is it's going to be more on the knowledge side. I hope five years from now, that's going to be the role, and we're going to have the tools for it. That's what's going to make our lives much easier. We'll make much more progress on solving these data problems because we're investing in our knowledge.
This newsletter is sponsored by dbt Labs. Discover why more than 30,000 companies use dbt to accelerate their data development.