Ep 48: Bring your own data to LLMs (w/ Jerry Liu of LlamaIndex)
Jerry Liu is the CEO and co-founder of LlamaIndex. LlamaIndex is an open-source framework that helps people prep their data for use with large language models in a process called retrieval augmented generation. LLMs are great decision engines, but in order for them to be useful for organizations, they need additional knowledge and context.
In this conversation with Tristan and Julia, Jerry discusses how companies are bringing their data to tailor LLMs for their needs, as well as what’s happening with vector databases, fine-tuning models, retrieval augmented generation, agents, and more.
Key takeaways from this episode:
When you want to ask a question of an LLM, there's only so much context you can provide in the prompt. There's a limited amount of information you can send. How do I make sure that I'm providing the right context to get the best results?
Jerry Liu: A very concrete use case I was trying to solve when I was building this application: I was trying to build a sales bot that could ingest customer conversations from my previous company. Then I could have the LLM, GPT-3, ingest all this data and try to understand the customer conversations. It could synthesize these insights and also summarize the to-dos for the next meeting with each customer. That was a very concrete use case I wanted to try out. But as I was feeding in these customer conversations, I quickly realized that you couldn't feed all this information into the context window of the language model, which was capped at 4,000 tokens.
If you had more data than that, then you needed to figure out some strategy for sequentially feeding in the information, or figure out how to retrieve the right data to actually solve the task at hand.
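To give a sense of the problem, here is a minimal sketch of splitting a long transcript into pieces that each fit within a token budget, using the tiktoken tokenizer. The file name and chunk size are illustrative assumptions, not anything specific to LlamaIndex.

```python
import tiktoken

# Split a long document into chunks that each fit a token budget, so they can
# be fed to the model sequentially or indexed for later retrieval.
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 1024) -> list[str]:
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# "customer_call.txt" is a hypothetical transcript file.
chunks = chunk_text(open("customer_call.txt").read())
print(f"{len(chunks)} chunks, each at most 1024 tokens")
```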
The reason I talked a bit more broadly about the general CPU architecture is that it very much inspired the idea of GPT Index (LlamaIndex’s previous name). It was a grander theoretical vision of how to create an overall system around the LLM, and GPT Index was an initial stab toward that.
I can say, though, that the toolkit itself has evolved from those initial project beginnings into a very concrete and useful toolkit. I'm happy to chat a bit more about all the value it offers around connecting data to your LLMs.
How do you think about getting LLMs to access your data?
Jerry Liu: When I frame LlamaIndex, or this problem statement of how you get LLMs to access your data, at a high level there are really two approaches. One is exactly what I talked about, which is setting up some sort of data pipeline for in-context learning. In-context learning is just a fancy term for figuring out how to shove stuff into the input prompt so that you get back some output, and creating a software system around that.
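As a minimal sketch of what such an in-context learning pipeline does, the code below retrieves the most relevant chunks for a question and stuffs them into the prompt. The hashed bag-of-words embedding is only a stand-in assumption; a real pipeline would use a proper embedding model and a vector store.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding (hashed bag of words), assumed for this sketch.
    # In practice you'd call a real embedding model here.
    vec = np.zeros(512)
    for word in text.lower().split():
        vec[hash(word) % 512] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank chunks by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: float(np.dot(q, embed(c))), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    # "Shove stuff into the input prompt": retrieved context plus the question.
    context = "\n---\n".join(retrieve(question, chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```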
The next part is more on the model architecture and training side. The reason this strategy is very powerful is that it's this black box that uses some sort of gradient optimization. Underneath the hood, it's just a bunch of numbers, weights that power the model.
Because it learns over a data corpus, it can now answer a bunch of questions about that data corpus. These days there's growing interest in things like fine-tuning, training in general, and also model distillation, if you think about ways to incorporate new knowledge into a language model.
Julia Schottenstein: Can you define model distillation and fine-tuning for our users?
Jerry Liu: Sure. First there's model training, in which you have a model initialized with some random parameters. By default, it will just output random noise, and then you have some sort of training process.
This is how GPT-3 was trained: you give it some data and you train it, and it can now understand this data, take in some input, and give you back a comprehensible output. For fine-tuning, there's a variety of ways to do this, but you take a model that's already been trained on some data and then figure out a way to train it more on some new data. For instance, you could have this model trained more on data that you own, as an individual or an enterprise. That's what you call fine-tuning. There are different ways to do it. Let's just take GPT as an example.
If you imagine the entire architecture, you could fine-tune all the weights, but that will be expensive and slow, because then you're basically changing everything. Alternatively, you could fix part of the weights and fine-tune some sort of layer on top. That's an example where you take advantage of the fact that most of the representations have already been learned during the pre-training process, when GPT was trained.
Then you also fine-tune this additional layer to actually learn over your own data representations. There are some other variants of what I just said, and that's basically what fine-tuning is. Model distillation is an approach of taking a bigger model and distilling it into a smaller one: training a smaller model to imitate a bigger model. This can be useful because it's a form of compressing a lot of information into something that's cheaper and faster. For instance, if GPT-4 has a trillion parameters, and you want to distill GPT-4 into a smaller model that has context over your data, you could try to do that.
Some companies are very interested in saving on hosting costs and latency, and they also don't want to pay OpenAI money. You can try to distill this bigger model into a smaller one that you can then host yourself or just run a lot faster and cheaper.
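To make the distillation idea concrete, here's a minimal sketch of a standard distillation loss in PyTorch: the student's softened output distribution is pushed toward the teacher's. The random logits, vocabulary size, and temperature are illustrative assumptions, not anything specific to GPT-4.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student's
    # distribution toward the teacher's via KL divergence.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: random logits stand in for real model outputs.
teacher_logits = torch.randn(4, 32000)                       # large "teacher" model
student_logits = torch.randn(4, 32000, requires_grad=True)   # small "student" model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```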
For people constructing systems, and for people building the tooling that lets them construct systems, is there a map of what is most appropriate to do?
Jerry Liu: When I talk about these two approaches, the jury is still out because the research space is moving very fast in terms of the capability and power of fine-tuning, as well as new discoveries around this idea of retrieval augmented generation.
Tristan: Both sides of this coin are getting better at the same time.
Jerry Liu: People have been creating new software systems around this idea of in-context learning. A lot of agent-based stuff is just fancy ways of doing in-context learning. Then there have been a lot of improvements in fine-tuning techniques to make it better, faster, cheaper, and more efficient.
If you look at a lot of the open-source models that have come out these days, like open-source LLMs, a lot of them are just fine-tuned variants, and they're using state-of-the-art techniques like LoRA (low-rank adaptation) that make it a lot faster and cheaper for anyone to fine-tune.
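As a rough illustration of what LoRA-style fine-tuning looks like in practice, here's a sketch using the Hugging Face transformers and peft libraries. The base model, rank, and target modules below are example choices, not a recommended configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Wrap a small pretrained model with LoRA adapters: the base weights stay
# frozen and only the low-rank adapter matrices are trained.
base = AutoModelForCausalLM.from_pretrained("gpt2")

lora = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a tiny fraction of the full parameter count
```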
There's a huge incentive from both sides to make both of these techniques better because the end goal is to make LLMs and access to your data more accessible to everybody.
What do you think needs to evolve to get to the promised land, where these LLMs are not just in constrained environments but are helping solve very common tasks and work decisions?
Jerry Liu: I think there's a growing sentiment these days that the future of LLMs and agents is going to be multi-agent. You're actually cool with living in a world where different LLMs and agents specialize in different tasks, as long as they can communicate with each other, either peer-to-peer or in some sort of hierarchy.
That's an easier problem to solve than trying to have general things that can do everything. If you think about humans, we're general intelligence machines, basically, but we tend to specialize in things.
A software engineer is going to learn coding skills. A musician is going to learn a certain instrument. It's easier for an LLM to be constrained and be very good at certain tasks. For instance, maybe this agent is very good at sending emails and scheduling things for your client. It can basically communicate with other agents through an API interface. That world seems more practical and feasible because it's realistically what a lot of people are going to build.
Users will not only use technologies like GPT-4 or these general LLMs, they'll also try to fine-tune these LLMs on their own private data. They're going to start building these things and baking in AI and LLM workflows for constrained tasks.
Going back to your question of what's going to create these improvements in reliability, I think it's going to be twofold. One is that the model technology will get better; as the reasoning capabilities of these big models and open-source models improve, more and more people will be able to build reliable stuff pretty easily.
The other part is really proper software and API design between different agents. If we can create good interfaces where people are building specialized LLMs and agents for these different tasks, and also design good communication protocols between them, then you're going to start to see greater capabilities come about, not just through a single LLM or agent, but through the communication system as well.
Tristan: Do the agents need APIs or is the API just text?
Jerry Liu: It's a good question. If you've heard about the OpenAI function calling API, or even just a standard ReAct agent, an agent is basically a reasoning prompt with an LLM that has access to a set of tools.
It can choose to call these tools, and calling a tool just means the LLM will infer the set of parameters to use with that tool. That's basically an API call. By default, calling a tool could just be with text, but it could be with other parameters too.
It could pass in numerics like integers and floats, booleans, and other strings. If you think about any sort of proto file, it can theoretically infer the right parameters to actually call this tool with.
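As an illustration, a tool's typed parameters can be declared as a JSON schema in the style of the OpenAI function calling API. The tool name and fields below are made up for the example, and the exact wrapper fields vary by API version.

```python
# A tool declared with typed parameters. The JSON Schema "parameters" block is
# what lets the LLM infer correctly typed arguments (strings, integers, booleans).
schedule_meeting_tool = {
    "name": "schedule_meeting",
    "description": "Schedule a meeting with a client.",
    "parameters": {
        "type": "object",
        "properties": {
            "client_email": {"type": "string"},
            "duration_minutes": {"type": "integer"},
            "send_invite": {"type": "boolean"},
        },
        "required": ["client_email", "duration_minutes"],
    },
}
```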
For instance, another example is an SQL agent that has access to an SQL database as a tool. To properly call the SQL database, it will pass in an SQL statement to execute. The query engine over the SQL database will execute the SQL statement and return the result.
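Here's a minimal sketch of the tool side of that SQL example: the agent produces a SQL statement, and a small "query engine" executes it and returns the rows as the tool's result. The in-memory database and toy table are assumptions for illustration; in a real agent loop the SQL string would come from the LLM.

```python
import sqlite3

# Toy database standing in for the real data the agent would query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, last_contact TEXT)")
conn.execute("INSERT INTO customers VALUES ('Acme Co', '2023-06-01')")

def run_sql_tool(sql: str) -> list[tuple]:
    # The "query engine": execute the agent-generated statement, return rows.
    return conn.execute(sql).fetchall()

# Here the SQL is hard-coded; an agent would infer it from the user's question.
print(run_sql_tool("SELECT name, last_contact FROM customers"))
```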
That's just an example of how agents can communicate with each other through this tool interaction. The thing is, these tools can also be other agents. That's how you can get this communication interface between them.