Making data movement as reliable as electricity (w/ Taylor Brown)
Iceberg, unstructured data, and the data infrastructure needed for AI, with Fivetran's cofounder and COO Taylor Brown
Fivetran recently passed $300 million ARR and has over 7,000 customers globally. Taylor Brown, the cofounder and COO of Fivetran, joins the show to talk about Fivetran’s moat, the impact of AI on the data ingestion space, and open table formats and catalogs.
This is Season 6 of The Analytics Engineering Podcast. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
We need you—yes, YOU—to take this year’s State of Analytics Engineering Survey. The findings here guide product development and help us all understand where data teams are going.
Listen & subscribe from:
Key takeaways from this episode
The Fivetran mission is to make data movement as reliable as electricity. Is that right?
Taylor Brown: The thinking behind that mission statement is that when we think about Thomas Edison and what he did, he spent all this energy bringing electricity into the house to power light bulbs.
And then what happened after they had electricity in the house was an explosion of additional innovations—hairdryers and washing machines and all the electronics. One of the biggest challenges to innovation is just access to data. And BI, as we thought about it, was really the light bulb of the modern data stack.
I think especially with AI, the innovation set to come is still to be defined. We're at that light bulb moment still.
You've got to have the use case that drives the original infrastructure, but then who the hell knows what the infrastructure is going to be used for next.
Exactly, exactly. In the last 10 years, it's been a lot of light-bulb kind of BI stuff. And the last two years have been this new, fun, more exciting innovation around AI, which makes my life more fun. And, you know, I think what we're doing is more interesting.
You guys are big time. Where's the business? You guys have hit some milestones recently.
We recently passed 300 million in ARR. We have over 7,000 customers now globally. And we're growing at a great clip right now. The last two years were not maybe the best years for Fivetran. It was just a challenging time in the market. And I think there were a lot of folks who pulled back on any sort of innovation.
We've seen a resurgence of growth for ourselves this year, which has been great. A lot of folks are really starting to feel more confident in the market, which ultimately ends up in more spend on innovation, which means we all see more investment in data.
One of the interesting things about this space that you're in is that everybody thinks that they can build data pipelines. And in an environment where somebody up the org chart is looking to save money, there's probably somebody lower down in the org chart that says, “Screw it. I'll roll that myself.”
Is that a conversation that you've had over the years?
It's a conversation we've had and a conversation we continue to have. Especially when you think about the modern data stack where you do more ELT to extract the data and load it with a small amount of transformations directly into your cloud data warehouse. I think a lot of folks that are at senior levels at organizations say, “Hey, you're not even doing the hard part. You're not doing the transformation part. Why would I ever use a tool for that?”
When you get into the details, there's a lot of complexity, as you pointed out earlier, to moving data effectively, doing it accurately, doing it at scale, making sure that you don't miss any data, doing it incrementally instead of batch. There's all this complexity that we put into making it so that we can have this highly reliable replication and copy of your data within the warehouse.
That's a challenge that we have to face with buyers on a constant basis to help them understand why this is cheaper, better, faster than having their own team build it, where the quality can be all over the place. A lot of times engineers don't really want to do this. They see this as a shitty job for them to do.
Moving data from point A to point B isn't how you build a career.
It's a demotion, really. We've got these pipelines and we need you to go do this instead of that really important, mission-critical stuff over here.
Do you have types of metrics that you show?
We have a lot of metrics that we show. We have uptime metrics. We're working on a bunch of latency metrics right now. I’d say for your average company, we can probably do it better and faster than you can.
But sometimes that one data engineer doesn't want to hear that.
One hundred percent. There's two aspects of this. For the really big companies like Facebook, data really is their business. Building the infrastructure around it is their business as well. If you're outside that, in the build-versus-buy scenario, you're going to end up with something that's better, faster, more reliable, because you have this crowdsource effect. We have 7,000 plus customers who are using the same infrastructure. We're able to really battle test it over a large number of customers and an even larger number of connectors. And we're going to catch all those edge cases, right?
The flip side of that is that you have some engineers who think they can still do it better or faster, or just want to have control over the overall pipeline. There's going to always be preference in any stack. And so we certainly see that, but we try to point to more of the objective outcomes that folks see when they set up Fivetran versus building it themselves. What ends up winning ultimately is that customers just try out Fivetran and see how great it is compared to building it.
There are two different motions by which dbt is brought into an organization.
One of them is that somebody in the central IT org gets religion and then they push it out to the business units.
And the other way is that the central IT org is on some different version of the world and doesn't get religion, and yet one of the business units does. Then they adopt dbt and, like shadow IT, constantly try to convince the central team to support it.
Do you see this?
We definitely see that same pattern. Most of these large organizations have a central data warehousing approach. They try to have a centralized approach towards data integration. When we see that, we typically have to go through central IT. There are times where you have a central IT team that has built something, but then as you said, you have a separate team—often the marketing team—with their own warehouse doing their own thing because they are moving so quickly. We get success there and we move our way into helping the central IT teams.
Every company has a different pattern for adoption, and we try to fit into as many different ways as we can. But since we touch a lot of the critical infrastructure for them, it's much harder to do shadow IT for that. There's just so much oversight on making sure that data is secure, protected, following governance, all that kind of stuff.
You have 500 connectors now. Do you make money from connectors 21 through 500, or is it more important to be able to say you have a thousand connectors in a year?
We do make money on the last 200 connectors we've added. Now the amount of money we make per connection is certainly lower. A lot of the systems of record for these older companies are on-premises systems. For them, a lot of the core information that they need to put into their cloud data warehouse comes from those particular systems.
More and more of these companies are starting to adopt additional cloud systems around that. Maybe they'll add Workday or they have Salesforce. They'll add Jira and they start to add some other cloud systems and those also have value to them. They may not be quite as valuable as their bedrock of data that they have in these older systems. I think the newer companies don't have that on-premises system of record problem, because everything they do is in some sort of cloud system.
Fivetran, for example, runs almost everything on cloud systems that we decided to buy. For us, and for a lot of our cloud-native customers, the data is spread across multiple different sources. We need to have every one of those different connections. And so that's where it's valuable for a customer to have a single platform through which they're getting all of their data.
Even if they theoretically could buy three different tools and combine the list of connectors together, folks really want to buy one data ingestion tool, right?
If you think about it, having three different tools, having three different support systems, having three different account managers, you need to train the team three different times on each of those things. And so if you can pick a single tool and be a standard across the organization, it just makes it a whole lot easier.
We have a new SDK coming out that’s in private preview right now for building custom connectors. Because one of the challenges we've faced is even though we have 600 connectors and we're building 100 or 200 connectors a year, there's just endlessly more. There are 6,000-plus SaaS applications and something close to 30,000 APIs available across different business-to-business applications. We're never going to get to 30,000, but if our customers have a platform that they can build on top of that has 70% of the replication built in and the core functionality is there, you know, that's where I think we start to really help our customers use us as the single platform for all their data movement.
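To make the division of labor concrete, here is a toy sketch of the shape such a connector SDK typically takes: the platform owns scheduling, retries, and delivery, while the connector author writes only a schema declaration and an incremental update function. All names here (`Connector`-style `schema`/`update` functions, `Upsert`, the in-memory runner) are illustrative assumptions, not Fivetran's actual SDK API.

```python
from dataclasses import dataclass

@dataclass
class Upsert:
    """One write operation emitted by a connector."""
    table: str
    key: dict
    values: dict

def schema():
    # Declare tables and primary keys; the platform creates them downstream.
    return {"ticket": {"primary_key": ["id"]}}

def update(state, source_rows):
    """Yield upserts only for rows changed since the saved cursor."""
    cursor = state.get("cursor", 0)
    for row in source_rows:
        if row["updated_at"] > cursor:
            yield Upsert("ticket", {"id": row["id"]}, row)
            cursor = max(cursor, row["updated_at"])
    state["cursor"] = cursor

def run(connector_update, state, source_rows, destination):
    """Minimal in-memory stand-in for the platform: apply each upsert."""
    for op in connector_update(state, source_rows):
        destination.setdefault(op.table, {})[tuple(op.key.values())] = op.values
```

The point of the shape is that the hard 70% (cursors, retries, idempotent delivery, table creation) lives in the platform; the author only maps one API into rows.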
A lot of smart people over the years have said ingestion is a commodity. But you folks are empirically proving that there is a really good, defensible business to be built in this category. How do you think about the Fivetran moat?
It's a question we've talked about for years, and certainly a lot of our early investor conversations asked this same question. Is this defensible? Why doesn't one of the hyperscalers build all of the same stuff?
Our hypothesis, which I think has turned out to be true, was that while building these connections is easy in theory, in practice there are 10,000 edge cases, and you only get to a hardened state over many years, with multiple different customers using the same code base and running into all of those edge cases.
And so the defensibility is really time and bug fixes over a long period of time against the same code base.
The type of problems that we run into versus the type of problems that say an Amazon might be focused on is that an Amazon engineer is focused on the kingdom that they're building within, which is their own kingdom. Ours is completely focused on everything outside of our kingdom. We don't control anything. We're just dealing with APIs and databases and all the other things that we don't own.
And so it's a very different problem, and it takes a very different skillset and a very different group of engineers. And that's what we've optimized heavily on over the last 12 years. It's a combination of who's on the team and then also just a never-ending bug fix. When you set up a Fivetran connector, it's been hardened by thousands of customers and it's going to work.
George wrote a blog post called “How Do People Use Snowflake and Redshift?”
It posited things like maybe we don't need to use massively parallel processing engines (MPP) for everything. And maybe vendors will supply their own compute for the workloads that they're responsible for. Have you guys gotten any blowback from this?
So far, we haven’t gotten that much blowback from it. George is obviously extremely bright and he has an insatiable appetite for reading and thinking about technology. I think the combination of both of those ends up leading to him being quite a visionary thinker. He thinks a lot about the data space, all the way down to the database level.
There are a lot of people who are like, “Hey, we need to use a cloud data warehouse because we have so much data.” But when you look at the actual data and the amount of data on average being run in Redshift, for example, it's not that big.
Our laptops have improved significantly over the last five years, 10 years. They can probably run a lot of this compute at the same or faster speed without spending any cost, right? And so these are controversial observations because they go the opposite direction of what we've been saying for a long time.
Ultimately, we care about what is right for customers and where the industry is going. And if something's right for our customers, even if we don't want it to happen, it's going to happen. And so it's better to just face reality and figure out how to live within this new world. Data lakes are a great example of what he was talking about in that blog post. He said you can use your own laptop instead of using a cloud-based warehouse.
I wanted to use the blog post to talk about Fivetran Data Lake Service. Can you tell listeners what that is and how it’s different from the way that Fivetran worked in the past?
When we first started, it was integration with Redshift. We’d just take your Salesforce data and put it into Redshift in a very automated way. This includes the first sync of data, creating the tables, putting it all into your warehouse, and updating all that data. We effectively own that first layer of data within say Redshift.
And then Snowflake came along. The big innovation there was the separation of compute and storage, where you have this now elastic ability to grow both compute and storage within the cloud. And that was really the advent of the modern cloud data warehouse. I think that’s 100 times better than the previous version, which is the on-premises data warehouses.
Many customers want to be able to use their own S3. They don't want to have to take all the data in S3 and put it into Snowflake's S3, then create on top of it. A lot of customers and people have been thinking about this for a fair amount of time.
There was a previous version of just loading it into S3, which I’d call a data lake version one. This version was more like a data swamp where you just put a lot of data in and then you spend all your time trying to understand what data is in there and changing it to make it logical. The next version of this came through open table formats like Iceberg and Delta. These formats take the organizational style that you get in a data warehouse and use it in a data lake. You have DDL statements, updates, and inserts; it's organized in a logical way. So you get the best of both worlds: an organized data warehouse within your data lake.
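The difference Taylor describes can be illustrated with a toy model (this is not Iceberg itself, just a sketch of what a table format layers over raw files): a declared schema enforced on write, keyed upserts instead of blind appends, and committed snapshots you can still query later.

```python
class LakeTable:
    """Toy stand-in for a table format over object storage."""

    def __init__(self, schema, key):
        self.schema = schema      # column name -> type, like DDL
        self.key = key            # primary-key column
        self.rows = {}            # current table state, keyed
        self.snapshots = []       # history of committed states

    def merge(self, incoming):
        """Upsert a batch of rows by key, then commit a snapshot."""
        for row in incoming:
            # Schema enforced on write, unlike a swamp of raw files.
            assert set(row) == set(self.schema), "row must match declared schema"
            self.rows[row[self.key]] = row
        self.snapshots.append(dict(self.rows))

orders = LakeTable({"id": int, "status": str}, key="id")
orders.merge([{"id": 1, "status": "open"}])
orders.merge([{"id": 1, "status": "shipped"}, {"id": 2, "status": "open"}])
# Current state reflects the update; earlier snapshots remain readable.
```

In a real table format the snapshots are metadata files pointing at immutable data files in S3, which is what lets multiple query engines read the same table without copying it.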
And then you can put different query engines on top of that. And there are a few things that had to happen for this evolution to happen. There were a lot of large customers who had a ton of data within data lakes who wanted to access this within downstream warehouses but didn’t want to move the data. They were already paying for the storage here once. They don't want to pay for the storage again.
That customer-first approach really pushed data warehouses to now start to support this concept. And I think that also drove the innovation from the open-source Iceberg community to then build these capabilities and for folks to start to adopt them.
All of these things have come together in the last year. Now customers can load data directly into Iceberg in S3 and Fivetran Data Lake Service effectively does that for them. So instead of loading into Redshift or Snowflake or Databricks directly, we can load to a customer's Iceberg instance.
And this all relies on an open catalog, right? Are you folks using a particular catalog to support this?
Yes, that’s a big part of it. Once the data is in the warehouse, then the question is, well, how do you query it within Databricks, Starburst, Athena, or Redshift? You need to understand the actual metadata there. And so that's where these open-source catalogs have come out. Polaris is one of them. We’ll also support Unity from Databricks.
I believe this will become the postmodern data stack, or the modern data lake stack, or something that everyone moves to over the next few years. But I think there's still a lot to be figured out around how to make this more of a turnkey offering for the ecosystem.
Is it your experience that there are more data leaders who are Iceberg and Delta curious than those who are using it in production today?
Yeah, we're still in the early-adopter phase, with the folks who drove the initial innovation. We are seeing a fair amount of folks using this service within Fivetran, but it's not everyone yet. I think part of that is because many people are not ready to use new technology right away. They will wait a while and then use it.
In the conversations I've had this year with data leaders, they're all thinking about it. One reason is that they want to be able to use data for many different things after they move it to a certain place.
There are also some costs to this. It might be cheaper for them to just load it into their own S3 bucket using cheap compute, rather than loading it directly into a warehouse.
I really agree with what you're saying on the turnkey part of this. If you are a data engineer and you try to roll out your own Iceberg support today, it’s really non-trivial.
We were able to ship some dbt functionality at Coalesce 2024 that you just flip a flag and all of sudden your model outputs to Iceberg. I think that’s the type of stuff that's gonna have to happen across the ecosystem to make this widely adopted, which I'm very excited about.
Yeah, totally. It’s very hard to roll it yourself. I mean, it's very hard on the ingestion side and then it's very hard on the bronze, silver, gold side. There's still a lot of pieces. I think what we've done helps the first part of it. What you've done helps the second part of it. There's still more around the catalogs and all of that. I think it’ll come together and it’ll be exciting, but it's still somewhat early days.
Snowflake popularized the notion of separation of storage and compute. I think about this as the separation of compute and compute. Multi-engine access was never really a thing: you had to pick an engine and go all in on it because otherwise you were moving data around all over the place. And that's just not the case anymore.
Yeah, totally. In one sense, it's interesting because you'd say, well, this is probably worse for warehouses like Snowflake because they're getting less lock-in, right? At the same time, I think it's better in a way because customers don't necessarily want to.
Yeah, make that case to me. I can't see it.
The customer wants to have all the data within their own data lake. It’ll force companies like Snowflake to innovate a lot and continue to drive customer value in the things that customers really care about.
From what I can tell, Snowflake is doing all the right things, focusing a lot on the AI layer on Cortex and building out the key functionality that customers want. If they do this right, they'll get more jobs over time. Many customers already have their own data lake strategy and asked for Snowflake to help them query a lot of data.
And so you forego this old world lock-in for a new world to compete on the things that customers really care about. And that's what makes a business much more lasting.
Let’s pivot to AI. If Fivetran is now landing data in a data lake, do you have any visibility into what people do with it? Are you able to observe folks using this in AI workloads?
Only through talking with them. Using the Edison analogy, we don't know what they're plugging into their outlets. We just know they're using energy; they're using the data that we're moving through it. The AI industry for B2B is still pretty early in a way.
Early on, we were building an internal chatbot. Let's make it super easy. Let's pull the data from all the different sources that we have, like Slack, our internal Wiki, our Docs, and our email and a bunch of other places. And let's just pull those in together and then make those available. We started talking to a couple of different vendors and the vendors asked us to send all our data in a CSV. And we're like, “What do you mean, send you all our data in a CSV?”
We were just so surprised; it sounded very similar to the early days of BI. We've found that many people thought AI was its own industry and the infrastructure for it was its own industry. The way we think about it is that your BI stack and your overall data platform are the foundation that you build your AI on top of.
Now we have a lot of customers who have been successful in building out various AI platforms on top of the data that Fivetran delivers. And that is where I think things really start to get interesting. That’s when companies really think about them as a singular platform, like what we did for our internal chatbot.
I think where a lot of people are going sideways is that they're not thinking about reliable access to their own data. The difference between what companies can do within OpenAI and what they can build with their own data is that their own data gives them a competitive advantage. That's the thing that only they can access. A lot of folks aren't thinking about it at that level yet.
We have not yet unlocked enough downstream use cases to make the infrastructure that both of us are powering have the level of attention on it that it needs to get to the 11 nines of durability that S3 promises.
One of the things that's exciting to me about AI is that it is going to drive a lot more attention onto the quality of the infrastructure that Fivetran and dbt are providing.
I completely agree. Again, I think we’re still in early days where folks are still tinkering with it. Folks are investing a lot but haven't had real gains from it. And I think once it starts to get more traction over the next year or so with the actual applications that companies are building on top of data with AI, that's when the pressure starts to build around the infrastructure underneath it. And that's where it really starts to harden. I'm just not sure we're there yet.
And I think that's where we are seeing folks who are building on top of Fivetran infrastructure being successful with this. I can't talk a whole lot about it, but OpenAI is building on top of Fivetran. That's a pretty good AI use case. And now there's a lot of other companies as well.
It comes back to, as you said, it has to be reliable, it has to work, it has to scale out.
We’ll often get asked about unstructured data when we're in conversations with folks on the topic of dbt and AI. My answer is generally no. People aren't transforming data from customer call WAV files or reading PDFs. Are you playing in the unstructured data world?
We're starting to. We just recently added support for PDFs. A lot of folks had a massive SharePoint with tons of emails. And that's the first step into it.
AI allows you to make unstructured data more structured. You can take all of this data from your email, for example, that's quite valuable to you, and apply some of the same concepts we've done in BI successfully now.
When you apply the right embedding and model on top of this unstructured data, then you can do a whole lot more with it. We're transcribing a lot of our sales calls into text to see what we can learn. And then those are fed into our internal chatbot, which then helps us train and helps our internal team ask questions.
I’d bucket the things that we've talked about so far as Fivetran for AI, but there's this whole other bucket of AI for Fivetran. How’s Fivetran’s product going to change as a result of AI?
Yeah, so it's funny because when AI really started to take off, we sat down with our CTO Meel Velliste, who's very smart, PhD in machine learning. And we said, “If AI is going to put us out of business, let's be the first to do it.”
We built an AI app that we can point at APIs. It will read the documentation and make a full application or a full connector for us. And then we have a human, mostly an analyst rather than an engineer, who goes and looks at it, reviews it, and tweaks it. That's how we're building so many of these long-tail connectors.
So I thought this was a cool new idea for your product roadmap, but you did this a year and a half ago.
Another one was looking at the logs for errors. You can imagine that across 600 different connectors, you get tons of different types of error messages for all different kinds of things. And so it was really hard to surface those errors appropriately to customers within our UI when something went wrong. A lot of them were very unhelpful. A lot of them, our customers couldn't do anything about them. And so we needed to surface those errors to Fivetran internally versus externally. And this has been a hard challenge for many years.
We built an AI app on top of all of our logs that goes through and breaks them down into 51 different types, whereas we had 350 before, many of them duplicates. It’s been hugely helpful for us to debug what's going on and make sure our support team is jumping on the right things.
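The core idea behind collapsing hundreds of raw error strings into a few dozen canonical types can be sketched without any model at all (this is an assumed illustration, not Fivetran's actual pipeline): normalize the variable parts of each message so that superficially different log lines hash to the same type.

```python
import re

def canonicalize(message: str) -> str:
    """Replace the variable parts of an error message with placeholders."""
    msg = re.sub(r"https?://\S+", "<URL>", message)   # URLs
    msg = re.sub(r"'[^']*'", "'<VAL>'", msg)          # quoted values
    msg = re.sub(r"\d+", "<N>", msg)                  # numbers
    return msg

logs = [
    "Timeout after 30s fetching https://api.example.com/v1/users",
    "Timeout after 45s fetching https://api.example.com/v1/orders",
    "Column 'email' not found",
    "Column 'phone' not found",
]
error_types = sorted({canonicalize(m) for m in logs})
# Four raw messages collapse into two canonical types.
```

An LLM-based classifier, as described above, goes further by grouping messages that differ in wording rather than just in parameters, but the payoff is the same: a small, stable set of types that can be routed to customers or to internal support.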
Humans are really good at fixing things if they know what the problem is. And machines are really good at scanning through tons of data and understanding the patterns and what's happening.
I think a lot more of that will continue to happen, especially as we add more and more data sources and more complexity. We're really focused on making the latency as short as possible.
You folks have been on the record over the years as being a little contrarian on streaming. Streaming often gets a lot of attention. There's a lot of hype that faster is always better, but there's been some scrutiny around that too. What are you folks seeing now that's making you pay more attention to latency?
In general, 5-10% of organizations need streaming data where it's in real-time. There's some workload or on-the-floor dashboard that folks are looking at in their manufacturing plant or whatever.
But a lot of times, executives across organizations will say they need real time. But what is the actual outcome of this? The problem with real-time streaming is that there's a really high cost. It's a ton more data. There's a lot more tooling you have to build. It's a lot more complicated. Generally what we found is that for 90% or more of cases, micro-batches work quite well, down to 15 minutes or even one minute.
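The micro-batch pattern Taylor describes is simple enough to sketch in a few lines (illustrative only; real pipelines add retries, schema drift handling, and exactly-once delivery): on each tick, pull only the rows changed since a saved cursor, apply them as idempotent upserts, and advance the cursor.

```python
def sync_batch(change_feed, destination, state):
    """One micro-batch: copy rows newer than the cursor, advance it."""
    cursor = state.get("cursor", 0)
    batch = [r for r in change_feed if r["updated_at"] > cursor]
    for row in batch:
        destination[row["id"]] = row          # idempotent upsert by key
        cursor = max(cursor, row["updated_at"])
    state["cursor"] = cursor
    return len(batch)

# Treat the source as a change feed that accumulates row versions.
feed = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
dest, state = {}, {}
n1 = sync_batch(feed, dest, state)            # first run copies everything
feed.append({"id": 1, "updated_at": 30})      # row 1 changes at the source
n2 = sync_batch(feed, dest, state)            # next run moves only the change
```

Because each run moves only the delta, shrinking the interval from 15 minutes toward seconds mostly changes how often the loop fires, not how much work it does, which is why sufficiently fast micro-batches can displace true streaming for most workloads.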
Now we're realizing that if we can get down to five-second latencies, that may move away from needing to have streaming. Streaming may become 1% of your overall use case. Customers generally want things to be faster. We're doing the hard work to get us there.
What’s something that you hope is true of the data ecosystem over the coming five years?
I hope that the data lake ecosystem turns into the core ecosystem that people are building on top of. I think it’d be better for our customers ultimately. And I think it provides a lot of optionality and obviously the tooling has to all build around it as well. So I think, you know, in five years, that's what I’d hope for.
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.