Ep 42: Cloud Warehouse Cost Optimization (w/ Niall Woodward + Brad Culberson)
Brad Culberson is a Principal Architect in the Field CTO’s office at Snowflake.
Niall Woodward is a co-founder of SELECT, a startup providing optimization and spend management software for Snowflake customers.
In this conversation with Tristan and Julia, Brad and Niall discuss all things cost optimization: cloud vs on-prem, measuring ROI, and tactical ways to get more out of your budget.
Key points from Niall and Brad in this episode:
How do you think this overall subject has changed in attention over the past 12 months, and why?
I think there's much more attention pivoting toward running systems economically.
I think we were in a phase of the economy where everybody was trying to grow at all costs for many years. Companies had what I'd call euphoric growth. They were growing massively fast. They were trying to basically throw money at problems, right? They didn't have enough engineering time or effort and weren't focused on running systems economically for years and years.
Most organizations' biggest goals for many years were growth and revenue. And we're seeing a rotation away from that, due to the economic climate we're in today, toward a higher focus on running systems more economically. I think even before the last year, we were seeing this trend, I'd say for the last decade, from people just moving from on-prem, fixed capacity systems into the cloud.
We were seeing this happening to customers moving from constrained environments (on-prem) into the cloud; they started getting a lot of access to things like unlimited compute. And that transition started raising some awareness of “we may need to start thinking about this cost monitoring thing because it's no longer a fixed budget that I just paid for in a data center”.
I think in the last 12 months we've seen a rotation even heavier toward it. But also I've been in architecture for many years and a lot of my time is around efficiency and cost monitoring of organizations and companies. I think a lot of the conversations I have with customers are focused on that now, but they have been like this for five or 10 years.
Data, as often happens, is experiencing things similar to software engineering, but it's getting to them a little bit later. Do you think that's a fair read?
I'd say so, yeah. We need more cost governance and oversight for sure. And we also have more first-time database contract owners in there as well. And often people who are owning these vendor contracts haven't managed a data platform or a software platform to your point, which requires careful oversight.
Is that an unsolved problem today in getting the kind of visibility and monitoring that you would like in the warehouse? Or what are some ways that companies can do a better job of figuring out where there are opportunities?
I'd suggest starting with the native monitoring that's in your warehouse.
So Snowflake has resource monitors and some high-level cost exploration dashboards that you can use. The challenge then is how do you action that data? So say a warehouse becomes more expensive; actually working out why can be really hard. Especially in a warehouse like Snowflake, which bills at the compute level, rather than something like BigQuery where you can easily see your cost based on the bytes that have been processed.
So we've actually open-sourced a cost attribution model in the package I mentioned earlier, which is the dbt-snowflake-monitoring package, which attributes warehouse costs down to the query level. And once you have costs at that finest grain, you can roll them up to data assets based on the metadata associated with those queries.
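To make the idea of query-level attribution concrete, here is a minimal sketch in Python. It is not the package's actual model: it simply assumes that each warehouse's credits for an hour are split across the queries that ran in that hour in proportion to their runtime. The field names (`query_id`, `warehouse`, `hour`, `seconds`, `model`), the sample data, and the credit price of 3.0 are all hypothetical.

```python
from collections import defaultdict

def attribute_costs(queries, hourly_credits, credit_price):
    """Split each (warehouse, hour) credit bill across queries by runtime share."""
    # Total query runtime per (warehouse, hour) bucket.
    runtime_totals = defaultdict(float)
    for q in queries:
        runtime_totals[(q["warehouse"], q["hour"])] += q["seconds"]

    costs = {}
    for q in queries:
        key = (q["warehouse"], q["hour"])
        total = runtime_totals[key]
        share = q["seconds"] / total if total else 0.0
        costs[q["query_id"]] = share * hourly_credits.get(key, 0.0) * credit_price
    return costs

def roll_up(queries, costs, tag):
    """Sum query-level costs up to a metadata tag, e.g. the data asset built."""
    totals = defaultdict(float)
    for q in queries:
        totals[q[tag]] += costs[q["query_id"]]
    return dict(totals)

# Hypothetical sample data: three queries on one warehouse over two hours.
queries = [
    {"query_id": "q1", "warehouse": "TRANSFORM", "hour": 9, "seconds": 300, "model": "orders"},
    {"query_id": "q2", "warehouse": "TRANSFORM", "hour": 9, "seconds": 100, "model": "customers"},
    {"query_id": "q3", "warehouse": "TRANSFORM", "hour": 10, "seconds": 50, "model": "orders"},
]
# Credits billed per (warehouse, hour), as a metering view might report them.
hourly_credits = {("TRANSFORM", 9): 2.0, ("TRANSFORM", 10): 1.0}

query_costs = attribute_costs(queries, hourly_credits, credit_price=3.0)
model_costs = roll_up(queries, query_costs, tag="model")
print(query_costs)
print(model_costs)
```

Note that under this assumption idle warehouse time is spread across whichever queries did run in the hour, so a lightly used warehouse makes each query look more expensive.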
Do you think people are having the right conversations around cost optimization as an absolute scale, or should they have a different framework of thinking about ROI?
One thing I would think about, and I liked your position on ROI, is that almost any system can be optimized forever. And often, I'll tell customers I could probably work on this one query for the rest of my life.
And there are just so many ways to optimize things, right? But the real question here is: is this worth our time, is it worth our engineering effort to actually do this optimization and will we come out ahead, does this project already have good margins, is it already fulfilling the business's goal?
When we go into new projects, it's amazing when there is actually a budget and we know okay, what's the margin that's desired here? And what's the revenue this may drive? Or what's the business willing to spend to solve this shape of a problem? One thing that's amazing about us paying for specifically that project and seeing that cost accountability so clearly is we can see if they run over.
So organizations can start tracking this and say: Hey, we said we would spend this much, you're tracking towards it or you're not. Let's talk about this. And historically, that's been very hard because these have all been very highly shared systems and you may have some people in there that were using a huge percentage of it, affecting other users. But no one really knew that was happening.
So I like the clear visibility that I think the cloud has provided and what we now have around these cost controls. And I like the rotation of companies to start looking at what's the budget spend. Because any system can probably be optimized forever to get better and better margins, but it may take a thousand engineering hours to gain a 10th of a percent.
That probably isn't worth that spend. So it's worth thinking about that a little bit. If a customer is successful and they're happy with something, I often am like, why are we here talking about optimization? It sounds like you have great margins and it sounds like you're fulfilling your business need.
That's amazing. Is there something else? Can we find something else to drive more business and get you more revenue and be more valuable? We could be getting to a point of diminishing scale and returns and using engineering resources in ways that are not as efficient. Gaining a 10th of a percent of efficiency on one workload, when you could use that engineer to build a completely new product line or something, follow a completely new research project, those things need to be thought out at the organizational level.
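The "thousand engineering hours for a tenth of a percent" trade-off can be put into rough numbers. The sketch below is only illustrative: it reads the tenth of a percent as a fraction of annual warehouse spend, and the annual spend and hourly engineering rate are hypothetical figures, not numbers from the conversation.

```python
def optimization_payback(annual_spend, saving_fraction, engineer_hours, hourly_rate):
    """Return (annual saving, cost of the effort, years to break even)."""
    annual_saving = annual_spend * saving_fraction
    effort_cost = engineer_hours * hourly_rate
    # If nothing is saved, the work never pays for itself.
    payback_years = effort_cost / annual_saving if annual_saving > 0 else float("inf")
    return annual_saving, effort_cost, payback_years

# Hypothetical inputs: 1,000,000 per year of warehouse spend, a 0.1% saving,
# 1,000 engineering hours at a loaded rate of 100 per hour.
saving, cost, years = optimization_payback(
    annual_spend=1_000_000,
    saving_fraction=0.001,
    engineer_hours=1_000,
    hourly_rate=100,
)
print(f"saves {saving:,.0f} per year, costs {cost:,.0f}, pays back in {years:,.0f} years")
```

With these inputs the work costs about a hundred times what it saves in a year, which is the point being made: the same hours would almost certainly return more on a new product line or project.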
Is the onus on the warehouse to try to help the customer make the best decisions, should the customer make the decision, or should a third party be the one that can help people decide on how to pick the warehouse sizes or where to make the investments?
I think the learning curve for teams building on Snowflake is a historical one, where often they just pointed at one warehouse that had a fixed amount of compute and then the thing just ran, right? Now there are a lot more options. So Snowflake's giving a lot of capabilities to those end users now to say: hey, do you actually want super high concurrency? Do you have a really complicated query you'd like to run extremely fast? Would you like to throw a lot of compute power all at that one query? And is it efficient to do?
Some queries do distribute really well and can run very well over thousands of cores; some can't. I think right now the onus is on the builder to make those decisions and those tradeoffs of saying: is this a complicated enough thing that I want to run a larger warehouse, or do I want to run this with higher concurrency, right?
I think in dbt’s case, I've worked with some people that have some projects where they actually want to spin out a fairly large amount of concurrency in there where they're building several different models and actually getting concurrency there, many clusters inside a warehouse is better than having a bigger one in that case.
Right now, I think that the onus is on the engineer. I do think third parties could probably help guide. Sometimes I'll look at a query and I'll see a really complicated query running on an extra small, and I'm like, oh, obviously, it would've been very economical to run that on a much, much larger warehouse. And you would've been happier with the performance. I see that because I've been using Snowflake for many years. It's very intuitive. I see it in a second and I'm like, oh, obviously.
But some customers haven't used Snowflake for years and the right warehouse sizing isn't obvious to them. And right now, I think third-party tools or more mature teammates are helping them say: Hey, I've been using Snowflake for a while. Let's think about this other pattern and see if that actually can solve our use case better.
I think one thing I do like about Snowflake is that, in this case, it's very trivial to change your mind. In architecture, we talk about two-way doors versus one-way doors. It's very easy to run that project with different warehouse utilization and use different warehouse sizes and configurations and then click the go button and see the performance of it and see what your costs were.
And it should take no more than a configuration change to completely change the way it's executed, and the type of compute it's running on. And then the end user can decide if that is more economical for them. I can't say how fast that job needs to run. A customer may be like, this has to run in 10 seconds, or this needs to run in 45 minutes. Those are two very different goals that would require a completely different plan of action.
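The sizing trade-off described here can be sketched with a little arithmetic. Snowflake's standard warehouses double in credits per hour with each size step (1 for an extra small up to 128 for a 4XL), and billing is per second with a 60-second minimum each time a warehouse resumes. The runtimes below are hypothetical, and the sketch assumes the warehouse resumes for this one run and suspends straight afterward.

```python
# Credits per hour for standard warehouses; each size step doubles the rate.
CREDITS_PER_HOUR = {
    "XS": 1, "S": 2, "M": 4, "L": 8,
    "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128,
}
MIN_BILLED_SECONDS = 60  # minimum charge each time a warehouse resumes

def run_credits(size, runtime_seconds):
    """Credits for one run, assuming the warehouse resumes for it and then suspends."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600

# Hypothetical complicated query that parallelizes well:
# 3200 s on an XS, 400 s on a Large (8x the compute, 8x faster).
xs_cost = run_credits("XS", 3200)
large_cost = run_credits("L", 400)

# Hypothetical small job that finishes in 20 s on any size.
small_job_on_xs = run_credits("XS", 20)
small_job_on_4xl = run_credits("4XL", 20)

print(f"complicated query: XS={xs_cost:.3f} credits, L={large_cost:.3f} credits")
print(f"small job: XS={small_job_on_xs:.3f} credits, 4XL={small_job_on_4xl:.3f} credits")
```

When a query scales close to linearly, the larger warehouse costs about the same and finishes far sooner. When it does not, as with the small job, the extra compute is paid for and left idle, which is why the only reliable check is to rerun the job on a different size and compare.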
Do you think the elastic cloud resource model makes any sense in a data context?
Yeah, for sure. There are other inputs to this equation; it's no longer just about spot versus reserved.
It's also about the architecture of those instance types. Different cloud providers have different performance characteristics across different instances. There are many levers in there for customers that are thinking about this instance sizing. Snowflake does try to simplify this dramatically for end users. We think our warehouses are fairly easy to understand in size. We'll deal with the complexity of picking the right instance families for customers, picking whether we should be using reserved or on-demand or spot instances for them.
And then we think we're economically driving that number to something that makes sense for end customers. In general, one thing Snowflake does, which is amazing is we actually have our own hot pools of resources: when you ask for a warehouse in Snowflake, you get it. That's phenomenal!
If we had to go procure an instance, spin it up, install some software, and attach it, it wouldn't be a second, right? You could not have the same dynamic access to compute as you have in Snowflake. And these pools allow us to have customers build things that have massive spikes in load: you can have something that comes online and ask for a thousand cores, have it very quickly execute, spin down, and be done.
All of that comes in and goes away very quickly. Very rare in our space to have that dynamic access to compute. And oftentimes, customers want to do this even more economically. And I'm like, let's put your workload on Snowflake; I'll prove to you that it's economical and it solves your business case cheaper than other technologies.
And then you don't have to worry about the complexity of all these other things which we are dealing with in the background for you.
Are there other things data teams should be thinking about in these uncertain economic times that are maybe above and beyond just focusing on how much they spend on a warehouse?
I think marketing teams are great examples of teams who do investment and return attribution really well. They measure everything, often with help from the data team. But data teams often aren't so good at measuring themselves. I think part of that is because it's really hard. You can't say we sold this thing because of this dashboard. But there are other things that you can do to identify value. Business operations are one, I think, especially now when we're not just using data for BI, but for driving workflows. Identifying operations that data teams enable is a great way of identifying value within a business.
Reducing inefficiencies, just reducing people's time to act on support tickets and things like that. And I'd also recommend that businesses create a way for their users to share the value that they've achieved through the data platform back to the data team. And creating a feedback loop there to help people really focus on what's driving value is really important.
I have a few tips. I think as customers are more sensitive towards costs, the teams building these things should start focusing on a few different areas. I think one is to start with smaller projects that have smaller datasets, and maybe don't train a model on the entire dataset to start with.
Start smaller and actually understand the costs of a smaller running system before you grow the dataset to petabytes. I think one thing about the cloud is we can help you solve a problem of any size, but the challenge there is to understand the cost before you jump into that. And you can understand that usually really quickly on a small percentage of the dataset or a smaller piece of the dataset.
Also, assume the smallest warehouses and make some decisions here where you're just not throwing compute power at something. Don't go run a transformation pipeline and just choose a 4XL. That's probably not the default choice.
Start with the extra small and then ramp up if that doesn't perform well, right? Because underutilizing those large warehouses is leaving a lot of wasted credits on the table that could have been used for other purposes. But I love the business visibility. I think early in the project, setting a budget or a margin goal or something helps out considerably here, so everybody knows what that goal is tracking towards and is looking at a dashboard, hopefully, that visualizes it so they can quickly make a decision by saying: is this going well? Is it not going well? Do I need to dig into something, or do I need to stop a project or revisit something? It's pretty important, because often we have dashboards people look at and they're like: I don't know what the goal is, so I don't know if this is good or bad. The person looking at the dashboard should probably know if this is good or bad, whether it's tracking to the right number or not, and whether they need to dig in, and if something's urgent or not.
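One way to give a dashboard that "good or bad" signal is to compare projected spend against the agreed budget. The sketch below is a minimal illustration, assuming spend continues at the month-to-date daily average; the function name, the 20% warning threshold, and the sample figures are all hypothetical.

```python
import calendar
from datetime import date

def budget_status(spend_to_date, monthly_budget, today):
    """Project month-end spend from the month-to-date run rate and label it."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Assumes spend continues at the same average daily rate for the rest of the month.
    projected = spend_to_date / today.day * days_in_month
    ratio = projected / monthly_budget
    if ratio <= 1.0:
        status = "on track"
    elif ratio <= 1.2:  # hypothetical tolerance before escalating
        status = "watch"
    else:
        status = "over budget"
    return projected, status

# Hypothetical project with a 15,000 monthly budget, checked on day 10 of a 30-day month.
healthy = budget_status(4_000, 15_000, date(2024, 6, 10))
overrun = budget_status(7_000, 15_000, date(2024, 6, 10))
print(healthy)
print(overrun)
```

Showing the projected figure and the label next to the raw spend tells the viewer at a glance whether anything needs attention. A linear run rate is a simplification, and workloads with month-end spikes would need a different projection.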