Data engineering at Snowflake (w/ Rahul Jain)
An inside look at the data work happening at Snowflake
Rahul Jain is a data engineering manager for Snowflake's internal data organization. He joins Tristan to discuss the Indian tech scene, Iceberg, streaming, AI, and how Snowflake’s data team does data work.
This is Season 6 of The Analytics Engineering Podcast. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Key takeaways from this episode
There's this funny thing when you're a user of a software product but you also work for the company that makes it: you end up with a dual role. You have to be a data engineering manager, but you also have to explain and advocate for the platform you're using.
How do you balance these responsibilities? Are you mostly spending your time delivering data outcomes to the business, or are you mostly spending your time on stages in front of audiences?
That's one of the reasons I joined Snowflake. Before joining Snowflake, I was a Snowflake customer. My team was implementing a data platform on Snowflake. It may sound a little cliché, but when I was introduced to the Snowflake platform, it was love at first sight: the ease of use, and so many other things.
At Snowflake, my core responsibility is building data products and data-driven solutions that help Snowflake's internal businesses across different verticals. But additionally, one of the roles I play here is talking about our use cases on the platform. I work closely with the marketing, sales engineering, and sales teams. I give keynotes and breakout sessions at global events, and I stay close to the developer community.
That's explicitly a part of your role?
That's not part of my role; my role is evolving into that. On paper, going by the title I hold, it's not part of my role. But I love doing it, and the leadership here is very, very appreciative. If you are a proud user of something, whether it's a tech product or any day-to-day utility product, you automatically try to market it. The use cases I build here at Snowflake, I just go and talk about them to the world, to the data community.
Let's do it. Tell me, how does Snowflake do data engineering?
First of all, before I jump into it, I just want to mention that I'm here in my personal capacity. This is not sponsored by Snowflake. Since we're a cloud data platform, we take data very, very seriously. This is the sixth organization I've worked at in the last 14 years, and it is truly a data-driven organization.
We practice data. We live and breathe data. Not only the data engineering team: all the functions, be it sales, sales engineering, marketing, finance, or workplace, try to have this data-driven mindset. My team is a horizontal team within Snowflake, and it supports different verticals: GTM, finance, legal, and others.
We have a centralized repository where all the data that belongs to Snowflake comes into a single-tenant, single platform. Then, based on the domain (you can think of domains as the verticals), we cater to them. Most of the time we create data models.
My team spends 80% of its time in analytics engineering: creating data models, common data models, and aggregations, and then applying data quality, observability, and data governance. We share these with the business units so that they can create their own analytics if they have their own analytics team. Or sometimes we are engaged in enriching their source system data: we reverse ETL, pushing this golden data back to their source systems, like Workday, Salesforce, ServiceNow, or Jira.
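As a rough illustration of what one of those shared, governed models might look like, here is a minimal dbt sketch. The model, source, and column names (stg_salesforce__accounts, stg_billing__subscriptions, and so on) are hypothetical, not Snowflake's actual internal models; in practice a schema file with not_null and unique tests would sit alongside it to cover the data quality piece.

```sql
-- models/marts/gtm/fct_account_arr.sql (hypothetical names throughout)
-- A "golden" aggregation built from staging models and shared with a business vertical.
with accounts as (
    select * from {{ ref('stg_salesforce__accounts') }}
),

arr as (
    select
        account_id,
        sum(annual_recurring_revenue) as total_arr,
        count(*)                      as subscription_count
    from {{ ref('stg_billing__subscriptions') }}
    group by account_id
)

select
    accounts.account_id,
    accounts.account_name,
    coalesce(arr.total_arr, 0)          as total_arr,
    coalesce(arr.subscription_count, 0) as subscription_count
from accounts
left join arr using (account_id)
```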
Snowflake has been using dbt for a long time. I'd be interested to hear whether your use of dbt is different or novel given your unique role in the ecosystem. And is there other tooling in your stack that's worth talking about?
Yeah, the stack is very big, but we've used dbt since the beginning, especially for data modeling and analytics engineering. We are very, very satisfied users of dbt. My team especially spends almost 70% to 80% of their time writing and deploying models in dbt.
What do you think about the talent market for dbt in India? It's still a new-ish product. There are probably deep benches of talent in India using different tools to get similar jobs done. Do you have a hard time sourcing dbt talent in India, or do you think there's a lot of it there?
As you said, India still relies heavily on Spark, Informatica, and traditional ETL. dbt is picking up, especially with niche companies and new-age tech companies. But I would say I still find some difficulty sourcing talent.
Do you assume you're going to have to train people when you bring them into your team?
I do, but the learning curve is very gentle because there's a lot of documentation available. When somebody new joins the team, there's a four-hour dbt workshop I ask them to go through on their first day.
One of the interesting things about India is that often, and this is not true everywhere, people are worried about their budgets. This makes open-source tools like Spark and dbt more popular. But what's interesting is that I think it's fair to say Informatica is pretty freaking expensive. And yet there's a huge base of Informatica expertise in the country. Who is using Informatica, and how do you square these two things?
When you talk about India's cost sensitivity, that's true. But you need to understand that the developer community in India is, most of the time, working for global companies headquartered in the U.S., Europe, or Australia. India is not the revenue-generating entity, so the decisions about whether to use Informatica or dbt or Snowflake are still made at the headquarters where the company originated. Informatica is expensive, but it's being paid for by headquarters.
Changing gears, I think that you are on record as talking about Iceberg in public settings. Iceberg is an open-table format that has kind of taken off in popularity over the past two years.
What do you think is driving customer interest in Iceberg?
One is interoperability, which leads to the no-vendor-lock-in mindset. This is a fast-evolving ecosystem, right? If customers want to be agile, they're looking for some middle ground where they can think about switching the platform or the processing engine they're currently using.
Open-table formats like Iceberg give you that kind of flexibility. You can store data in the open-table format and use a processing engine like Snowflake or Databricks to process it. You may save money on storage, but it can cost more elsewhere, because you need the expertise to use it and keep it up to date.
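As a hedged sketch of what that looks like in practice: a Snowflake-managed Iceberg table whose data and metadata live as open Iceberg/Parquet files in your own cloud bucket. The table, schema, and external volume names here are hypothetical, and 'my_ext_vol' is assumed to be an external volume created ahead of time.

```sql
-- Hypothetical names; 'my_ext_vol' is an assumed, pre-created external volume
-- pointing at your own cloud storage bucket.
create or replace iceberg table analytics.events_iceberg (
    event_id   string,
    event_type string,
    event_ts   timestamp_ntz
)
    catalog         = 'SNOWFLAKE'      -- Snowflake acts as the Iceberg catalog
    external_volume = 'my_ext_vol'
    base_location   = 'analytics/events';

-- Because the files on disk are plain Iceberg/Parquet, other engines
-- (Spark, Trino, etc.) can read the same data, subject to how the catalog is exposed.
```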
I think most people agree that larger companies are mostly driving this market. The main benefit they hear is flexibility: they won't be stuck with just one platform.
There's a misconception that if you store data in the Iceberg table format, you're done. That's what everyone's talking about. But in fact, storing data in the Iceberg format is only part of the game. The next question is: where's your catalog?
This is where things get complicated. Snowflake made an announcement about Polaris at Snowflake Summit. There are internally managed catalogs, like Snowflake's managed catalog, and then there are externally managed catalogs. I'm curious if you could help us sort out the differences between these, and their advantages and limitations.
You said it right. Storing data in Iceberg format is one thing, but unless you have a catalog, you won't be able to query the latest data or keep track of the latest snapshots and the transactional properties you want to leverage, right?
When the Iceberg table format started getting traction, each platform, Snowflake, Databricks, and so on, started creating its own catalog. And you need to understand what a catalog is not: a catalog does not store the actual data. It is just a pointer to the data, which is stored somewhere in the cloud in Iceberg format. The catalog keeps the pointer to the latest data, the latest files.
You can think of it as metadata; the catalog keeps track of metadata. Now, where do you keep that metadata? One option is to keep it with Snowflake, in the Snowflake Managed Catalog. Then you don't need to worry about the UI, the console, or how you and your team will view the catalog.
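To make the "pointer to the latest metadata" idea concrete, here is a small hedged example against the hypothetical Iceberg table above. It assumes Snowflake's SYSTEM$GET_ICEBERG_TABLE_INFORMATION function, which to my recollection returns the table's current metadata location as JSON.

```sql
-- The catalog's job is essentially to hand you the pointer to the table's
-- latest metadata file. Table name is hypothetical; output shape is approximate.
select system$get_iceberg_table_information('analytics.events_iceberg');
-- Returns something like:
--   {"metadataLocation":"s3://<bucket>/analytics/events/metadata/00003-....metadata.json", ...}
```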
And if you're using the Snowflake Managed Catalog, could you point Athena at it as well, or is it just to be used by Snowflake?
It is just to be used within Snowflake. Only the Snowflake processing engine can query the Snowflake Managed Catalog. That's why Snowflake came up with another concept: open-sourcing the catalog, the externally managed catalog. It's currently in the incubation stage, and it's called Polaris.
If you don't want your catalog managed by Snowflake, you can manage your own. You can take that code base and create and manage your own catalog in your own infrastructure using Polaris's capabilities. But in that case, you need to take care of the wrapper, the front-end UI you want to put in front of Polaris.
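A heavily hedged sketch of what connecting Snowflake to an externally managed, Polaris-style Iceberg REST catalog might look like. The parameter names follow my recollection of Snowflake's catalog integration syntax and should be checked against current documentation; the endpoint, namespace, and credentials are placeholders.

```sql
-- Placeholder URI, namespace, and OAuth credentials; verify parameter names
-- against the current Snowflake docs before using.
create catalog integration polaris_catalog_int
    catalog_source      = iceberg_rest
    table_format        = iceberg
    catalog_namespace   = 'analytics'
    rest_config         = (catalog_uri = 'https://polaris.example.com/api/catalog')
    rest_authentication = (
        type                 = oauth
        oauth_client_id      = '<client_id>'
        oauth_client_secret  = '<client_secret>'
        oauth_allowed_scopes = ('PRINCIPAL_ROLE:ALL')
    )
    enabled = true;

-- An externally managed Iceberg table then references that integration:
-- Snowflake can read it, but the catalog (and any UI in front of it) is yours to run.
create iceberg table analytics.events_ext
    external_volume    = 'my_ext_vol'
    catalog            = 'polaris_catalog_int'
    catalog_table_name = 'events';
```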
I really did appreciate the clarity that came from both Databricks and Snowflake in 2024 standing up on stage and both saying open-table formats are a big deal and we care about Iceberg. I think it's really meaningful that both CEOs got up on stage and said that. I think it's a pretty reliable indication of where the industry is going.
Do you have any expectations on a performance difference when people use Snowflake native storage versus Snowflake managed Polaris?
I think it's very obvious, right? If the data is stored inside Snowflake, native storage will perform faster for obvious reasons. It will always be faster than data stored in Iceberg format under an externally managed catalog.
I think of Snowflake's history with AI and LLMs as having two distinct phases. There's the pre-Sridhar phase and the post-Sridhar phase. And then the post-Sridhar phase is more like Cortex. Do you think that's an appropriate way of thinking about this?
Yeah, definitely. Sridhar comes with a lot of experience in artificial intelligence, especially in semantic search. He's a technologist, through his past work at Google and with his own startup.
The moment Sridhar joined Snowflake, all of a sudden Cortex came into the picture.
The core philosophy of Snowflake is simplicity; it's the way the platform was built. Cortex functions, whether machine-learning-powered functions or LLM functions, are so simple to use, and there is so much excitement within Snowflake about them. That was the shift that happened post-Sridhar: everybody is empowered to use these large language models, not directly, but in the form of SQL functions. And there's a lot of talk about how to expand that and create more.
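For context, a small sketch of what "LLMs in the form of SQL functions" looks like with Cortex. The table and column names are hypothetical, and which models are available varies by account and region.

```sql
-- Hypothetical table/columns; model availability depends on your account and region.
select
    ticket_id,
    snowflake.cortex.sentiment(ticket_text)  as sentiment_score,
    snowflake.cortex.summarize(ticket_text)  as ticket_summary,
    snowflake.cortex.complete(
        'mistral-large',
        concat('Classify this support ticket as billing, bug, or feature request: ',
               ticket_text)
    )                                        as predicted_category
from support.tickets
limit 100;
```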
Since Cortex, have you seen adoption of AI in the Snowflake platform accelerate?
100%. Sometimes I think we are doing too much inside the company. Everyone, not just the data team but also project management and all the other non-technical teams, can write SQL. We still have to figure out internally which use cases will have impact at scale, but we are already using these Cortex functions heavily.
Do you have any dbt pipelines that are just end-to-end Snowflake dynamic tables? I have not personally gone all in on dynamic tables, but I'm curious if you've pushed it hard.
We are right now in a phase where we are moving some of the pipelines that were managed through Airflow DAGs to dynamic tables using dbt. I would say we are not completely there with end-to-end pipelines built on dynamic tables inside dbt, but one of the initiatives we are currently working on is migrating from those Airflow DAGs to dynamic tables in dbt itself. Those are more on the master data management side.
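A minimal sketch of that pattern, assuming dbt-snowflake's dynamic_table materialization; the model name, upstream ref, warehouse, and lag are hypothetical, not the team's actual configuration.

```sql
-- models/marts/mdm/dim_customer_golden.sql (hypothetical names and settings)
{{
    config(
        materialized='dynamic_table',
        target_lag='30 minutes',            -- Snowflake schedules refreshes to meet this lag
        snowflake_warehouse='TRANSFORM_WH'  -- warehouse that runs the refreshes
    )
}}

select
    customer_id,
    max(customer_name) as customer_name,
    max(updated_at)    as last_updated_at
from {{ ref('stg_crm__customers') }}
group by customer_id
```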
It's so interesting. For many of the things we talk about in the industry, there's an underlying consensus that we're fundamentally talking about batch. I didn't say Airflow, but you added it to the conversation. That makes sense, because if you put everything in dynamic tables, then suddenly there's no orchestration.
Have you spent much time thinking about how you provide observability? What happens if it fails?
Yeah, you're right. That's why I get very practical, and I tell my team the same thing. If someone comes to you because real time or streaming is new and sounds great, ask: do you really need it? What's the end use? Who is consuming that data? Do they really need it in real time? If there is no business impact, keep things in batch mode, because you can observe batch very well and it's more stable. I'll be very transparent: we have not built anything concrete to observe data processing with dynamic tables.
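One possible starting point, not something the team described building, would be Snowflake's refresh-history table function for dynamic tables. This is a hedged sketch; the function and column names are from my recollection and worth checking against the docs.

```sql
-- Surface recent dynamic table refreshes that did not succeed.
select
    name,
    state,               -- e.g. SUCCEEDED / FAILED
    refresh_start_time,
    refresh_end_time
from table(information_schema.dynamic_table_refresh_history())
where state != 'SUCCEEDED'
order by refresh_start_time desc;
```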
I don't think you're alone there. We as an industry are still early.
Let me ask you a question to close out the podcast. What is something that you hope is true of the data industry over the next five years?
Data literacy is increasing, especially with the democratization of LLMs. People are taking data seriously, and a lot of tools are evolving very fast. If you can interact with data using copilots, that puts a lot of focus on data-driven products and the data industry. That's why I'm very, very hopeful about the next five years.
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.