Discover more from The Analytics Engineering Roundup
Ep 25: Automating Away your Work w/ Configuration-as-Code (w/ Sarah Krasnik)
Sarah's a huge fan of Terraform + other config-as-code methods, and maybe you should be too.
Most recently leading a data engineering team at Perpay, Sarah has built and managed data platforms end to end by working closely with internal engineering, product, and operational teams. She recently left her role to pursue a wide variety of endeavors, including writing on her Substack.
In this conversation with Tristan and Julia, Sarah dives into how configuration-as-code can automate away data work, why you might want to consider adding a data lake to your architecture, and how those looking to build a self-serve data culture can look to self-serve frozen yogurt shops for inspiration.
Key points from Sarah in this episode:
Tell us more about the process of moving off the data warehouse and completely having a different architecture for the storage layer.
Yeah, it definitely was a lot of work. I definitely pulled in a lot of people at different stages of the process. Throughout this whole thing is where I became very emphatic about Terraform, because I was trying to build this entire architecture, but I knew that I would have to do it in both a staging environment and a production environment. I didn't want to go into the weeds with AWS IAM roles and security groups twice; I just wanted to do it once and have it figured out.
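The "define it once, instantiate it per environment" idea behind Terraform can be sketched in plain Python. This is not Terraform itself, just an illustration of the pattern; all the resource names here are hypothetical.

```python
# Sketch of config-as-code: one definition of the platform's resources,
# parameterized by environment, instead of hand-building staging and
# production separately. All names below are hypothetical examples.
def platform_config(env: str) -> dict:
    """Build the same infrastructure description for any environment."""
    return {
        "iam_role": f"data-platform-{env}",
        "security_group": f"warehouse-sg-{env}",
        "s3_bucket": f"acme-data-lake-{env}",  # hypothetical bucket naming
        "allow_public_access": False,          # same security policy everywhere
    }

staging = platform_config("staging")
production = platform_config("production")

print(staging["iam_role"])      # data-platform-staging
print(production["s3_bucket"])  # acme-data-lake-production
```

Terraform does the same thing with variables and workspaces: the IAM roles and security groups are written once, and each environment is just a different set of inputs.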
So, we definitely worked with engineering a lot to actually do the migration. But at a high level, thinking about data warehouses and data lakes, I think part of having a data lake is also actually having a warehouse. Just because we've transitioned to this doesn't mean that we don't use a warehouse. We do; we use Redshift right now. But the point is that we have data in S3, which, for those who don't know, is basically object storage. It's like a file system within AWS, but it's very cheap to store files, to store data, to really store whatever you want in there.
So, with S3, we can dump data in once and then have multiple things pulling from that same information. I almost relate it to how you think about the metrics layer: you have this intermediate layer where you're exposing things in one way, in one place, and then multiple other resources pull from that layer.
And so to me, S3 kind of serves as that place. We have a data warehouse that is backed by S3; some of the data we might transform or duplicate and make hot, to expose to analysts in SQL and things like that. But we can also query the data in S3 directly. And, in a third way, we can access S3 without the warehouse at all. That's really where the processing that isn't batch, that isn't the traditional SQL processing we think about, comes in. And I would say that is what hopefully will open us up to doing more real-time applications.
What are the different types of data stores? And how do you create your buckets?
Yeah. I'm actually just gonna take one sliver of that and go into the bucket that powers some of the data warehouse. The way we've organized it is, I think, very intuitive, and I would hope people would agree: it's just one bucket powering the backend of the warehouse (and also the data lake; we're kind of using the terms interchangeably here a little bit).
But within that bucket, one folder is one schema. Then within each of those folders, there's a subfolder that's a table. And then within each one of those folders is how the table is partitioned, right? So if you have some sort of event data, it might be partitioned by date, so that if you query only the last 30 days, it's going to be a lot faster than if you scan the entire table.
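That bucket layout can be made concrete with a small sketch. The schema, table, and partition names below are hypothetical; the point is the `schema/table/partition` key structure Sarah describes.

```python
from datetime import date, timedelta

def partition_key(schema: str, table: str, day: date) -> str:
    """Build an S3 key prefix following the schema/table/date-partition layout."""
    return f"{schema}/{table}/date={day.isoformat()}/"

def last_n_days_prefixes(schema: str, table: str, n: int, today: date) -> list[str]:
    """Only the prefixes a 30-day query would touch; the rest is never scanned."""
    return [partition_key(schema, table, today - timedelta(days=i)) for i in range(n)]

print(partition_key("events", "page_views", date(2022, 1, 31)))
# events/page_views/date=2022-01-31/

# A query over the last 30 days only lists 30 prefixes, not the whole table.
prefixes = last_n_days_prefixes("events", "page_views", 30, date(2022, 1, 31))
print(len(prefixes))  # 30
```

Query engines like Redshift Spectrum or Athena use exactly this kind of partition pruning: a date filter in the SQL maps to a small set of key prefixes, so unrelated partitions are never read.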
You said that the first thing we had to think about is the environment. What happens in each environment of your setup?
Yeah, I think environments are a very important concept that started on engineering teams, but really a concept that's relevant to engineering, analytics, and honestly even business and operational teams.
So, I'll start with development. The way engineering teams think about it is local development: in theory, you could be disconnected from the internet and still build some sort of application or do something locally. Here, the analog of the developer is the analytics engineer, data engineer, or just an analytics person who's writing code or doing something: creating a GitHub branch, maybe scripting some stuff, architecting, running pieces of something to make a change. Really, I think the fundamental characteristic of a development environment is that it's localized and separate from any other environment as much as possible. To make this tangible, I think it's helpful to walk through a specific example. So, let's start with simply updating a SQL query.
In development, this could be running the query to completion in your own schema in the data warehouse and validating the results: looking at whether they are what you expect them to be, maybe running some statistics on the results, just to make sure your query is doing the thing you want it to. This is testing, which I'm sure we'll talk about later in the episode as well.
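Those development-time sanity checks can be as simple as a few assertions over the query's output. The rows and thresholds below are hypothetical stand-ins for whatever the real query returns.

```python
# Stand-in for the output of the modified query, run into a dev schema.
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 40.0},
    {"order_id": 3, "amount": 35.0},
]

amounts = [r["amount"] for r in rows]
avg = sum(amounts) / len(amounts)

# Hypothetical expectations about this data; tune per query.
assert len(rows) > 0, "query returned no rows"
assert all(a >= 0 for a in amounts), "negative amounts are unexpected"
assert 10 <= avg <= 100, f"average {avg} is outside the expected range"
print(f"checks passed: {len(rows)} rows, avg amount {avg:.2f}")
```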
The movement from development to staging is when you think you're ready for production, and what you're trying to do in staging, from my perspective, is verify that the change actually is production-ready. So the purpose here is to run some process end to end, with your change as one sliver of that process. Say the query being modified is used downstream in some other tasks or jobs. Obviously you tested your query end to end and you know what to expect, but it likely doesn't live in a vacuum. Even if no other query is using it, maybe you're writing it because you're trying to update some sort of dashboard or write something else. There needs to be a place to make sure the integrations work, that the edge pieces of your change all fit together. Your output could be exactly what you expect, but changing something as simple as one column from a boolean (true/false) to a zero/one integer, because you want to sum over it in a dashboard, is expected by you but could be unexpected for some other downstream process or for someone else. So this environment is really trying to be production-like, without the impact of production.
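The boolean-to-integer example can be sketched in plain Python. The column name and the downstream consumer here are hypothetical; the point is that a change that is correct for your dashboard can silently break a consumer you didn't know about.

```python
# Hypothetical downstream consumer, written when the column held True/False.
def active_users(rows):
    # Strict boolean check: fine for True/False, silently wrong for 0/1 ints.
    return [r for r in rows if r["is_active"] is True]

before = [{"is_active": True}, {"is_active": False}]
after = [{"is_active": 1}, {"is_active": 0}]  # column changed to 0/1 ints

print(len(active_users(before)))           # 1 -- worked with booleans
print(len(active_users(after)))            # 0 -- silently broken by the change
print(sum(r["is_active"] for r in after))  # 1 -- the dashboard sum still works
```

This is exactly the kind of breakage an end-to-end run in staging catches and a query-level test in development does not.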
So, to jump to that: what really is the impact of production? To me, it's that the dependencies on production actually influence decisions, end users, and customers. Production data powers dashboards, reports, or reverse ETL processes into operational tools; basically, the level of exposure in production is the highest. At this point, production data is extremely visible to people outside of your team, and sometimes even outside of your organization. So you don't want to test here, because again, the exposure is highest. You want to make sure, before any changes are put into production, that everything is going to work as you expected it to.
What does staging really look like for your team?
Yeah. The first thing I'd point out is that, when talking about testing, this is where the concept of staging is most different for analytics and data teams than it is for engineers. For engineers, you can test some user flow and it doesn't matter how many users exist; you just want to make sure your UI looks how you expect it to. But in a data world, when you write tests and expect the average of some column to be within some range, if that's the case in production, you want that to be the case in staging too. Otherwise, if there's a difference in the data, your tests are just going to be off.
So, I am a big proponent of setting up a staging environment that actually has production data reading into it, and then you make adjustments on that. What might data quality look like? Actually, I'm going to answer that question a little bit backward, in terms of what data quality doesn't look like. It doesn't look like missing tests, or only having half the tests in staging, because that's going to result in broken dashboards and broken pipelines. And that's where the exposure in production happens.
So, how do we test in staging? With our entire test suite. We use Great Expectations at Perpay; there are a lot of other tools, like Soda, and obviously dbt tests and all of the integrations there. But the goal is as much parity between staging and production as possible. So first you have your SQL tests, but then you also have as many integration points as possible.
For example, we populate Facebook audiences for marketing, and that needs to be tested. It's outside of the analytics ecosystem, but I would argue the exposure is highest precisely because it's outside of the analytics ecosystem. So what we've done is set up test audiences in Facebook that don't power anything: they don't power ads, and there's no money being spent. But we can send data to them to check two things. The first is that the integration actually works, and the second is that the numbers match the ballpark of what we expect. So if we expect some audience size to be, say, 5,000 or 10,000 people, we can validate those types of things.
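That ballpark check is easy to express in code. This is a minimal sketch, not Perpay's actual test suite; the expected size, synced size, and tolerance are all hypothetical.

```python
def within_ballpark(actual: int, expected: int, tolerance: float = 0.2) -> bool:
    """True if the synced audience size is within +/- tolerance of what we expect."""
    return abs(actual - expected) <= expected * tolerance

# Hypothetical staging check against a Facebook test audience.
expected_size = 10_000
synced_size = 9_400  # size reported back after syncing to the test audience

print(within_ballpark(synced_size, expected_size))  # True  -- close enough
print(within_ballpark(4_000, expected_size))        # False -- something broke
```

Tools like Great Expectations express the same idea declaratively (e.g. "expect this value to be between X and Y"), but the underlying check is this simple.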
What are some really tangible examples of how you're using Terraform in your data stack today?
So our analytics team has really taken the same road as our engineering team, which is defining as much analytics infrastructure in Terraform, as code, as possible, aside from the SaaS tools that we use. For instance, we host Airflow internally, so we self-host, and all of that infrastructure is defined in Terraform: the boxes within AWS, exposing them to the VPN, all of those kinds of connections. That's all defined as code.
Similarly, we copy data from a variety of production databases to either the data lake or the data warehouse. Because we're on AWS, we use their solution, DMS, which stands for Database Migration Service, and really it just copies data from one place to another. All of that configuration (which database can talk to DMS, which warehouse can talk to DMS), making sure all of it is set up correctly, plus the monitoring around it: that is all done in Terraform.
First and foremost, the benefit is really allowing us to manage access and security. That's a huge pain, I would say, definitely in AWS, but in GCP as well. As an engineering organization, we've shifted from hosting infrastructure publicly to hosting it in a private network. What that means is that we have to think even more about security: instead of just saying "any traffic from this IP is okay", we have to establish SSH or some other kind of connection, and that can be difficult in itself.
So, being able to establish one way of doing that and then just copy it, to plug and play in that way: that's where Terraform has really stepped in for us.
Any other big productivity hacks that you've figured out for your team?
Yeah. I actually think this question is very much related to what you put in Terraform and what you do not, right? If my whole job is managing access to our warehouse and I spend an incredible amount of time creating users and all of that, but I could reduce the time there, why would I want to continue doing that? If I can be more efficient, my time can be used elsewhere.
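Warehouse user management is a good example of work that can be driven from code instead of done by hand. This is a minimal sketch, assuming a generic SQL warehouse; the user names, schemas, and exact statement syntax are hypothetical and would vary by warehouse.

```python
# Hypothetical access config: which users exist and which schemas they can read.
USERS = {
    "analyst_jane": ["analytics"],
    "engineer_sam": ["analytics", "raw"],
}

def grant_statements(users: dict[str, list[str]]) -> list[str]:
    """Render the SQL needed to create each user and grant schema access."""
    stmts = []
    for user, schemas in users.items():
        stmts.append(f"CREATE USER IF NOT EXISTS {user};")
        for schema in schemas:
            stmts.append(f"GRANT USAGE ON SCHEMA {schema} TO {user};")
    return stmts

for stmt in grant_statements(USERS):
    print(stmt)
```

Adding a user becomes a one-line config change reviewed in a pull request, rather than a manual session in the warehouse console; Terraform providers for warehouses apply the same idea with state tracking on top.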
So, I definitely do stand by that. If I remember correctly, I wrote this after I read a thread about data engineers being wary of SaaS tools and what the job market is going to look like. My first thought was: if Fivetran does what I used to do three, four, however many years ago, and it's an incredibly successful company, then why would I want to spend all my time maintaining API integrations? It sounds extremely repetitive. I personally bring no unique value to that, so why wouldn't I want to focus on something I can bring value to?
As far as where the wins come in, I would actually say the biggest wins come pretty early. Perpay is, I think, extremely data-literate, and the biggest wins have been pushing for the adoption of some of the tools I mentioned, like Fivetran and Census, to reduce the time to onboard new integrations. But I do think it's important to call out the broader analytics ecosystem. Within it, it's very easy to look at people and companies who have already adopted a lot of these tools, but there are many, many organizations that are way further behind, especially in fields with much larger institutional history, like manufacturing or finance.
When I worked in an analytics organization at a large manufacturing company, which was my first job out of undergrad, leadership was very wary of open source, and that was something we had to think about. They were very wary of open source, very wary of the cloud. So a lot of things were done in-house, and it was not that long ago.
Subsequently, when I worked at OneMain Financial, a leading firm based in Wilmington, Delaware, some teams used Looker, and my team was one of them. That was the affiliate marketing team. Think about how you look at credit cards or loans on sites like Credit Karma versus NerdWallet: my team was responsible for assessing which affiliates were performing well and which weren't, and how we wanted to maximize efficiency and profit in those channels. Part of my team would spend time forecasting loans coming from affiliates: how many loans from each affiliate, how much they would cost us, and so on. And believe it or not, not only was the forecast built in Excel, but the comparison between the forecast and the actuals was also done in Excel. That was reported weekly, but because the forecast was complex, because it reflected complex interactions and macroeconomic systems and all of that, it took almost a week to update. So, doing the math, most of the time was spent updating the actuals for the forecast, just to be able to report to the leadership team on how we were doing.
So, one of my projects was to help that part of my team move this forecast into an automated job that would report actuals versus forecast in Looker instead of Excel. What this enabled some of my coworkers to do was apply their unique understanding of the economy and other things that are much harder to code: how the market was doing, for instance (this was pre-COVID, so stimulus and vacation spending weren't fluctuating as much). Those types of things are much harder to automate, and harder to incorporate into a forecast.
So that's where I think the biggest wins are: honestly, in organizations that are catching up and still trying to onboard these tools that have made our lives a lot easier.