The Analytics Engineering Roundup
Coalesce. Data Contracts. The Semantic Layer.
Working more than you mean to.
Are you coming to Coalesce this year? After having to cancel the in-person part of the event in 2020 and 2021, I am incredibly excited about meeting the dbt community in the flesh in 2022! The event has four modalities: online, a hub in New Orleans, and two satellites in London and Sydney. IMO this is the future of large-scale, community-based events. Local ecosystems have in-person experiences to maximize relationship-building opportunities, while a first-class online experience enables the entire community to participate.
I’m headed to New Orleans. Coming? Make sure to sign up soon—there are actually not that many tickets available.
Enjoy the issue!
Chad Sanderson writes about data contracts, a topic near and dear to my heart. To be honest, I didn’t fully grok Chad’s perspective until recently because I have never worked in an environment where these ideas were put into practice. Fortunately, I had the opportunity to spend 30 minutes with him and ask him all about it.
The big unlock for me was his recommendation to never sync data directly from a database. Instead, build an API for any data that you want to sync into your data warehouse and extract all data via this API. This layer of abstraction gives the software engineers building the product a compatibility layer: a place where they can test for and prevent regressions. Connecting directly to the database is too low level. It leaves product builders with no control over what downstream data consumers see and no way to apply any QC to it.
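To make the "test for and prevent regressions" idea concrete, here is a minimal sketch of what a contract check could look like. It is not Chad’s implementation or any particular tool’s API. The `Contract` class, the `breaking_changes` function, and the `orders` fields are all hypothetical names invented for illustration. The assumption is that a contract is just a map of field names to types, and that removing a field or changing its type is breaking while adding a field is not.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    """A hypothetical contract: field name -> type name exposed by the API."""
    version: int
    fields: dict


def breaking_changes(old: Contract, new: Contract) -> list:
    """List changes in `new` that would break consumers relying on `old`."""
    problems = []
    for name, type_name in old.fields.items():
        if name not in new.fields:
            problems.append(f"removed field: {name}")
        elif new.fields[name] != type_name:
            problems.append(
                f"type change on {name}: {type_name} -> {new.fields[name]}"
            )
    # Fields that exist only in `new` are additive, so they are not flagged.
    return problems


# Illustrative data only: v2 changes a type and adds a field.
orders_v1 = Contract(version=1, fields={"order_id": "int", "amount": "float"})
orders_v2 = Contract(
    version=2,
    fields={"order_id": "str", "amount": "float", "currency": "str"},
)

print(breaking_changes(orders_v1, orders_v2))
```

A check like this could run in the product team’s CI, so a breaking change is caught before it ships. That kind of control is hard to get when the warehouse reads straight from the database.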
The approach that I recommended a month or so ago in this newsletter was similar in that it focused on using contracts as a way of building systems that don’t break things, but my ideation around how to implement this in practice differed. Honestly, I’m not so attached to a specific approach—what matters to me is the identification of the problem and alignment on the types of guarantees we need our systems to provide us.
Ananth @ Data Engineering Weekly is also interested in contracts, recently releasing Schemata.
Erica writes about working too much…or maybe more accurately, about the factors that make one choose to work too much. It’s a fantastic post, and one that I deeply identified with. I know a lot of data folks for whom this applies.
My kids are now 4 and 2 and I work every day from 7:30-5pm, with a hard stop from 5-7 for family time. If I really need to, I’ll do more in the evening. But often, the very act of closing my laptop at 5pm (and knowing that I have to do it) acts as a bit of closure for the day. Previously I would often work 12-13 hour days without thinking too much of it. But honestly with the switch four years ago I don’t observe such a huge difference in my output. Focus works wonders.
What are the choices you make around how you spend your time? I don’t think there is a correct answer—if you want to work all the time then by all means do it—but I think you should be intentional about it. Don’t lie to yourself that next week will somehow magically get better.
JP Monteiro writes potentially the most sophisticated piece on the semantic layer I’ve read. It’s long. It’s detailed. And it poses such fantastic questions that we just don’t yet know the answer to. By far my favorite: “Who will be the owner of the semantic layer: business teams or data teams?”
Ideally, business should own the definitions. For some reason we don’t think this is likely. Defining metrics is a very precise endeavour. The issue is when we mix technical complexity with business complexity. Semantic layers should allow us to focus on business complexity only and they should enable business people to define their own metrics, their own concepts. (Call me naive, go ahead.)
This feels to me to be the same kind of naiveté that said data analysts could build first-class data pipelines! Which is to say: the good kind. Just as we shouldn’t underestimate SQL users, we also shouldn’t underestimate those with deeply technical knowledge of business concepts. The best answer to “who should own X thing?” is typically “the person who is most incentivized to want it to be correct.” This raises a question that JP didn’t pose but that I will here: what is the ideal business-user-focused authoring experience for a metric? There is lots of room for UX experimentation here. My guess is that most metrics will not be defined by editing a YAML file over the long term, although they will likely compile to YAML.
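As a rough sketch of "authored somewhere else, compiled to YAML": imagine the business user fills in a form, and the tooling renders the file. Everything here is an assumption made for illustration. `MetricForm`, `compile_to_yaml`, and the YAML keys are hypothetical and do not follow any real tool’s metric specification.

```python
from dataclasses import dataclass


@dataclass
class MetricForm:
    """Values a business user might enter in a form (all names hypothetical)."""
    name: str
    label: str
    calculation: str  # e.g. "sum"
    column: str       # column the calculation applies to
    source: str       # table or model the column lives in


def compile_to_yaml(form: MetricForm) -> str:
    """Render the form as YAML-style text with illustrative keys only."""
    lines = [
        "metrics:",
        f"  - name: {form.name}",
        f'    label: "{form.label}"',
        f"    calculation: {form.calculation}",
        f"    column: {form.column}",
        f"    source: {form.source}",
    ]
    return "\n".join(lines) + "\n"


revenue = MetricForm(
    name="revenue",
    label="Total Revenue",
    calculation="sum",
    column="amount",
    source="orders",
)
print(compile_to_yaml(revenue))
```

The point is only that the YAML becomes a build artifact. The authoring surface can be whatever suits the person who most wants the definition to be correct.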
The rest of the post is not to be missed. I remain incredibly excited about the potential of the semantic layer but will hold back from saying more for the moment(!).
[edit: I originally linked to a post here that, upon hearing feedback from the community, turns out to have been a poor editorial choice. If you’ve followed along, read more here.]
David’s most recent post talks about one of these up-and-comers—Firebolt—and gets into the cost issue as well. It rightly focuses on the challenges in using Firebolt as an analytics engineer, where incremental models (hard to live without!) are basically non-functional given the lack of either delete or merge operations. 🤷 Redshift didn’t use to support merge either.
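To see why incremental models lean so heavily on merge, here is a small in-memory sketch. It is not Firebolt’s or dbt’s behavior, just plain Python lists standing in for tables, with hypothetical function names (`merge`, `append_only`, `latest_per_key`) and made-up order rows. It contrasts an upsert with the append-only fallback, where duplicates pile up and have to be resolved at read time.

```python
def merge(table, new_rows, key):
    """Upsert: replace rows whose key already exists, append the rest."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row
    return list(by_key.values())


def append_only(table, new_rows):
    """All that is available without delete or merge: add rows, never change them."""
    return table + new_rows


def latest_per_key(table, key, version):
    """Read-time dedup: keep the row with the highest version for each key."""
    latest = {}
    for row in table:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Illustrative data: order 2 changes status, order 3 is new.
existing = [
    {"id": 1, "status": "placed", "updated_at": 1},
    {"id": 2, "status": "placed", "updated_at": 1},
]
batch = [
    {"id": 2, "status": "shipped", "updated_at": 2},
    {"id": 3, "status": "placed", "updated_at": 2},
]

merged = merge(existing, batch, "id")
appended = append_only(existing, batch)

print(len(merged), len(appended))
print(latest_per_key(appended, "id", "updated_at") == merged)
```

The append-only table can still answer the same question, but every reader pays for the dedup, and the table keeps growing with stale versions of each row.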
In general, I’m excited about the difference in tradeoffs that Firebolt (and others) bring to the table even if it adds some level of complexity for analytics engineers. We will need to learn to model data very differently in, e.g., Firebolt vs. Snowflake vs. Materialize. This is one of the reasons I feel that having some abstracted “best practice” that is assumed to be true for all use cases on all underlying warehouses doesn’t make sense. Sure, some high-level principles should be consistent, but decisions about how to shape your schemas will depend very significantly on use case and platform.