Hi folks! We’ve got some meaty topics to get into this week, so let’s jump straight in!
-Anna
Microsoft Builds the Bomb
Benn Stancil
PLOT TWIST. Microsoft builds a semantic layer deep into its network of Azure data services and takes on the modern data stack. There’s possibly no better validation that an industry is on the right track than when Microsoft builds (or buys) its way into a new tech layer.
Fabric does pretty much what you’d expect, and even promises some AI Copilot goodness in the future. As Benn points out, possibly the most interesting thing among the lineup of features is just how interconnected they are:
Ultimately, however, the most notable thing about Fabric isn’t its features, but—as its name implies—the knitting between them. Fabric isn’t a suite of loosely connected services; it’s a single application that has one login experience, one user interface, one storage layer, one permission model, and one monthly bill.
Whereas each feature has a counterpart elsewhere, this connectivity is unique.
If I’m wearing my budget holder hat — it’s an attractive proposition. I’d ideally like to spend as little time as possible on procurement so everything on one invoice sounds pretty great. When I’m wearing my data practitioner hat, however, vertical integration looks somewhat less shiny.
Specializing in a vendor-locked ecosystem can be risky. As a developer, would I rather invest my time building a career around something that runs on only one platform, like say .NET, or would I rather cut my teeth on something like Java that runs on any desktop platform? Or should I consider full stack JavaScript, which also gets me onto mobile platforms with relatively little context switching? Learning and getting great at a more flexible ecosystem means I have more choices in terms of future jobs and types of companies. Specializing in a vertically integrated ecosystem limits my options to the companies that use that specific platform.
Platform coverage isn’t the only thing to consider here; there’s also the evolution of best practices. Open source ecosystems innovate much more quickly and give a developer exposure to many more new ideas and frameworks. When specializing in one vendor’s ecosystem, you’re limited by the speed with which the vendor brings new ideas into it.
Putting my hiring manager hat back on, whom would I rather hire? Developers who are able to adapt to changing business and technology environments because they have exposure to many different approaches through the open source ecosystem, or folks who are really really good at optimizing a particular platform? Each drives value in different ways, so it really depends on what I need at the time, the kind of organization I work in, and what I’m building.
I don’t think there’s a straightforward answer here, and there’s the rub. Both kinds of ecosystems exist because they’re built for different organizations with different needs. Those different organizations access very different job markets. I imagine this is a pretty big reason Microsoft’s data stack plays nicely with Spark and other open source standards even when there’s a native solution — the market demands both.
So it’s not quite time to pack up and go home. There’s a good reason composable ecosystems continue to exist (and thrive) alongside fully vertically integrated systems — the pace of innovation, the choice for developers of where to work and how to work, and ultimately, how and when to bring that innovation into your organization.
We’re back to regularly scheduled programming then — the search for the MDS potato 🥔. I think I preferred the headless Marie Antoinette metaphors when referring to headless BI and the data OS, but I also really like potatoes, so let’s go with it ;)
After all, the dream of a modular and integrated modern stack isn’t impossible; it just needs, with all due respect to Douglas Adams, a potato. In Fabric, Microsoft built a potato and a bunch of facial features.
Creating Harmony: Collaborative Database Schema Design for Better Data Teamwork
Greg Meyer
This entire series by Greg is a must-read for anyone who operates with both a data team and a go-to-market operations function. This week’s installment is especially important reading because it covers tactical data model design decisions like this one:
Product data is 1st party data generated by users when they engage with your product. This data is voluminous – you might have hundreds or thousands of events happening every minute – so you need to be able to insert it into a table as fast as possible. Whether you are setting up dimension tables and adding a new value or counting the number of events for a person, product data is critical data.
Having massive and slow event data jobs is a very easy trap to fall into — you might start out with a simple event log that captures everything you need in one place and appreciate the ease of access. But as you collect more data, this process becomes more and more critical, and often slower and slower.
Our internal data team is doing some work in this area right now. The approach they’re taking is to split events out by type and run separate jobs to model and transform each resulting dataset, which lets us tweak how frequently we want each one to update in our systems.
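To make that concrete, here’s a minimal sketch of the pattern in dbt-style SQL — not our team’s actual code, and all source, model, and column names are hypothetical. The idea is simply one staging model per event type, each of which can then be scheduled and materialized on its own cadence:

```sql
-- models/staging/stg_page_views.sql
-- One model per event type, filtered out of the single raw event log.
-- A sibling model (e.g. stg_signups.sql) would filter a different type
-- and could be refreshed on a different schedule.
select
    user_id,
    event_timestamp,
    payload
from {{ source('product', 'raw_events') }}
where event_type = 'page_view'
```

Downstream jobs then depend only on the event types they need, so a heavy, high-volume event type no longer slows down the refresh of everything else.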
The rest of the article has more solid takeaways like this one. I know I’ll be coming back to it again and again.
Don’t do it for the glory
Katie Bauer
Did I mention already how excited I am that Katie now has a substack? Her post this week hits very very very close to home. It’s about what it feels like to be a data leader day to day — spoiler alert: it’s not magic, it’s often pretty frustrating, and you should choose to do it for the right reasons.
I appreciate her candid take that is layered with her own motivations and personal experiences in the role over time.
My motivation for first becoming a data manager was to try to scale myself — to do more of the things I thought were important in an organization than I could do alone. And I was already doing the work of a manager as an IC, so it seemed appropriate to recognize that somehow. But if I’m honest, wanting to be taken more seriously did play a part in the decision, much as Katie alludes.
If you are or have been a data manager, what has motivated you to take up this mantle?
Something else that really resonated from Katie’s article:
You also need to be willing to change your mind about what your role is and what your team should do. You don’t need to start completely from scratch, but you should pay attention to how your company reacts to the way you run your team. It can be hard to get verbal feedback on what you could do better, but you can get a pretty regular signal on what works by paying attention to what the rest of the company notices and engages with. It doesn’t mean you need to shift everything your team does to being about those specific activities, but it does mean you should think about why they resonate. If that’s what works, why? How can you make more of your team’s activities like that?
I’ve come to believe that a big part of why data teams are all configured a little differently, live in different parts of their respective organizations, and have slightly different scopes and mandates is that this is necessary to adapt to the business environment. I’ve written before about the ways your data team is a reflection of your business’s stage of development and the distinct problems you need to solve in your org structure. I think that flexibility is, for better or worse, imperative — the data team is the glue that unites a lot of business processes. As the business evolves, so does the structure of a data team: embedded vs. centralized, heavier on modeling and engineering or on analytics, etc.
Katie’s quote is a reminder that it’s also important to make more micro adjustments towards what is working, and be cognizant of this at different layers of the organization.
How to use dbt snapshots
Madison Mae
Last but not least, a neat dbt snapshots primer from Madison Mae. I highlight her post in part because (as she points out) snapshot functionality is a bit of an unsung hero — extremely useful, but a somewhat different mental model than a perfectly idempotent data model.
Occasionally I hear objections to snapshots because of reliability concerns — snapshots can break if upstream production data changes. Because of this, it’s important to implement tests that alert you to a failure quickly. But that’s not a reason to avoid snapshots entirely. Keeping a record of important business data is always useful, and you should trade off the complexity against the value you get — not everything needs a snapshot, but if a table is important enough to your business, at some point you’re going to wish you had a snapshot, and by then it’ll be too late.
Another important note is that snapshots and event logs address different problems: snapshots tell you exactly what the application database sees at a particular moment in time, whereas events log that an activity has occurred. If you need to check that your events are recording things correctly, you’ll do it against a snapshot. There are lots of other uses, too!
In short, use snapshots. At some point in the future, you’ll be very glad you did!
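If you’ve never set one up, a dbt snapshot is just a small block of SQL plus config. Here’s a minimal sketch using dbt’s timestamp strategy — the source, schema, and column names are hypothetical, so adapt them to your project:

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Snapshot the current state of the customers table on each run.
select * from {{ source('app', 'customers') }}

{% endsnapshot %}
```

On each run, dbt compares incoming rows to the stored ones and maintains `dbt_valid_from` / `dbt_valid_to` columns, so you can later query what any row looked like at a given point in time.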
That’s all from me for this week folks! Have a lovely weekend 👋