Digesting Coalesce 2022. Cloud 2.0.

9 Predictions for Data in 2023. A Survey on Catalogs.

Nov 06, 2022

Ummm…wow. What a month! It was so fantastic to see so many of you in person in NOLA at Coalesce 2022! After my keynote on Tuesday I got to take off my CEO hat and just hang out with community members for the next 3 days…just so incredibly rewarding. I met so many people with whom I had previously had online-only relationships (always a funny and rewarding experience!) and got to go deep with a lot of users.

Here are a couple of thoughts that have been stewing in my brain since then:

1. There are a LOT of companies that are now 1-2-3+ years into their journey with dbt. We can see this show up in our anonymized data as well: many more projects are larger and more sophisticated than ever. I love this: it both presents new problem surface area and is truly an indication that this thing matters to people. Supporting increasing maturity / complexity is the kind of product challenge I want to be thinking about.
2. As a corollary to #1, there are far more pure-play software engineers involved with dbt than there have been in the past. Companies are building high-quality tooling where none yet exists off-the-shelf. dbt is now a critical business application and gets real attention from internal technical resources.

—

Eric Weber writes about data products in the current macro environment. I think the specifics around data products are interesting, but really what I’m here for is the conversation around very clearly defining value and priorities in the context of constrained resources. These lessons from the post generalize well for all data work.

One of my favorite ideas introduced here is: make sure that downstream teams actually experience a cost to the work that they request from the data team. While it’s certainly possible to take this too far, it is challenging to effectively prioritize with a partner who doesn’t experience any costs associated with their requests.

—

Erik Bernhardsson wrote about why we are still early with the cloud. Typically when folks ask “how far do you think we are into the cloud transition?” what they mean is: “what % of all workloads have been migrated from on-prem to cloud?” This post is so much more interesting and creative than that—its core contention is that we’re really only at “Cloud 1.0” (or maybe 1.5!) and that we haven’t truly reshaped our thinking to a cloud-native environment yet.

My favorite bit:

Here's a random assortment of things I feel like we should have, if the cloud had truly delivered. But we don't:
When I compile code, I want to fire up 1000 serverless container and compile tiny parts of my code in parallel.
When I run tests, I want to parallelize all of them. Or define a grid with a 1000 combinations of parameters, or whatever.
I never ever again want to think about IP rules. I want to tell the cloud to connect service A and B!
Why is Bob in the ops team sending the engineers a bunch of shell commands they need to run to update their dev environment to support the latest Frobnicator version? For the third time this month?
Why do I need to SSH into a CI runner to debug some test failure that I can't repro locally?

:nods-enthusiastically:

—

Tom Tunguz, one of the most astute investor-observers of the data market, published 9 Predictions for Data in 2023. It’s good, it’s to-the-point, and it’s exactly the right blend of non-obvious and clearly-correct. One of my favorites:

Metrics layers will unify the data stack.

🙌

Aside from the shoutout to metrics, the best part of this post is that it brings into the conversation ideas that aren’t even on most data folks’ minds when they think about the future. WASM and DuckDB and their interaction. Notebook users as a % of Excel users. Warehouses going HTAP (even if not directly named…this is the driver of the SaaS bullet point). Lots of threads getting woven together.

—

Ok, sorry, I have to link to another fantastic post on data contracts. This one is by Yali Sassoon of Snowplow Analytics, and in it he advocates that the natural interface on which to define a contract is the event. Rather than comment on it myself, I’ll leave it to others to do that—I was so excited to see that one of my favorite observers of the larger software engineering ecosystem, Charity Majors, wrote a whole thread on it:

Charity Majors @mipsytipsy

The idea is that each dataset should have a data contract, consisting of a schema plus any SLAs, semantics, policies etc and a version id. This may strike you as a self-evidently good idea, but the article says it seems to be hotly debated amongst data engineers.

Charity Majors @mipsytipsy

The reason seems to be a clash between two different mental models; one views data as essentially extractive ("data as oil" was new to me), the other emphasizes *creating* datasets instead of simply loading them.

These are just the first two tweets. The overall article is fantastic, but if you only have 30 seconds then click through to CM’s thread as it does a great job of summarizing.

As more and more of the world moves towards microservices and event-driven architectures, it’s not unreasonable to think that these streams will become ever-more-frequently the source of the data we analyze rather than “data as oil” approaches. This doesn’t answer the Salesforce question, but it’s a productive iteration on the overall conversation.
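To make the contract idea from the thread a bit more concrete, here is a minimal sketch of what "a schema plus any SLAs … and a version id" could look like when attached to a single event type. Everything in it is an assumption for illustration: the `DataContract` class, the `page_view` event, its field names, and the `max_delay_seconds` SLA are all hypothetical, and this is not how Snowplow or any particular tool implements contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract for one event type: schema, SLA, and version id."""
    name: str
    version: str
    schema: dict                    # field name -> expected Python type
    max_delay_seconds: int = 3600   # illustrative freshness SLA (metadata only here)

    def violations(self, event: dict) -> list:
        """Return a list of problems; an empty list means the event conforms."""
        problems = []
        # Every field promised by the schema must be present with the right type.
        for field_name, expected_type in self.schema.items():
            if field_name not in event:
                problems.append(f"missing field: {field_name}")
            elif not isinstance(event[field_name], expected_type):
                problems.append(
                    f"wrong type for {field_name}: expected {expected_type.__name__}"
                )
        # Fields the contract does not know about are also flagged.
        for field_name in event:
            if field_name not in self.schema:
                problems.append(f"unexpected field: {field_name}")
        return problems


# Hypothetical contract and two sample events.
page_view_v1 = DataContract(
    name="page_view",
    version="1.0.0",
    schema={"user_id": str, "url": str, "timestamp": float},
)

good_event = {"user_id": "u1", "url": "/home", "timestamp": 1667692800.0}
bad_event = {"user_id": 42, "url": "/home"}

print(page_view_v1.violations(good_event))
print(page_view_v1.violations(bad_event))
```

The point of the sketch is only that the producer of the event, not the downstream analyst, is the one who declares the schema and bumps the version when it changes.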

—

The Analytics Engineering Roundup
