Data Activation. Counting Users. Orchestration.
Plus: Steampipe. How data creates value. Non-binary truth states (!?). More.
New podcast episode! In it, Julia and I speak with Katie Bauer. Katie was a founding member of Reddit’s data science team and currently leads Twitter’s infrastructure data science and analytics organization as a Data Science Manager.
This week I’m skipping the big thinkpiece and getting right to the good stuff. Let’s start with two amazing catalogs of resources!
This fantastic repo by Roni Kobrosly contains an index of data leadership articles, complete with links and summaries. Several are from folks in the analytics engineering community and many are from the wider data ecosystem.
This new Amplify Partners resource curated by Emilie Schario has a massive remit and delivers, collecting the best resources across six categories.
I mean…should I just end there? Between both of these, you have roughly 100 articles to read, so get to it!
Not that you have more time after that, but here’s what I’ve been reading of late.
Sarah writes about data activation with a pragmatic bent:
Snowflake’s releasing new apps, huh? Will data activation/reverse ETL/[your name of choice here] become redundant?
Her answer (delivered in metaphor): there will always be a mix. Even if Snowflake native apps take off, we’re not about to stop moving data from the warehouse to operational systems. I agree, at least in the current decade.
Pedram writes about counting users! The post tackles the complexity involved in identifying anonymous users (let’s do user stitching with logged-in users next!) and walks through writing the SQL to arrive at a reasonable user count.
What an absolutely lovely post—it’s maybe the single highest-fidelity representation I’ve seen of what it feels like to learn analytics in the modern data stack…the domain-specific nuts and bolts, not just the technology. Pedram (and others!): I desperately love this format and I know the ecosystem would value more posts like this!
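If you want a taste before clicking through, here’s a minimal sketch of the flavor of SQL involved. This is mine, not Pedram’s: it assumes a hypothetical `events` table with `anonymous_id`, `user_id`, and `event_timestamp` columns, written in Snowflake-flavored SQL.

```sql
-- Count distinct active users over the last 30 days, preferring the
-- logged-in user_id when we have one and falling back to the
-- device-level anonymous_id otherwise.
-- The events table and its column names are illustrative assumptions.
with resolved as (

    select
        coalesce(user_id, anonymous_id) as resolved_user_id
    from events
    where event_timestamp >= dateadd('day', -30, current_date)

)

select count(distinct resolved_user_id) as active_users_30d
from resolved;
```

The part that makes this genuinely hard is that the same human can show up under several anonymous_ids before ever logging in; that’s the user stitching problem hinted at above, and it’s why counting users is trickier than it sounds.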
Stephen writes about how traditional orchestration doesn’t translate well into today’s data context. My favorite bit:
Airflow was never intended to be a heterogeneous platform for decentralized DAGs. It is a job scheduling and processing engine: it takes a single team’s workload and orchestrates it on a schedule, akin to a subway system.
The job of today’s data engineers is more akin to managing the entire transportation network — subways, sure, but also streets, buses, bike lanes. When the growth team — those assholes — drop 1000 scooters on the streets overnight, data engineers have to ensure they don’t cause accidents or get people killed. That is the new job.
I really agree with this!! The message isn’t “Airflow is [good/bad].” The message is: the world has changed and we have different problems now…problems that Airflow was never trying to solve. How do we solve those problems!?
What if there were a way of reading from APIs that abstracted away all the low-level grunt work and worked the same way everywhere? Good news! That is exactly what Steampipe does. It’s a tool that exposes REST APIs as SQL tables, translating your queries into API calls behind the scenes. Here are three examples of questions that you can ask and answer using Steampipe.
Maybe you’ve heard of Steampipe before…I hadn’t. The third example in this post, where they join data from Twitter and GitHub, is especially cool!
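In case you haven’t seen it in action, here’s a minimal sketch of the experience (not copied from the post). The `github_my_repository` table is real in Steampipe’s GitHub plugin, but treat the star-count column name as an assumption; it has changed across plugin versions.

```sql
-- Steampipe exposes the GitHub API as Postgres tables, so "list my
-- repos by stars" is just SQL. Run `steampipe plugin install github`
-- first, then `steampipe query`. Column names may vary by plugin
-- version; use .inspect in the Steampipe shell to check your schema.
select
    full_name,
    stargazer_count
from github_my_repository
order by stargazer_count desc
limit 10;
```

No pagination loops, no JSON wrangling, no auth boilerplate beyond a token in the plugin config; that’s the pitch.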
Benn writes about exactly how being data-driven as a company creates value. It’s a more nuanced take than I was ready for, and I was honestly struck by it. I think there is a tremendous amount right about this paragraph (where “count of the deck” refers to card counting in blackjack):
[Data’s] constant presence in an organization is like knowing the count of the deck. Though it makes us a bit more informed in each decision, the effect is only felt in the aggregate, as the small edge compounds over time.
Overall, I couldn’t agree more, and the card counting metaphor really helped me make a mental breakthrough here. If the above paragraph didn’t make a lot of sense and you haven’t read Bringing Down the House, the post explains the blackjack metaphor in detail and is worth reading.
Also: Benn’s most recent post summarizes an open strategic question in the data tooling ecosystem with acuity:
The biggest looming battle, however, will be over a different territory: The brain—or operating system—of the data stack.
Nothing like this exists yet. Nothing can tell us about the various activities that are bouncing around in our data tools, much less coordinate and manage that activity. Nobody owns the logic that orchestrates data services, or governs how different products talk to one another.
This is certainly true (and effectively summarizes so much user pain happening today), but I honestly don’t know if it’s automatically the financial windfall that Benn seems to assume it is. The winner of the actual computer operating system category is (in some weird twist of fate) not a commercial product at all: it’s Linux. The world has a funny way of defying prediction on a sufficiently long time horizon.
I certainly do agree, though, that this problem will not remain unsolved.
Max Roser is consistently both thought-provoking and inspiring. His most recent post doesn’t disappoint:
It’s hard to resist falling for only one of these perspectives. But to see that a better world is possible we need to see that both are true at the same time: the world is awful and the world is much better.
Holding multiple ideas in my head that appear at first glance to contradict one another is one of the most important mental tools I learned in my thirties. Previously I believed that truth had a rather more binary nature; now I feel comfortable simply staying curious without needing to collapse to a binary truth state. I think this realization is one of the hardest unlocks required on the journey to becoming a great data professional. Epistemic humility is initially very uncomfortable.
While it’s fairly easy to see that statements like these can all be true when presented side by side, many ideas that seem to be in tension with one another require you to sit with them for days, weeks, or longer in order to untangle. That’s ok!
I don’t link to a lot of arXiv papers anymore, but this one was just a lot of fun.
Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that address two fundamental questions regarding Intelligence: what to learn and how to learn, respectively. We believe the two principles are the cornerstones for the emergence of Intelligence, artificial or natural. While these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable ways.