After just shy of six years and just over 250 issues of the Data Science Roundup, it’s officially time to change the name. In the early days this newsletter was my opportunity, my excuse, to hoover up new knowledge on a fast-moving data ecosystem. And the thing at the time (2015) that seemed to be calling me (and everyone else in the industry) was data science. The early issues were a journey of me learning the data science basics—everything from statistical methods to basic technical how-tos to “how I broke into data science” posts.
All of this was well and good, especially for an ecosystem that was just coming into its own. This newsletter made some “best of” lists and provided some value to new practitioners who wanted to learn the basics of the field. I certainly learned a ton in the process.
But the data ecosystem and my own area of expertise have both evolved. I’m still interested in data science and cover it from time to time (especially well-operationalized experimentation programs!). But my primary area of interest (and the work I do every day) is a new category that I’ve personally helped bring to life: analytics engineering. If you’ve been a long-time reader, you’ve seen me shift in this direction over the years. Fewer posts on neural network architectures and more posts on disseminating knowledge to humans via data catalogs. A deep fascination with the competitive landscape of cloud data platforms. A real curiosity about “metrics platforms.”
If analytics engineering is a new term for you, the fine folks at Fishtown Analytics (where I’m the CEO) wrote a guide on the topic. The TL;DR for me comes down to creating and disseminating knowledge within organizations. Decomposed into a bullet-point list, you can think of the primary components of analytics engineering (as practiced today) as:
Creating an internal data platform. Often this means ingesting all organizational data to a single cloud data store.
Transforming the raw data inside of that platform into meaningful business concepts, often through the practice of dimensional modeling.
Assuring the quality, reliability, and timeliness of the data in this platform via automated systems.
Building or buying tooling to help users of various personas interrogate the data in the data platform to answer their questions about the state of the world.
Conducting enablement sessions to help users self-serve throughout the parts of this process where their participation is important.
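To make the third bullet a bit more concrete, here’s a minimal sketch of what automated data quality checks can look like. This is my own illustration, not from the guide: it uses Python’s built-in sqlite3 as a stand-in for a cloud data store, and the `orders` table and its columns are hypothetical. Real teams would reach for purpose-built tooling (dbt tests and the like), but the underlying idea is this simple.

```python
import sqlite3

# Stand-in for a warehouse table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 20.0), (2, 102, 35.5), (3, 101, 12.25)],
)

def check_not_null(conn, table, column):
    """Pass only if no row has a NULL in `column`."""
    (nulls,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()
    return nulls == 0

def check_unique(conn, table, column):
    """Pass only if `column` contains no duplicate values."""
    (dupes,) = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return dupes == 0

checks = [
    ("orders.order_id not null", check_not_null(conn, "orders", "order_id")),
    ("orders.order_id unique", check_unique(conn, "orders", "order_id")),
]
for name, passed in checks:
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Run these on a schedule, alert when something fails, and you have the skeleton of the “automated systems” that bullet describes.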
What’s different about analytics engineering vs. the approaches to these problems that have come before is twofold:
Software engineering best practices are followed throughout the entire process, resulting in mature, production-grade systems and analytic assets.
Analytics engineers can run the entire insight-generation process, from beginning to end, without needing to file a ticket with another team.
This world is moving incredibly fast. Cloud data platforms are large and growing; new projects and companies are being started all the time; tons of VC dollars are flowing in; users everywhere are turning on to a new way of working, building their skills, and reshaping their organizations. Best practices are just beginning to take shape. It’s exciting.
And I’ve been covering all of it here, in this newsletter, since the beginning. So…this rename is actually long overdue! I’m not a data scientist 🤭, I’m an analytics engineer. And from here on out this newsletter will be called The Analytics Engineering Roundup :) Same content, same perspective, same author, different name.
Also: I just migrated (finally) from Revue to Substack. Not sure if that will impact this issue getting to your inbox, but please bear with me as I work out any kinks.
As always, it’s an absolute honor and privilege to be invited into your inbox every other weekend. And now, on to the good stuff.
This post is not…polished…more of a brain dump, really. But truly one of the most insightful pieces of writing (IMO) on how the market dynamics of the data technology ecosystem operate. Concepts like incentive / mechanism design, and designing your own incentives as an organization, have been so central to my thinking since the beginning but are often under-discussed publicly relative to their importance.
I think no one writes about this type of insider strategy because you’re either a) running a company based on this knowledge, or b) investing based on it. Either way, it’s awkward or actively detrimental to put this stuff in writing! Which is why it’s especially valuable when someone on the inside actually does.
If you’re curious about the trends and forces shaping this industry, this is well worth the time.
The problem, summarized in one sentence and one image:
The work that modern data teams are doing looks nothing like what’s being taught [in college].
Could not agree more—this is the rule, not the exception. As a result there are too many folks being forced to learn too much on the job relative to their software engineering peers who come out of college with many of the skills and experiences they need to get to work on day 1.
Oh…just wow. This post hits hard from the beginning:
If Data is the most precious asset in a company, does it make sense to have only one team responsible for it?
This question has only one appropriate answer! The author follows up shortly with:
Data Engineers should not build ETL pipelines.
This is almost a koan…repeat it to yourself daily for best results. Who builds pipelines if not data engineers? Well…everyone, of course. The people who actually know about all of the data and want to use the outputs of the pipelines! The data engineer’s job is to create the tooling ecosystem to enable this diaspora to thrive.
This post is gospel. The first one in the series is also quite good.
Confluent filed to go public! Just this past Wednesday, I believe. Confluent is the maintainer of Apache Kafka (roughly…the founders are the creators and the company employs the five biggest contributors).
I don’t talk a lot about Kafka / Confluent here. In the “modern data stack,” Kafka isn’t a core tool today; it’s not one that’s of central utility for analytics engineers. But the batch >> streaming conversation within analytics is gathering steam, and my expectation is that all of us analytics engineers will find ourselves more familiar with it in the future.
I got hyped back in 2019 when they released KSQL, a real-time streaming SQL capability (I had dreams of streaming dbt pipelines!). It turned out not to be quite ready for that use…at least IMO. The core challenge we ran into when playing with it was that there were pretty strict rules around how tables could be joined together, which felt quite confining relative to the standard experience of joining whatever tables you need.
It’s still not clear whether SQL-on-logs (Kafka) or something more database-y (Materialize) is going to end up bringing streaming to the world of analytics engineering.
There are so many interesting things to be learned about open source, open source business models, the transition to the cloud, and so much more in Confluent’s story, which this post does a great job of summarizing. If I had to choose just one thing that fascinates me here, it’s just how early we still are in the cloud transition—80% of Confluent’s revenue does not come from its cloud product! We’ll still be watching this transition play out for … a decade? More?
By the way, the post that introduced me to Kafka back in 2015 holds up super-well. Good use of time if this is new to you.
I really have no idea how I ended up reading this post—it’s a year old and I honestly have no idea whose Twitter feed or Slack thread I came across this in…but thank you. Hah! This makes me so happy. Over the years I’ve read so many posts glorifying the Tufte approach to visualization that almost feel religious. Don’t get me wrong, I’m a fan of the Tufte approach! But I am also a fan of nuance, of judgment. And so is this author.
Here are the things to love about this post:
a nuanced perspective…
from an incredibly skilled craftsperson (data viz @ The Economist?!)…
who shows really interesting examples to back up her point of view. Here’s one below:
This post is over a decade old, but discusses an aspect of game theory that I had been unfamiliar with.
Everyone may be rational. Everyone may assume everyone is rational. Everyone may assume that everyone assumes that everyone is rational. But at some point, some people implicitly are going to stop the sequence. In fact, on average, people only nest these assumptions to four levels.
Game theory is so so important in data work (and in life!), and this built a fundamentally new intuition for me. Four levels of recursion! Read the whole post to see why this was such an a-ha moment for me…it’s a quick read.
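The classic illustration of this idea is the “guess 2/3 of the average” game, and here’s a toy sketch of it (my own illustration, not from the post): a level-0 player guesses the midpoint of 0–100, and each level-k player best-responds to a population of level-(k−1) thinkers.

```python
def level_k_guess(k, anchor=50.0, factor=2 / 3):
    """Guess of a level-k thinker in the 'guess 2/3 of the average' game.

    Level 0 guesses the anchor (midpoint of 0-100); level k assumes
    everyone else is level k-1 and guesses factor * their guess.
    """
    guess = anchor
    for _ in range(k):
        guess *= factor
    return guess

for k in range(5):
    print(f"level {k}: {level_k_guess(k):.2f}")
```

Full rationality (k → ∞) drives the guess all the way to 0, but if people really do stop nesting assumptions around four levels deep, the winning move is to guess where a mostly-level-4 population lands, not where the equilibrium does.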