The Data Diaspora.
The Evolution of Data Engineering. Two posts on AI and LLMs (wild stuff). My thoughts on notebooks.
Episode 34 of the Analytics Engineering Podcast just dropped! In it, Julia and I talk to Prukalpa Sankar and Chad Sanderson. Copying from the episode text:
WARNING: This episode contains in-depth discussion of data contracts.
It is perhaps the best conversation on this topic that I've participated in, so I hope you enjoy it!
In other news, we're wrapping up our annual dbt Community survey. This helps us with planning as we figure out what investments to make over the coming year. If you haven't had 5 minutes to participate yet, this is your last opportunity! Thanks to everyone who has already contributed :D
Enjoy the issue!
- Tristan
This is not really about data and it's explicitly anti-useful, but it is Benn at his finest.
This is madness, and I'm being devoured by it. I wake up, I read the news, and I retrieve my jaw from the floor. Surely, this pace cannot sustain, I said in January of 2021, when the internet frothed up a mob to storm the Capitol; I said in February of 2021, when the internet frothed up a mob to break the stock market; I said in April of 2021, when the internet frothed up a mob to put all of their savings in dogecoin; I said in April of 2022, when the internet frothed up the richest man in the world to buy Twitter.
The whole post is short, and is maybe the most spot-on summary of my experiences living through the past couple of years in our online ecosystem. It's impressionism…built up dot after dot after dot; you step back and realize "holy shit that is wild."
I don't know about you, but I'm grateful that my plans for the day involve going on a walk in the woods. Also, go join a friendly Mastodon server. It actually is better.
Ok…I promise the rest of this issue will actually be focused on data.
---
Thalia Barrera writes perhaps the best overview of the state of data engineering since Max B's canonical post. If you read this newsletter regularly, you'll likely find that a lot of the content here is well-known: increasing abstraction, focus on software engineering best practices, bringing together stakeholders… But this is such an important story, and as I find myself having conversations with folks further and further away from the "early adopter" part of the curve, it's just so critical to have the whole story together in one place.
I cannot tell you the number of times I run into resistance from practitioners who feel that higher-leverage tooling is threatening (it shouldn't be!). Or from leaders who aren't familiar with the grand arc of history that is playing out here over the course of multiple decades. This future is already here, it's just not evenly distributed. And it's storytelling like this that brings everyone along.
---
This post by Mikkel Dengsøe really hits hard for me. I'm so interested right now in the question of who is really doing data work, and this is one of the more interesting things I've seen on the topic. While Mikkel comes at it from the perspective of recruiting others in the org to actually join the data team, I don't actually think that this is the most interesting reason to identify these folks.
I have this sense that the "data team" as currently construed is not something that will be commonplace in the future (let's say a decade from now). I'm not ready to make this argument for real, but I just have this multi-year-long gut feeling about it at this point.
The short version of the argument: not every data practitioner should sit on the data team. That wouldn't make any sense. That's like saying the only people who should write programs are the ones who sit in software engineering. This is not true. Programming is the practice of telling computers what to do. Essentially everyone at the company should, at some point, know how to program at some level.
Programming is like numeracy…a skill that most humans didn't have until society realized that should change. My oldest will apparently learn how to program when she gets to third grade, according to her school's curriculum.
Using data is a core part of the skillset of every single person who makes decisions. I certainly agree that we need specialists to sit at the very center and do the most complex tasks, but I also think we should pay a lot more attention to the data diaspora.
---
I like to link to the State of AI report every year. Especially as I get more and more ensconced in the world of analytics engineering and further from the advances taking place in AI, I find that this report is a tremendous way to catch up on the things I may have missed from the past year. Here's the exec summary:
[State of AI Report executive summary image]

---
This post blew my mind. Titled The Near Future of AI is Action-Driven, the post outlines how current tech, including Large Language Models (LLMs, discussed in the report above), makes some pretty wild stuff possible today. This is one of the most impressive demos of LLM capabilities I have seen (and there are many):
Pop out the screenshot and see what it's doing…impressive.
I don't include this here because I think you're likely working directly with LLMs (although maybe you are!). Rather, it's because I think that this type of workflow marries incredibly well with analytics engineering and I've spoken to several startups working at this intersection.
This type of question-answer flow works incredibly well on top of clean, validated datasets, but even more so on top of well-defined metrics. It's nearly impossible to ask a question on top of a giant pile of data piped in from who-knows-where, all in raw form. This isn't a technical limitation; it's a limitation of language: we just can't specify enough detail in a concise enough way to ask a single question and get a reasonable answer back if a large corpus of definitions hasn't yet been built.
Definitions, in language, allow us to work at higher levels of abstraction. And what are analytics engineers doing but building a large catalog of definitions? We are fundamentally librarians, defining things, creating order from chaos. The more powerful the interactive question-and-answer loop that sits on top of our work, the higher-leverage that work becomes.
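To make that concrete, here's a minimal, hypothetical sketch of a question-answer loop grounded in a catalog of definitions rather than raw tables. The metrics_catalog structure, the answer_question helper, and the call_llm stub are all illustrative assumptions for this sketch, not any particular product's API.

```python
# Hypothetical sketch: a question-answer loop grounded in a catalog of
# definitions rather than raw data. Names and prompt format are assumptions.

# A tiny "catalog of definitions": agreed-upon metrics the model can lean on.
metrics_catalog = {
    "active_customers": "COUNT(DISTINCT customer_id) WHERE status = 'active'",
    "net_revenue": "SUM(order_amount) - SUM(refund_amount)",
}

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM you use; not implemented here."""
    raise NotImplementedError("wire up a model of your choice")

def answer_question(question: str) -> str:
    # Because the definitions already exist, a short, ambiguous question
    # ("how many active customers?") can resolve to something precise.
    definitions = "\n".join(f"- {name}: {sql}" for name, sql in metrics_catalog.items())
    prompt = (
        "Answer using only these metric definitions:\n"
        f"{definitions}\n\n"
        f"Question: {question}\n"
    )
    return call_llm(prompt)
```

The point isn't the plumbing; it's that the catalog does the heavy lifting, so the question itself can stay short.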
Related: what can large language models do that "smaller" models can't? One author finds 137 distinct abilities.
---
This won't be the first time you read a history of notebook programming and it likely won't be the last. But I think it's a really useful time to check in on the state of notebooks and prognosticate about the future. I'll assume you've read (or at least scanned!) the above post before reading on, as it's a good overview.
Here's my belief: notebooks are just a mechanism to author a series of ordered computations. The notebook medium is not interested in what the computations are; it is instead a UX paradigm to develop / order / execute / present them. We tend to associate notebooks with Julia, Python, and R, the languages behind the original Jupyter name, but if you've used other, more recently-developed notebook products you'll realize that this is no longer true.
Take Hex, the notebook that I am personally the most familiar with today. Cells are totally arbitrary computation. Cell A can be Python, cell B can be SQL, and cell C can be dbt-SQL (or dbt-Python!) run through the dbt Semantic Layer. One could as easily imagine a "spreadsheet cell," where a dataset is loaded up into a spreadsheet interface and arbitrary operations are performed on top of it, only to be read back into downstream cells. Talk about a multi-persona data product!
The only rules in this world:
All computations must happen on top of a data frame interface
Each cell must be able to read in data from any data frame produced earlier in the graph
Each cell may output 0 to 1 data frames
If you'll note, this feels a lot like the rules of dbt:
All computations must happen inside of a single data platform
Each model must be able to read in data from any model earlier in the graph
Each model may output 0 to 1 tables
The hard part is making the interop work. dbt sidesteps this problem by using the cloud data platform as its processing layer, but most notebook products are unwilling to accept this constraint. Much of Hex's tech goes into making the interop between different cells feel totally invisible…it just kinda works.
And that, to me, is the other part of the notebook transition: it has to be in the cloud. The type of infra required to deliver on a magical polyglot experience is truly non-trivial. Whether you're using Arrow as the glue or DuckDB or Substrait, it's real work to make all of these different compute paradigms play nicely together.
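For a rough sense of what that glue looks like, here's a toy sketch of two "cells" exchanging data frames, with pandas standing in for a Python cell, DuckDB for a SQL cell, and Arrow as the interchange format (it assumes the duckdb and pyarrow packages are installed). The python_cell and sql_cell functions are my own illustrative framing, not how Hex or any other product actually implements its runtime.

```python
# Toy sketch only: two "cells" exchanging data frames via DuckDB and Arrow.
import duckdb
import pandas as pd

def python_cell() -> pd.DataFrame:
    # "Python cell": produce a data frame.
    return pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 25.5, 7.25]})

def sql_cell(upstream: pd.DataFrame) -> pd.DataFrame:
    # "SQL cell": read the upstream cell's output and emit a new data frame.
    con = duckdb.connect()
    con.register("users", upstream)  # expose the upstream frame as a queryable table
    arrow_table = con.execute(
        "SELECT COUNT(*) AS n_users, SUM(spend) AS total_spend FROM users"
    ).fetch_arrow_table()  # hand the result back as an Arrow table
    return arrow_table.to_pandas()  # downstream cells see a data frame again

# The runtime's whole job is to make this hand-off feel invisible.
print(sql_cell(python_cell()))
```

Even in this toy, the rules above hold: each cell only ever sees data frames, reads from upstream outputs, and emits at most one frame; doing that invisibly, at scale, across many compute engines is where the cloud infra comes in.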
So, here's my take on the future of notebooks:
Notebooks are a UX for authoring data flows, not a particular type of flow. This UX prioritizes interactivity and exploratory analytics.
Notebooks subsume all other types of computation as different cell types, and these cells can all exchange data with one another.
Delivering the ideal notebook experience requires a cloud-based product.