dbt-excel. A Symposium on Orchestration.

Plus just a dash of LLMs.

Apr 09, 2023

New podcast episode! Taylor Murphy and Pedram Navid join Julia to recap Data Council 2023 and have a bit of fun. They talked about streaming, how the MDS is growing up, new SQL variants, and, of course, AI.

—

dbt-excel may be my favorite April Fool’s joke of all time. And, from what I can tell (although I haven’t had the chance to play with it myself) it may also actually work. Which just makes me so happy—a handful of humans spent real time and energy making this thing. Here are my favorite parts.

First, the video on the home page is hilarious. “We need evolution, not revolution.” “We make the world a better place, one dbt adapter at a time.”

Second, there really is working code in the repo. It’s not a huge project, but there is real work that has gone in, including refactoring.

Third, the issues are freaking fantastic. Here’s the issue titled “[FEATURE] add support for Excel 95 and earlier”:

Fourth, the readme is truly hilarious:

1,048,576 rows ought to be enough for anybody.

I know a bunch of people involved in pulling this together, but I haven’t spoken to any of them about it, and so I got to go in cold. What a delight. Josh, Anders, the folks at GoDataDriven, anyone else who has been involved…nice work. I rarely laugh out loud while catching up on data community news :D

—

Stephen Bailey has been running a pretty cool thing for the last several weeks on his Substack: he’s hosting a symposium that asks the question “Is the orchestrator dead or alive?” He held a call for contributors and has published six articles so far (I think that’s it? although not quite sure…). They’ve all adopted Stephen’s colorful writing style and are fun points of view to inhabit. I want to highlight four of them here.

Vinnie Dalpiccol takes the original argument further and states that Nobody Should Write ETL.

The world I wanna live in is one in which some team, be it the engineers or the data producers themselves, declare source assets, including metadata and contracts about its shape, and consumers can hook up to them, the system being able to intelligently tell where they’re saved and how often they’re updated, and taking care of keeping all downstream dependencies up to date.

Vinnie took a roundabout way to get to this point in the post, but it’s a very reasonable one. It’s not the world we live in today, and I’m not sure how to get there from where we’re at, but it’s a vision of the future I could get behind.
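To make that vision a little more concrete, here is a minimal Python sketch of the "keep downstream dependencies up to date" part. Everything in it is an assumption for illustration: the `Asset` class, the `stale_assets` function, the asset names, and the integer "last updated" ticks are all hypothetical and are not taken from Vinnie's post or from any real tool. The idea is only that once assets declare their upstreams, a system can work out what needs refreshing without anyone writing a schedule.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A declared data asset (hypothetical structure, for illustration only)."""
    name: str
    last_updated: int  # hypothetical logical clock tick of the last refresh
    upstream: list[str] = field(default_factory=list)


def stale_assets(assets: dict[str, Asset]) -> set[str]:
    """Return assets that are older than an upstream, directly or transitively.

    Assumes the declared dependency graph has no cycles.
    """
    stale: set[str] = set()

    def is_stale(name: str) -> bool:
        asset = assets[name]
        for up in asset.upstream:
            # Stale if an upstream is itself stale or was refreshed more recently.
            if is_stale(up) or assets[up].last_updated > asset.last_updated:
                stale.add(name)
                return True
        return False

    for name in assets:
        is_stale(name)
    return stale


# Hypothetical declarations: two sources and two downstream assets.
assets = {
    "orders": Asset("orders", last_updated=5),
    "customers": Asset("customers", last_updated=2),
    "orders_enriched": Asset("orders_enriched", 3, ["orders", "customers"]),
    "revenue_report": Asset("revenue_report", 4, ["orders_enriched"]),
}

needs_refresh = stale_assets(assets)
```

Here `orders` was refreshed after `orders_enriched`, so both `orders_enriched` and everything downstream of it end up in `needs_refresh`. Nobody told the system when to run anything; it falls out of the declarations.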

Louise writes Will active metadata eat the orchestrator?

While it may appear that active metadata has the potential to replace certain aspects of the orchestrator, particularly the triggering mechanism component, it should be put at the service of a different use case.

This is an unusually transparent look at the strategy of a whole segment of the modern data stack from one of its participants. Fully agree with the conclusions.

Benoit Pimpaud states that Airflow's neighborhood must be razed.

Yes, the Airflow house is great. I loved Airflow’s garden, like any front-end developer who loved jQuery lighting at some point. But now the house is too big to keep all our furniture and decorations tidy.

Perhaps my single favorite idea from this post is summed up in this line: “The declarative paradigm is nudging every part of the data stack.” I strongly agree with this statement, and I think many of the problems we still have are ones where we have not yet extended the declarative paradigm far enough. This is hard, because being declarative requires a fairly thick layer of infrastructure. You can’t write declarative SQL, for example, until you’ve fully fleshed out the study of relational algebra and built this complex thing called a “database”. In order to drag other areas from the imperative to the declarative, we will need to do similar levels of work. dbt has done that with a lot of data pipelines, and that work is just now starting to happen in the world of orchestration.
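A small sketch of what "declarative" buys you here, using Python's standard-library `graphlib`. The model names and the `run_order` helper are hypothetical, and this is not how dbt or any particular orchestrator is implemented. It only illustrates the shift: instead of hand-writing the sequence of steps, you declare what each model depends on and let a layer of infrastructure derive the order.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
# Note that nothing here says *when* or *in what order* to run anything.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_report": {"orders_enriched"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Derive a valid execution order from declared dependencies."""
    # static_order() yields each node only after all of its predecessors.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

The imperative version of this is a script that runs the four steps in a hard-coded sequence and has to be edited by hand every time a model is added. In the declarative version, adding a model means adding one entry to `dependencies`, and the ordering takes care of itself.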

Benjamin Djidi asserts that Orchestration is a time killer.

…you simply can’t orchestrate everything. It introduces complexity and latency into your data operations and is generally a very efficient way to obliterate many weekly hours into an activity that ultimately shouldn’t exist.
Imagine if everything you run in prod had to be orchestrated, that would be the death of agile. Really, why is this still a thing in the data world?

He goes on to share pretty concrete thoughts around what the future looks like and how it differs from the present, even taking aim at the idea of a “job” with a defined start and end time—something that we take for granted today! It’s very easy to imagine a state of the world in the not-too-distant future where the concept of a “job” seems very legacy.

—

My thoughts on orchestration are fairly straightforward. They are as follows.

The current status quo is obviously not optimal. It is fragile, too effortful, etc. We don’t use it because we like it, we use it because it’s the best we can do for the moment.
The entire concept of an orchestrator is predicated on a batch-based paradigm. This paradigm is ever-less-relevant as data systems mature.
Data processing can be run in a far more continuous modality, even if the underlying computation is better described as micro-batches.
Any time spent on orchestration is inherently not value-creating. The ideal amount of time to spend orchestrating things is…none. This is a common trait of various activities throughout software history. How often do you think about resource sharing between applications on the supercomputer you’re currently using to read this? When we solve problems of this type appropriately they simply vanish into the background and we forget they ever existed.

The industry is already pushing in this direction and I think we’ll see this change accelerate in the coming 12-24 months.

—

Last thing. I haven’t been able to stop thinking about this post since it first came out. What kinds of ChatGPT plugins need to exist to fundamentally change the practice of analytics? I truly believe composability + LLMs starts to become a huge unlock for the entire data ecosystem.

There’s wayyy too much in this topic to go too deep on for the moment. For now, the Wolfram Alpha plugin allows a user to ask:

How far is it from Chicago to Tokyo?

…and get a good answer. What if you could instead ask:

How are same-store-sales trending YoY in North America?

I think we’ll see this solved very effectively in the next 12 months for proprietary company data; I’m thinking about this as a “level 1” question. I think there’s also a “level 2” type of question, which looks more like:

Why are same-store-sales trending downwards in North America over the past year?

I think the road to that solution is still somewhat less clear, but I think it’s more conceivable now than it’s ever been. Which is mind-blowing.

The Analytics Engineering Roundup
