My Thoughts Going Into a New Year
AI's (lack of) impact on data practitioners (so far). AI and content rights. OSS Licensing. dbt project hygiene. Where we're at with the MDS.
Happy new year! It’s good to be in your inbox after a bit of a break. I entered 2023 with a lot of trepidation; I’m entering 2024 energized and excited. I wonder if you feel similarly?
In my first post of 2024 I figured I would share some thoughts I have going into the year. These are the things taking up space in my brain today. They are not really predictions, but things I am actively curious about / thinking about / conflicted about. Feedback very welcome.
AI’s impact on most data practitioners has been modest to date…when will that change?
Go out 2, 3, 5 years and you could get me to believe almost anything about AI’s impact on our profession. I don’t know and you don’t know. It’s fun to speculate about sometimes, but honestly my appetite to continue to read content about “The impact of LLMs on [traditional data topic]” is quite limited at this point. We’ve all done a lot of speculating; enough already.
But has your job—literally, your professional day-to-day—changed much in the past 12 months?
At dbt Labs, AI has started to impact various parts of our business. We just bought Notion’s new AI features (we are huge Notion users), and after a trial with the whole company it solved some real problems for us. And we are increasingly finding ways to deploy AI in customer support—not in directly customer-facing ways, but as an assistant to our support folks. Metrics have been very positive; the impact is real. We have some other projects in the works as well, stuff I’m excited about.
But most data analysts, analytics engineers, and data engineers have not seen their roles change meaningfully yet, as far as I can tell. Why? When will that change? What will unlock that change?
Perhaps the most surprising thing about this is that AI seems to be impacting the lives of data professionals more slowly than the lives of software engineers. Even data professionals who primarily spend their days writing code! Software engineers are already seeing significant efficiency uplift via current versions of GitHub Copilot; can’t data people benefit from the same thing?
My guess is that this is because most application code isn’t as tightly coupled to the underlying data. It more often contains type information. It more reliably has clean abstractions. So an LLM can, with this code, a prompt, and no other context, do useful work. But when is the last time that you were able to build a dbt model without heavily interacting with your underlying data store? I don’t know that I have ever written dbt code without a database terminal in front of me. No one has recreated that process in an LLM-enabled code authoring experience yet.
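To make that concrete, here’s a minimal, hypothetical dbt model (all names and status values are invented for illustration). Nothing in the file itself tells you, or an LLM, which values actually appear in the status column—whether they’re lowercase, title case, or coded integers. You have to query the warehouse to know.

```sql
-- Hypothetical model; names and values are illustrative only.
-- The case statement below is correct only if these literal status
-- values actually exist in the data, and nothing in this file can
-- confirm that. A human verifies it with a quick query against the
-- warehouse; an LLM reading just this file cannot.
select
    order_id,
    ordered_at,
    case
        when status in ('placed', 'paid') then 'open'            -- assumed values
        when status in ('shipped', 'delivered') then 'fulfilled' -- assumed values
        else 'other'
    end as order_state
from {{ ref('stg_orders') }}
```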
Certainly, this is solvable. And it’s solvable with current-generation capabilities. I think this is a product design question at this point and I personally expect that we—data practitioners—will start to see our workflows being more significantly impacted by AI in the coming 12-24 months.
Sidenote: Am I an AI late adopter?
Ok, I’m not using this newsletter simply as an opportunity to work out my own personal issues in public, but I want to share something I’ve come to realize about myself in the hopes of hearing your thoughts.
I think I am personally an AI late adopter. I realize how that sounds—I sound like an absolute neanderthal. Do I not realize that AI is the future?
The weird thing is: I am actually a huge believer in AI. I understand the tech, I have used a lot of AI-enabled products, I have experimented with changes in my workflow. And yet, every time, I have come to prefer the existing way I work over enhancing my capabilities with AI. Here’s why:
I am a good writer; I have a voice. I don’t want an AI to write a bunch of generic prose for me in an internal doc or an email. And I don’t want to edit that AI-generated content to make it more ‘personal’ either. Writing is most of my job; when I write something it has to be my brain doing it.
I have great support staff who help me out with a lot of different things that, sometimes, an AI might have been able to handle. But try to convince me to trade my real human team for AI assistants and I will fight you. It’s not close.
I don’t yet trust AI-generated analytics. I could imagine getting there, and that would be a HUGE accelerator, but we’re not there yet. And, as discussed earlier, neither is AI-assisted dbt modeling.
Like I shared earlier, we are making AI investments at dbt Labs; this is just commentary on my own personal workflows. The one tool I’m playing with right now is the Perplexity Chrome extension. We’ll see if it’s sticky.
I would love to find ways to bring AI into my day-to-day that actually fit with how I work. If you have thoughts / suggestions, please let me know.
The OpenAI / NYTimes Lawsuit will resolve a lot of ambiguity around AI and content rights
Ben Thompson has the most thoughtful take on this IMO. I won’t try to summarize his point as, in classic Stratechery fashion, he tells quite a story. The heart of the piece is splitting fair use into inputs and outputs—i.e. it is fine to “input” copyrighted material, but it is illegal to “output” copyrighted material. I 100% agree that this is the central point of the case.
Honestly, I think I basically agree with OpenAI’s public stance here.
I only have two additional points to make beyond what Ben covered:
This is not that different from news publications being pissed at Google a decade ago. I think the resolution will be the same (i.e. mostly no intervention).
I’m not necessarily making a judgment on whether this is Good or Bad—I honestly don’t have a strong perspective there. But I do think that this is what is consistent with current Fair Use doctrine and how the technology actually works. If we want a different outcome, we have to pass different laws. And I think that is quite unlikely in the current political environment.
Zooming out, I do believe that getting clarity on this topic, with case law on the books as soon as possible, would be excellent for innovation. There are a lot of people with half-baked viewpoints on this topic, and that creates a level of uncertainty that is unhelpful.
OSS licensing remains a hot topic
Last year, HashiCorp made a huge licensing switch away from permissive OSS. I went on stage at Coalesce and committed to keeping dbt permissive OSS. Just this week, Snowplow announced that they are migrating away from an OSS license.
HashiCorp products are used by millions of developers, and Terraform was a big part of the inspiration for dbt. Snowplow is the second most widely used web analytics product (behind Google Analytics). These licensing changes are really massive. As a believer in open source software, I am a little disappointed in the way this has played out. As a CEO, I understand the pressures that can lead to these types of decisions. I do believe that dbt has a solid commercial path without needing to make this change.
I think the key is to make good decisions early on about what is open vs. what is proprietary and monetized, and to continue to iterate on that as you get signals from the market.
dbt project hygiene, scalability, and self-service will continue to be in focus for the data community
The ability to get more people in an org involved in writing dbt code, sharing their work, and making sure that project architecture and standards are maintained remains a huge focus for the ecosystem. I’ve talked a lot about that over the past year, and we’ve shipped a lot of product to help with it. dbt Mesh is resonating, as is dbt Explorer. There is more to do.
I care about pulling many more people in an organization into the dbt workflow, but making sure that happens in a structured, governed fashion. I think we have a clear path toward doing exactly that, and the upside for organizations is tremendous. That’s a topic for another day.
dbt project hygiene is an issue in my personal day-to-day. I was doing work in our dbt project the other day and realized that a model I was using—one I had created years ago and hadn’t touched recently—had been essentially “stealth deprecated”. Another model had been built to supersede it, but the folks who built that other model hadn’t done the work to go back and remove my old code (maybe 6-8 models). This became a source of confusion and inefficiency, never mind unnecessary warehouse spend.
Identifying, deprecating, and deleting old dbt code is an important part of dbt project hygiene, and today it takes more work than it should. (If you want a good post on this topic, Jay Sobel wrote a great one.) We also don’t create enough space for our data teams to do that work, and that should change too.
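To sketch what that work looks like today: on Snowflake (this is a Snowflake-specific sketch; access_history requires Enterprise edition, and the database name below is a placeholder), a query along these lines surfaces relations that nothing has queried in 90 days—a decent shortlist of deprecation candidates.

```sql
-- Sketch: list relations that nothing has queried in the last 90 days.
-- Snowflake-specific; access_history requires Enterprise edition.
with last_access as (
    select
        obj.value:objectName::string as object_name,
        max(ah.query_start_time)     as last_accessed_at
    from snowflake.account_usage.access_history as ah,
         lateral flatten(input => ah.direct_objects_accessed) as obj
    group by 1
)
select
    t.table_schema,
    t.table_name,
    la.last_accessed_at
from analytics.information_schema.tables as t   -- 'analytics' is a placeholder database
left join last_access as la
    on la.object_name = t.table_catalog || '.' || t.table_schema || '.' || t.table_name
where la.last_accessed_at is null
   or la.last_accessed_at < dateadd('day', -90, current_timestamp())
order by la.last_accessed_at nulls first;
```

From there, dbt’s own artifacts can help close the loop: manifest.json includes a child_map, so before deleting a model you can check whether anything downstream still refs it.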
We continue to be in the deployment phase for the MDS
The modern data stack that we’ve all come to love over the past decade isn’t going anywhere; its categories are getting increasingly mature and increasingly well-integrated. Its technologies and best practices are getting more widely deployed, both to more companies and more broadly inside of companies.
This is the phase of any cycle where the real work gets done and where the real value gets created. It’s the phase for living in the trenches and solving real problems. The MDS was the future five years ago and it’s still the future today, but we actually have to roll up our sleeves to make the replatforming happen. 30 years of investment in the prior paradigm doesn’t get upended simply because of a few insightful blog posts.
This is where I spend much of my time now: working with companies to build bridges to this future. As an MDS early adopter, it was easy for me to imagine that the most interesting problems were solved years ago. But the world is so much bigger than I had any ability to imagine back then, and there are so many more problems to solve to bring the fundamental innovations of the modern data stack to everyone who needs them.
This is what my 2024 is about. I’m looking forward to getting at it alongside you.
Here are my thoughts on AI data workflow integration: number one use - reformatting and light cleaning. GPT works well for getting text out of an image, for reformatting code, for doing things you could do with regex when you don’t feel like writing regex, for writing an annoyingly long case statement, for pivoting or unpivoting data, etc. GPT hasn’t worked so well doing any kind of analysis with real complexity, but it can do a little bit. My favorite use case that turned out pretty well was converting SQL between SQL Server syntax and Snowflake syntax. I would say it saved time but definitely wasn’t perfect.
Also - in the realm of dbt cleanup, I’ll add implementing new dbt features. I know I have some projects from around 2019 that are still in production that I doubt anyone went back and added the concept of sources to, much less any concept of metrics or semantic models, mesh, etc. AND I know I didn’t really document them very well. The use case was honestly version control for database views... and that’s about it. Now I would add a lot more - but I suspect many companies may just be stuck at the “version control” stage. I say stuck - but still in a way better spot than before...
OSS/HashiCorp: see OpenTofu and OpenBao - Terraform/Vault were not big/complicated enough to be monopolized successfully, and the license allowed forking.