Reading the Clouds. Llama 2 and Licensing.
Plus an open question on the future of BI: looking for your thoughts!
Hi! Welcome to August. Before diving in, I just want to call your attention to a feature that we’ve started doing on the dbt blog called the Community Spotlight. Once a quarter we share the stories of members throughout the community, and we just released a new cohort. If you’re looking to get more deeply involved in the dbt community, these folks are fantastic examples of how to do exactly that.
To Opeyemi, Jing, Alan, Faith, Owen, and Josh: thanks for everything you do to make the dbt Community what it is. 💜
Enjoy the issue!
Is this working? This whole ‘get all my news from newsletters’ thing? I used to like Twitter and RSS, but both of those formats have become…less useful, to put it mildly. Over the past six months I’ve converted most of my news consumption habits over to newsletters.
I must admit that I’m not that happy with my habit and workflow so far. I’ve set up all of my Superhuman news split rules so that I have a ‘newsletters’ inbox, but I basically never have enough time to actually go in there and catch up. Ever since setting this up, my unread count has only climbed; a month or so in and I’m now at 277 unread emails in this split, mostly about data. In other contexts I’m an inbox zero person, so this unread count feels like a failure of some kind—either personal or systemic, I can’t tell which.
The funny thing is: while I feel totally underwater with my now-newsletter-powered news feed, the stats for this particular newsletter are continuing to climb. The Analytics Engineering Roundup (previously the Data Science Roundup) has now been active for going on 8 years. It started with ~6k subscribers and has steadily climbed to 22k. Not a huge number, but not terrible for what is ultimately a pretty niche topic. And the engagement numbers continue to be very strong.
So, maybe it’s just me? Maybe I’m unusual in my longing for the glory days of blogging and RSS feeds? Or even of Twitter circa 2019. :shrug: Regardless, I continue to be excited for the simple act of writing this newsletter: as has always been true, it forces me to stay current on the industry that I work in. Deadlines and accountability have always been important to motivate me to do hard things, so thank you for supplying both.
You’ll note that our authorship is changing a bit. My partner in crime for the last couple of years, Anna Filippova, is off to find her next adventure. I wouldn’t be surprised if she still showed up in this space from time to time as a guest, but we’ll lose her in our regular cadence. This is a bummer for me, as I know that there are many topics that Anna writes about much more effectively than I do, but I wish her the best in her next adventure. I hope you’ll join me in wishing her well.
As we adjust, we’ll be going to an every-other-week cadence for a little while.
Thanks, as always, for sharing your Sundays with me.
What I see in the clouds
I want to start out talking about cloud earnings. While this feels very disjointed from the day-to-day concerns of data practitioners, the tidal waves that show up in the earnings of the hyperscalers cascade outwards and impact each of us in myriad ways, big and small.
There are many, many places to read about cloud financial results, but my regular source continues to be Jamin Ball’s Clouded Judgment. His recent coverage is quite succinct and hits the important parts. Here’s a quote from AWS’ earnings:
“What we're seeing in the quarter is that those cost optimizations, while still going on, are moderating and many maybe behind us in some of our large customers. And now we're seeing more progression into new workloads, new business. So those balanced out in Q2. We're not going to give segment guidance for Q3. But what I would add is that we saw Q2 trends continue into July.”
AWS is the largest of the hyperscalers by a healthy margin; its financial results are one of the best measures we have for the status of the data and technology space as a whole. The trends in customer behavior it is observing are a good indication of patterns of behavior across the entire industry.
Spend on data infrastructure was one of the big drivers of new cloud workloads in the 2020-2022 time period, and retrenchment and optimization have defined much of our industry for the past year. That has led our conversations, from CDOs all the way down to individual data practitioners, to be more focused on how to maximize business value and minimize cost rather than how to make progress towards a five-year vision of the future. This impacts the public conversation in our industry, product roadmaps, startup formation, everything. It is the current in the river we’re all swimming in.
If you think of the data value chain (from ingestion to storage/compute to transformation, quality, monitoring, discovery, orchestration, etc.), the categories that are the most ‘mature’ today are the ones that made significant progress prior to the current period of retrenchment. Specifically: ingestion, storage/compute, and transformation were all a part of “modern data stack wave #1” and were, on some level, a part of an accepted standard. The categories that will likely form “wave #2” had not yet become generally accepted wisdom as of early 2022, and progress slowed for ~a year as companies shrank their appetites to take on new initiatives.
What AWS is saying in this commentary is that we’ve hit something of a turning point, where customers are starting to focus on what’s new, on what’s next. Spinning up new projects rather than eking more value out of what exists.
This is exciting to me. While we have certainly made a lot of progress as an industry since Redshift’s launch in 2013 ushered in the era of the Modern Data Stack, we have much further yet to go. We can now move data, run compute over it at scale, and build increasingly mature systems that transform data into knowledge, and we can do all of this with lower and lower friction and greater performance.
But we have not tamed the chaos. There is still tremendous duplication of work, inconsistency of metrics, and inability to find data assets, assess their trustworthiness, and govern their ownership. We spend too much time as mechanics, replacing spark plugs and changing oil, and too little time engaging with the data on a strategic level.
We are not even close to done evolving as a profession. The change that we will see over the coming decade will be at least as great as the change that we’ve seen over the past one.
This is why I’m excited about the news coming out of the hyperscalers’ earnings. The past year and a half of retrenchment has been good and healthy. I’m excited to return to looking to the future; to dive in and solve the next set of problems.
Llama 2, licensing, and competition in LLMs
Llama 2’s launch is a very big deal. It’s not GPT-4, but it may be the highest-performing open source model today. The two biggest facts to know about this release, IMO:
Meta released the weights, not just the model.
Both the weights and the model are licensed for commercial use.
These are both new developments relative to the original Llama release, and are a very big deal for AI overall.
Beyond allowing commercial use, there are two things in the license that I find fascinating. First, Meta explicitly forbids its direct competitors from using Llama 2:
2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
Heh…I’ve never seen a license quite like that! It’s an interesting way to limit the commercial use to just the competitors that Meta cares about while allowing everyone else to innovate. It’s not going to satisfy religious OSS advocates, but I think the world is increasingly becoming more pragmatic on this topic. I’d prefer a license that is largely open rather than a religious war that kept the license much more closed.
Second, and maybe even more interestingly, Meta explicitly forbids Llama 2 from being used to improve other models:
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
I find this interesting because it feels like a potential battleground. What has become very clear over the past several months of rapid progress in the field is that LLMs are actually quite effective tools for producing better LLMs. There are a couple of ways I’ve seen this done:
Generate training data sets using Model A to train Model B. For example, generate a million children’s stories to train a small model that is great at understanding language and semantic concepts, but is largely lacking specific content.
Evaluate the performance of Model A vs. Model B. It is increasingly common to use a state-of-the-art model to evaluate the performance of smaller models. Meta’s Llama 2 paper does this using GPT-4.
I’m sure there are others that I’m not aware of. Both of these are interesting feedback mechanisms that allow industry progress to feed on itself, creating an accelerating flywheel: every successive model is another stepping stone to create further improvements. If you’re looking for the fastest-possible progress in the industry this is good, but it may not be desirable from the perspective of each individual model producer.
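For the more code-minded among you, here is a rough sketch of the two feedback mechanisms described above. To be clear about what is assumed: every name here (generate_training_set, judge_win_rate, and the toy_* functions) is hypothetical and made up for illustration, and the "models" are plain Python functions standing in for real LLM calls. It shows the shape of the loop, not anyone's actual training or evaluation pipeline.

```python
from typing import Callable, List, Tuple

# A "model" here is any function from prompt text to output text.
# These are hypothetical stand-ins, not a real LLM API.
Model = Callable[[str], str]
# A judge sees the prompt plus two answers and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def generate_training_set(teacher: Model, prompts: List[str]) -> List[Tuple[str, str]]:
    """Mechanism 1: use Model A's outputs as training examples for Model B."""
    return [(prompt, teacher(prompt)) for prompt in prompts]


def judge_win_rate(judge: Judge, model_a: Model, model_b: Model, prompts: List[str]) -> float:
    """Mechanism 2: a stronger model judges A vs. B; returns the share of prompts A wins."""
    if not prompts:
        raise ValueError("need at least one prompt to evaluate")
    wins = sum(1 for p in prompts if judge(p, model_a(p), model_b(p)) == "A")
    return wins / len(prompts)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_teacher(prompt: str) -> str:
    return f"Once upon a time, {prompt}. The end."


def toy_small_model(prompt: str) -> str:
    return f"{prompt}."


def toy_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    # Prefers the longer answer, purely for illustration.
    return "A" if len(answer_a) > len(answer_b) else "B"


prompts = ["a fox found a red ball", "a girl planted a seed"]
dataset = generate_training_set(toy_teacher, prompts)
win_rate = judge_win_rate(toy_judge, toy_teacher, toy_small_model, prompts)
```

In a real setup the teacher and the judge would be calls to a large model, and the dataset would feed a fine-tuning job for the smaller one. Both functions consume nothing but model output text, which is exactly the kind of use the license clause quoted above is written to restrict.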
I am not a lawyer, but … this seems like a very challenging stance to take simply from an enforceability standpoint. The output of Llama 2 is simply text, and (as far as I know) there is no way to trace a specific piece of text back to the model that produced it. So :shrug: I’m not sure how this stipulation works in practice. But it is interesting to see this license term as it seems like a fairly clear indication of Meta’s perceived strategic threats.
Right now there is a lot of variability in the availability and licensing terms of code and model weights for the leading LLMs. I expect this to settle out over the coming ~year as we start to see the strategic positions of the players cluster into a small number of camps. This will be quite determinative in shaping the current AI wave.
The future of BI
I leave you with a question, one of the biggest questions in my brain in this period of transition for the data ecosystem. I would welcome your thoughts, well-thought-out or totally disorganized. Here it is:
How does LLM-powered AI impact the BI ecosystem?
From what I can tell, there are basically two answers:
Status quo. Sure, there will be some nice interfaces built that are powered by LLMs, but the core BI experience will not be changed in ways that are fundamentally disruptive. The same people will do roughly the same jobs they did previously, and the same vendors will likely dominate (now with new AI-powered features).
Disruptive. The natural language interface to data will fundamentally disrupt what we currently think of as BI and the entire ecosystem will need to be rebuilt from the ground up. The human jobs-to-be-done will shift significantly and more humans will be able to interact with data more directly.
I have heard smart people take both sides of this. I generally find myself pulled towards perspective #2, but I don’t have any real conviction behind this. It is one of the open threads in my brain that I am constantly looking for new data points on. If you have thoughts to share, I’d welcome them.