Ep 37: What does Apache Arrow unlock for analytics? (w/ Wes McKinney)
A must-listen if you're looking for a deeper understanding of the guts of the data stack (file formats, chip design, etc.).
Wes McKinney, among other open source contributions, is the creator of pandas, co-creator of Apache Arrow, and now Co-founder/CTO at Voltron Data.
In this conversation with Tristan and Julia, Wes takes us on a tour of the underlying guts, from hardware to data formats, of the data ecosystem.
What innovations, down to the hardware level, will stack up to deliver significantly better performance for analytics workloads in the coming years?
To dig deeper on the Apache Arrow ecosystem, check out replays from their recent conference at The Data Thread.
Listen & subscribe from:
Key points from Wes in this episode:
Tell us a little about what Arrow is and the promise of the technology.
Sure. So in 2015, partly in response to spending a couple of years thinking about how we could build a better computational foundation for data frames, I became exposed to the big data world and found that plugging the data science ecosystem into the big data stacks (things like Apache Spark, database systems more generally, and the Hadoop ecosystem) was extremely difficult. And I started poking around the open-source ecosystem. I found that there were a lot of other developers who were thinking about how to make data systems more modular and composable.
And we realized that one problem we could solve that would make this problem of composability and modularity a lot easier was having a universal, language-independent data standard for tabular data. And so, the initial purpose of the Apache Arrow project was to develop a universal column-oriented data standard that could be used portably across data engines and programming languages.
And that solved an immediate need: moving a large data set in-process between, say, C++ and Java. We can do that now. That solved some pretty critical, low-level systems interoperability challenges. But the Arrow project has, over the last six years, blossomed into this multi-language toolbox of libraries for building analytical computing systems.
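As a quick illustration (not from the episode), here is a minimal sketch of what that shared columnar representation looks like from Python using pyarrow; the column names and values are invented for illustration.

```python
import pandas as pd
import pyarrow as pa

# A small pandas DataFrame; the columns and values are made up for illustration.
df = pd.DataFrame({"city": ["NYC", "SF", "CHI"], "trips": [120, 85, 97]})

# Convert it to Arrow's language-independent columnar representation.
# The same in-memory layout can be handed to C++, Java, Rust, etc. without copying.
table = pa.Table.from_pandas(df)
print(table.schema)    # city: string, trips: int64
print(table.num_rows)  # 3
```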
So we've developed a data format and then built protocols for serializing and transporting that data format. We built a framework for building high-performance network services, data services, and microservices that transport tabular analytical data sets. We're building standards and protocols for databases with built-in native support for the Arrow data format.
And more recently, we're also building and supporting the development of modern computing engines native to the Arrow format. And so that's given rise to things like DataFusion in Rust, Acero in C++, and Arrow-compatible databases and query engines like DuckDB and Velox (led by Meta). And so it's very interesting.
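As a rough sketch of the network-services piece mentioned above (not from the episode), here is what a minimal Arrow Flight service and client might look like in Python with pyarrow; the port, ticket contents, and table are invented for illustration.

```python
import threading
import time

import pyarrow as pa
import pyarrow.flight as flight

class TinyFlightServer(flight.FlightServerBase):
    """Serves one in-memory Arrow table over Flight (gRPC)."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._table = pa.table({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

    def do_get(self, context, ticket):
        # Stream the table back to the client as Arrow record batches.
        return flight.RecordBatchStream(self._table)

server = TinyFlightServer()
threading.Thread(target=server.serve, daemon=True).start()
time.sleep(1)  # sketch-level synchronization: give the server a moment to start

# A client (in this or any other process or language) fetches the table
# with no format conversion on either side.
client = flight.connect("grpc://localhost:8815")
table = client.do_get(flight.Ticket(b"any-ticket")).read_all()
print(table)
server.shutdown()
```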
I understand that Arrow is a columnar memory format that makes it easier for different languages to transport data and read and understand it in memory. Can you provide some lower-level systems detail to help those who are less familiar with Arrow come up to speed?
Yeah, so one of the pieces of the project is what we call the Arrow columnar format, which is a language-independent memory format for data frames and tabular data.
It's how you arrange the data in memory for processing. It's how it fits in your computer's RAM. You can also put it on disk and load it into memory without having to do any conversions or deserialization, which is a very helpful feature in building systems.
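To make that concrete, here is a minimal sketch (not from the episode) using pyarrow: a table is written to an Arrow IPC file and then memory-mapped back, so the data is referenced in place rather than parsed or copied. The file name and contents are made up.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small table to an Arrow IPC file on disk (file name is arbitrary).
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
with pa.OSFile("example.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back: the buffers are referenced
# in place, with no conversion or deserialization step.
with pa.memory_map("example.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()

print(loaded.equals(table))  # True
```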
A lot of the development work has been building libraries and components across a wide variety of languages to make it easier for developers to build Arrow-enabled systems.
There are all these different languages that Arrow's compatible with: C, C++, Java, JavaScript, Go, Python, Rust, Ruby, and so on. If most data work is done in Python, SQL, and Scala, why does it matter that JavaScript can now read from this memory format?
It comes down to speed, efficiency, and latency in applications. Just to use the JavaScript example, several projects have adopted Arrow as the interchange format for connecting their web app's front end to the back end, which might use Python or Java or Rust or something else.
So basically, you can send data from the back end of your application to your front end, and there's effectively zero overhead aside from actually moving the bytes over the network.
And so you don't have that extra hit of overhead of: okay, I received the data from the back end, now I have to convert it to the data format I use in my front-end application in JavaScript. And so we've seen that. Streamlit was one example of a project (it's now a part of Snowflake); they built this framework for rapidly building machine-learning applications in Python. And so there was a lot of data communication between the front end and the back end. And they had a bunch of custom code that they used for dealing with the conversions between the front end and the back end.
And when they adopted Arrow, not only did they make things significantly faster (you can look up the article; I don't know the exact number offhand, but it was something like 10x faster), but they were also able to delete all of this custom code that they had written.
And so things get faster. You get to delete a bunch of custom code. It's a win-win.
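Here is a rough sketch of the back-end side of that pattern (an assumption for illustration, not taken from Streamlit's code): the Python back end serializes a table to the Arrow IPC streaming format, and those bytes can be parsed directly by the Arrow JavaScript library in the browser, so no custom conversion code is needed on either side. The table contents are invented.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Table that a hypothetical backend endpoint wants to return to the browser.
table = pa.table({"user": ["a", "b"], "score": [0.9, 0.4]})

# Serialize to the Arrow IPC streaming format: these bytes can go over
# HTTP or a WebSocket and be read by Arrow's JavaScript IPC reader
# on the front end with no bespoke conversion step.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

payload = sink.getvalue().to_pybytes()  # bytes for the response body
print(len(payload))
```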
Latency has a few components: disk access, converting between memory formats, and network access. If Arrow is taking a big chunk out of the format-conversion piece, is that where most of the time goes in most data applications, or is it only a small percentage?
It can be a big part. We've seen reports of, and people talk about, 80% of a data application's time being spent on serialization, like converting between memory formats.
But nowadays, particularly with the way that modern processors, like GPUs and CPUs, are developing, the way that you arrange the data to be processed makes a big difference. If you arrange the data incorrectly, it can make the processor much less efficient.
And so modern CPUs like column-oriented data: with it, you can use SIMD instructions, which stands for single instruction, multiple data. They're processor intrinsics that allow the processor to process multiple values in a single CPU cycle.
And so, as time has gone on, the number of values that can be processed with a single instruction has increased, up to 512 bits at a time. Traditionally, CPUs could do 64 bits; they could only deal with one 64-bit value at a time per CPU cycle.
And so essentially it's about arranging the data to maximize throughput for the processor, not only so you can process values using these vector instructions, but also so that you're taking advantage of the processor's cache hierarchy. Because the modern CPU is not just a processor that runs x86 or ARM instructions; there are these three layers of L1, L2, and L3 cache. And so when you're designing the processing engine for your data system, as the system developer, you have to design around how data is going to flow between these different levels of cache, and that can make a 10 to 100 times difference in performance if you're using the processor effectively. Of course, even though Moore's law for CPUs has slowed down, processing efficiency is still getting better, especially in GPUs.
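A rough, Python-level illustration of the layout point (not from the episode): real engines use explicit SIMD and cache-aware algorithms, but even at this level a contiguous column lets optimized native loops do the work, while a row-at-a-time layout forces the interpreter to touch every value. The data here is synthetic.

```python
import numpy as np

# One million rows, stored two ways; the values are synthetic.
rows = [{"price": float(i), "qty": i % 10} for i in range(1_000_000)]  # row-wise objects
prices = np.array([r["price"] for r in rows])  # one contiguous float64 column

# Summing the contiguous column hands a tight loop to vectorized native code.
column_sum = prices.sum()

# Summing through the row objects pays interpreter overhead on every value.
row_sum = sum(r["price"] for r in rows)

assert abs(column_sum - row_sum) < 1e-6
```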
So a lot of the innovation in processing power, the ability to process mass quantities of numeric or analytical data, is not happening in CPUs as much anymore. It's happening in GPUs and FPGAs, and even custom silicon.
And so I think the next generation, the next frontier for analytical data processing systems, is going to play out similarly to what happened with cryptocurrency mining. It's a dumb example, but it started out running on CPUs, then it was GPUs, then FPGAs, and now it's all custom silicon. If there are opportunities to leverage GPUs and other accelerated computing technologies to run SQL queries and do analytics faster, they'll be taken. When there's a will, there's a way.
I don't know if you've contemplated the amount of money that's being spent globally on running analytics workloads or doing machine learning, but it's probably pretty insane if we actually could get the number.
George Fraser wrote a post on Hacker News. He said “Arrow is the most important thing happening in the data ecosystem right now.” What is the crazy out-there dream if Arrow becomes adopted everywhere and people can read from the same format? How is our world going to change?
It's hard to extrapolate 10 years out. But the main thing that I'm really excited about, and one of the reasons why we started this company, Voltron Data, is that we're really investing in making the different layers of the modern analytical system more modular, composable, and polyglot.
And so I think we'll have ultra-high-performance, scalable, language-integrated querying capabilities, whether it's querying for analytics or feature engineering and pre-processing for machine learning.
Because ultimately, what we want is to align the interests of software developers and hardware developers, so we can have the hardware developers working to develop the analytical computing primitives that handle the internal details of SQL queries and feature engineering for machine learning as power-efficiently as possible, so that we're emitting less carbon to run these computing workloads.
But then, on the user side, we have the freedom of choice regarding programming languages. So we have high-quality, natural-feeling programming interfaces and APIs, whether it's in Python or Rust or Java or Go or Julia; who knows what'll be the most popular programming language for data science in 10 years, but the hope is that it doesn't matter.
Like you can pick the language that makes sense for the kinds of applications that you're developing and for your different system integration needs. And your ability to process large amounts of data and get that data at high speed into your application is not inhibited by your choice of programming language.
Where can folks learn more about you and the work that you're doing?
We had a conference called The Data Thread earlier this year, and there's a whole bunch of talks from it that you can watch to see different applications of Arrow and the Arrow ecosystem. And so there's a great set of videos from some of our engineers and folks in the Arrow community.
We are hosting another edition of The Data Thread in February. And so we're going to have another great round of talks and content. And as time passes and the Arrow ecosystem gets bigger and more diverse, we will learn about new and exciting use cases and things people are building.
So it's one of those cases where the future is here; it's just not evenly distributed. And we're excited to get the good news out and to cross-pollinate across the early adopters and their experiences as we build toward this more modular and composable future for the ecosystem.