The intersection of UI, exploratory data analysis, and SQL (w/ Hamilton Ulmer)
The technologies driving data visualization today
Hamilton Ulmer is working at the intersection of UI, exploratory data analysis, and SQL at MotherDuck, and he's built a long career in EDA. Hamilton and Tristan dive deep into the history of exploratory data analysis.
Even if you spend most of your time below the frontend layer of the analytics stack, it’s important to understand trends in both the practice of data visualization and the technologies that underlie that practice.
All of it deeply shapes the space that we operate in.
This is Season 6 of The Analytics Engineering Podcast. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Key takeaways from this episode
If you're like other people who started their data science careers in the 2010s, you probably ended up doing all the different parts of the analysis pipeline. Over time, it seems like your career has become more focused on the data visualization part. Is that fair to say?
When I joined Mozilla, we had what was then considered pretty big data. It was also very complicated, messy stuff that the browser was generating. We were using a lot of that telemetry to basically calculate the numbers for the business as well.
Data visualization for me has always been a means to an end, and that means understanding the data that powers the business and the product. And so I think those interests are intrinsically connected.
Many data visualization and exploratory data analysis projects focus on the end of the analysis process, the presentation layer: the things that you might put in a slide deck. But we were having to work with extremely messy, highly nested data generated by the web browser, which possibly runs in a semi-degraded state. Not all the data that it sends is good data.
That first mile is way more interesting to me. That first-mile problem, not the last-mile problem, is maybe the genesis of my interest in data tools in general. This is a place where exploratory data analysis is especially valuable. But I think the way that people think about EDA is more in the middle or toward the end. That first part is really critical.
Exploratory data analysis is used to answer the question: can I trust this data? What do I need to do to it to get it to a state where I can trust it, so I know what I can expect from it?
Absolutely. I think this is the fundamental trade-off in a lot of analytics tools. The tools and libraries that we use are often made by technical people, oftentimes people with a research background who have to do data cleaning. But the financial value comes from the end-user experience of dashboards.
And so there's always been this trade-off with EDA tools between these two things. Take Polaris, a research project out of Stanford in the early 2000s. Those researchers wanted to figure out how to make EDA interactive and exploratory.
This was during a time when computers were just starting to get better at this kind of work and more and more data was being generated. The Polaris paper was groundbreaking for analytics, and those researchers went on to found Tableau.
And the killer use case wasn't the first mile. It was the last, because the economic buyer of the tool cared a lot about understanding what was going on with their business. So a lot of the focus for EDA tools has been on BI rather than the thing that really vexes data practitioners: cleaning the data. Everyone likes the joke that 80 percent of the job is cleaning up the data.
So if you look at a model of data work, it's largely about trying to correct problems with data collection as early as possible to figure out what you can possibly say about the business down the line. That's a high-value thing, but it's hard to sell that to people. And that's why I think BI tools focus on the presentation layer.
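To make that first mile concrete, here is a minimal sketch of the kind of "can I trust this data?" checks described above, written in Python with pandas; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw telemetry export; any messy CSV works the same way.
df = pd.read_csv("raw_telemetry.csv")

# Basic trust checks before any real analysis.
print(df.dtypes)                            # did columns parse as the expected types?
print(df.isna().mean().sort_values())       # share of missing values per column
print(df.duplicated().sum())                # count of exact duplicate rows
print(df["session_length_sec"].describe())  # impossible values, e.g. negatives?

# Example cleanup decision: drop rows the client clearly sent in a broken state.
clean = df[df["session_length_sec"].between(0, 60 * 60 * 24)]
print(len(df) - len(clean), "rows dropped")
```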
Can you talk a little bit about how over the last 20 or so years the data visualization industry has evolved? Are we operating at a higher level of abstraction than we used to be?
Imagine it's the 1970s and you're a statistician doing research. You find that you can put your tables of data into the computer to combine and display them quickly. Before, you had to do it by hand. If you've ever read anything by Edward Tufte about historical data viz, you know people worked with pencil and paper and had to figure out where to put the points to show the aggregation, right? Really time-intensive.
John Tukey was this really famous statistician. He's the person behind exploratory data analysis. He said something that isn't controversial, which is that you should look at the data before doing a statistical analysis. This wasn't easy to do without computers.
It was part of a movement to bring computation to statistics, one that eventually became central to the field.
And then in the early 90s, spreadsheet software, Excel in particular, became the most important data tool ever created. Spreadsheets began adding charts, and that was a great early form of data visualization, probably the most popular form, right?
You have this other cross-current here where the browser became the medium for interactive data visualization, rather than some desktop app that a bunch of people had to write C code for. That was probably the biggest expansion of the labor market in data visualization.
And that's where you see D3 becoming one of the most important entry points for those people to become essentially front-end engineers.
D3’s premise was that building high-level primitives for data visualization wasn't possible without getting a mid-level connection to the browser APIs. D3 is still widely used. It's not used in the same way as it was in the past, but it's still very widely used. I use it every day for all of the data visualization tooling I build, just because it has so many helpful things that I don't want to build myself at this point.
So the web became important for the medium of data visualization. And really for analytics tools as well. Most of the BI tools moved to being web-based in some way. And so then you had this other cross-current in the last 20 years, which is the tech boom, internet companies, things like that.
You had a huge influx of technical PhDs in the industry. You had all of these people bringing the analysis tools that they used in their research. The scientific Python computing stack was something that you might've toyed with in grad school and used for your research. And now you're bringing it to your job because it's a tool you know. R is another example of this. This is sort of where I enter the story, as a statistician with R.
My guess is that probably everybody listening here is familiar with the name DuckDB, but you should probably do a little bit of an overview.
DuckDB is an in-process database. The closest analogy would be something like SQLite, which I don't think is a fair comparison because DuckDB does so much more. SQLite is this tiny transactional database that's the most important piece of software ever made. I don't think it's a stretch to say that our lives are powered by thousands of SQLite databases on all of our devices.
Our browsers all have individual SQLite databases powering them. It's an incredible thing when you can just have your database as a file somewhere, and whatever process can query it directly rather than the database running on its own independent server. And so DuckDB is like that, but for analytical queries rather than transactional ones, the kinds of queries that your audience is quite familiar with.
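To make the in-process idea concrete, here is a minimal sketch using DuckDB's Python API; the Parquet file name is hypothetical. There is no separate server to run: the query executes inside the Python process, directly against the file.

```python
import duckdb

# No server and no connection string: the engine runs inside this Python
# process and can query a file on disk directly, without loading it into
# a database first.
top_users = duckdb.sql("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'        -- hypothetical local file
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df()

print(top_users)
```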
The project was started in the late 2010s by Hannes Muehleisen and Mark Raasveldt, two researchers at a research institution in Amsterdam called CWI, which was previously probably best known as the place where Python was invented.
The inspiration for DuckDB was the workloads that PhD data scientists were bringing to industry. Crunching through CSV files and Parquet files has always been a bit of a challenge. Mark and Hannes realized they could build a database to make that kind of data analysis easier. And so that was sort of the genesis of DuckDB.
But as they began working on it and applying some of the most cutting-edge ideas in analytical databases to the project, they began to realize it could do more and more stuff.
And that's why I joined MotherDuck: I'm part of this movement of people who care about data visualization, have discovered DuckDB, and really can't look back.
Think about that divide we were talking about, the front end being all JavaScript and the back end being who knows what. If you can make that back end DuckDB, you can do incredible things with it and really shape what comes next.
The queries you need to run on the front end can effortlessly update your UIs, and that reduces the latency of interactions for really complex things.
We talked about the difference between the BI world and scientific computing world before. One of the interesting differences there is that scientific computing does not typically speak SQL. Is DuckDB capable of doing some of these scientific computing functions or does it not need to? What's changed?
In 2015 there were people who would scoff at the idea of writing SQL; maybe they later adopted BigQuery and discovered that writing SQL actually wasn't a big deal.
There was a period, especially in the 2010s, when people weren't sure if SQL was going to survive. The environment was different then, but what ended up happening was that more stuff moved to SQL rather than less.
I was at Mozilla when we adopted BigQuery. It was like a breath of fresh air. Being able to write SQL, a query that I could analytically verify myself, and just have it do the thing was really special.
Industry has moved more toward SQL. That said, DataFrames are ergonomically quite amazing; the R ecosystem with dplyr for data transformation is a really elegant, nice way to work with data. With databases like DuckDB, you can express things in dplyr and it will write the DuckDB query for you, which is really nice.
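For Python users, DuckDB's relational API offers a roughly analogous chainable, verb-by-verb style and builds the DuckDB query for you; here is a minimal sketch, with a hypothetical CSV file and columns.

```python
import duckdb

# Chain verbs much like a dplyr pipeline; DuckDB builds and optimizes the
# underlying query, and nothing executes until the result is materialized.
orders = duckdb.read_csv("orders.csv")   # hypothetical file

top_customers = (
    orders
    .filter("status = 'complete'")
    .aggregate("customer_id, sum(amount) AS total_spent", "customer_id")
    .order("total_spent DESC")
    .limit(5)
)

print(top_customers.df())
```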
I really desperately love dplyr. It is maybe my favorite of all R packages. I feel like sometimes there's this religious war between SQL people and non-SQL people.
What's interesting about this too is that a lot of these analytical SQL dialects are starting to address some of the same types of problems. I mentioned BigQuery before: writing BigQuery SQL and working with arrays made it easier to do some of those hard data manipulation things in SQL.
And looking at DuckDB, something that Mark and Hannes care a lot about is just the ergonomics of writing SQL. So they've made their own extensions to SQL to make it easier to do the sorts of things that you would see in dplyr. SQL is complicated because it's kind of an ancient programming language that has stood the test of time. It's not Latin. It's something else.
The spec for SQL is like thousands of pages and there's not a single database on the planet that actually implements all of the spec. It's one of these bizarre situations we're in where the dialects differ in some critical ways.
If you know one dialect of German, you can kind of speak with people who use another one. It's similar with SQL. This lack of control over the language itself does two things. One, it frustrates everyone, because it's much easier to go to R or Python, where everything is well defined and the grammar is small. There aren't thousands and thousands of keywords to implement in those languages.
But also, it's an area of innovation and opportunity for other database engines, and I think the DuckDB creators have seen that. And so things like list comprehensions, which are so useful in Python, you can actually do in DuckDB SQL. Function chaining you can also do, in both the DuckDB Python API and DuckDB SQL.
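Here is a small sketch of what those extensions look like, run through DuckDB's Python API; the values are placeholders, and the exact syntax may differ across DuckDB versions.

```python
import duckdb

# List comprehension syntax inside SQL, borrowed from Python.
duckdb.sql("""
    SELECT [x * 2 FOR x IN [1, 2, 3] IF x <> 2] AS doubled
""").show()

# Function chaining with the dot operator, instead of nesting function calls.
duckdb.sql("""
    SELECT ('hello world').upper().string_split(' ') AS words
""").show()
```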
It's an interesting area of innovation, and I think it also upsets a number of people as well.
There are people who think SQL needs to be kind of this boring thing that everyone knows, and that we should stop innovating. I'm much more of an experimentalist. Given that there is no actual SQL standard beyond the bare minimum that everyone implements, why not innovate?
Can you talk a little bit about how it's possible that DuckDB runs locally? Why can't I run Snowflake locally?
To understand this, you have to understand a bit of the historical trends in the tech industry around the concept of big data. So 15 years ago, tech companies began generating large amounts of data as users used their applications.
Facebook's a great example. The amount of data that needed to be processed in order for you to understand it was much larger than what one computer on its own could handle. You could go buy a desktop tower, put it on your desk, get all the data on it and attempt to do something with it, but it was going to be extremely slow, right?
The disk space wasn't big enough. You didn't have enough memory to do interesting things. Computers weren't good enough. Computers have recently become good enough, I think. And that's the TLDR.
But because they weren't good enough at that point, we began to see great research out of Google around MapReduce: splitting computation across a bunch of rented, cheap, low-powered machines in the cloud. Fanning the computation out to machines that weren't your own was the way you could actually do anything meaningful.
And that idea took hold in industry. As more companies became data-driven, this was the way to do it. And this is actually part of the core thesis of MotherDuck. It's 2024 now, and I'm using a modern MacBook Pro. When I switched from SQLite to DuckDB, querying a really large, 10-gigabyte dataset was instantaneous.
The first time that I ran a query in DuckDB, I thought, is it broken? I don't think this is possible. MotherDuck's core thesis is that our computers have actually gotten fast enough to handle workloads that back in 2011 we could not have handled on a desktop tower. And that's really magical.
Computers have gotten good enough to do what was previously considered very big data or big data. The definition of big data has shifted over time to be increasingly larger. And so the things that were big data when Snowflake and BigQuery were created might not necessarily be big data today.
They might be something that you could process on one really beefy computer that you rent from AWS. And I think that's why DuckDB is becoming really popular, by the way: computers have caught up.
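As a sketch of that "one beefy computer" pattern: DuckDB can scan a whole directory of Parquet files with a glob and stream the aggregation across cores, so the full dataset never has to sit in memory at once. The paths and columns here are hypothetical.

```python
import duckdb

# Scan every Parquet file under a directory and aggregate in one pass.
# DuckDB parallelizes across cores and streams the data, so the full
# dataset does not need to fit in memory at once.
daily = duckdb.sql("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM 'telemetry/*.parquet'           -- hypothetical directory of files
    GROUP BY day
    ORDER BY day
""").df()

print(daily.head())
```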
There is an architecture, MPP, the massively parallel processing architecture, that initially became popular in the 2000s. You could reasonably say that Snowflake and BigQuery are also versions of an MPP architecture, certainly evolved from earlier versions, but there's probably some overhead involved in that. And now a single machine can all of a sudden be faster, and the work can even run on your local machine. Am I getting that right?
I think that's the most concise story for DuckDB and why it's so successful. The fact that it can run in a process anywhere means it's running everywhere. Companies are adopting DuckDB as point solutions all over the place.
And you don't hear about it all the time, but it is absolutely happening. So it may not be the total solution for their data today, but it's oftentimes the best solution for individual pieces. I think I saw somebody joke that DuckDB may single-handedly prevent global warming caused by the JVM.
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.