Cloud Notebooks. Uber's Queryparser. Interpretability Research. [DSR #126]

dbt turned 2 years old on Friday! My cofounder, Drew, wrote an awesome blog post on the journey so far and where we go from here. If you spend any of your time doing data transformation, check it out.

Enjoy this week’s issue!

- Tristan

❤️ Want to support us? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

The Week's Most Useful Posts

How to Datalab: Running notebooks against large datasets

Streaming your big data down to your local compute environment is slow and costly. In this episode of AI Adventures, we’ll see how to bring your notebook environment to your data!

Datalab and Sagemaker (Google’s and Amazon’s notebook products, respectively) are interesting to me for exactly this reason: it’s just not a great idea to process data using your local processor. Sure, there are plenty of times when doing this would work just fine, but if you accustom yourself to that workflow, all of your tooling will be built around it. When you find yourself needing to process a larger dataset, you’ll all of a sudden have to step into a different tool set.

Instead, default to processing all of your data in the cloud. This doesn’t necessarily mean you need to use a cloud provider’s notebook product! You can absolutely set yourself up a local notebook that uses cloud resources to perform computation, but this article presents a wonderful walkthrough of how Google Cloud Datalab makes the workflow really seamless and easy.

If you’re still using local CPU cycles on numpy, give this a shot. It’s easy and it scales really well.


Visualizing Outliers

I won’t apologize for linking to almost everything Nathan Yau posts—FlowingData continues to be, IMO, the best data viz blog on the internet.

In this post he does an awesome walkthrough of visual techniques to highlight outliers in a data set. Succinct, great examples.


Facebook: Why diversity matters in AI research

My advice for women who want to get into AI research? Be passionate and be curious about your passions. Seek opportunities to collaborate with others. Start with one course and see where it takes you.

Such an important topic.


The Building Blocks of Interpretability

Interpretability techniques are normally studied in isolation. We explore the powerful interfaces that arise when you combine them — and the rich structure of this combinatorial space.

Wow, this is a massive new study from Google and CMU on what is potentially the hottest topic in AI today. The authors put a tremendous amount of work into interactive content that lets you explore, and attempt to explain, the behavior of a network. Very cool work, although it’s more of an exploration than an announcement of specific findings. Here’s the conclusion:

There is a rich design space for interacting with enumerative algorithms, and we believe an equally rich space exists for interacting with neural networks. We have a lot of work left ahead of us to build powerful and trustworthy interfaces for interpretability. But, if we succeed, interpretability promises to be a powerful tool in enabling meaningful human oversight and in building fair, safe, and aligned AI systems.

Long, but even if you don’t read the whole thing you should still click through and play with the interactives.


Why humans learn faster than AI—for now

So what makes humans so much better? It turns out that we do not approach this game with a blank slate. A human will see that he or she has control over the robot, and that the robot should avoid fire, climb ladders, jump over gaps, and avoid a frowning enemy to reach the princess. All this is thanks to prior knowledge that certain objects are good while others (with frowns or flames) are bad, that platforms support objects while ladders can be climbed, that things that look the same behave in the same way, that gravity pulls objects down, and even what “objects” are: things that are separate from other things and have different properties.

The insight may not be incredibly surprising, but the experiment these researchers ran was really fascinating, and you can replicate it yourself on the website. It turns out that you are also not going to be good at beating this very simple game once your priors are stripped away.


Uber: Queryparser, an Open Source Tool for Parsing and Analyzing SQL

In early 2015, Uber Engineering migrated its business entities from integer identifiers to UUID identifiers as part of an initiative towards using multiple active data centers.

To achieve this, our Data Warehouse team was tasked with identifying every foreign-key relationship between every table in the data warehouse to backfill all the ID columns with corresponding UUIDs.¹ Given the decentralized ownership of our tables, this was not a simple endeavor. The most promising solution was to crowdsource the information by scraping all the SQL queries submitted to the warehouse and observing which columns were joined together. To serve this need, we built and open sourced Queryparser, our tool for parsing and analyzing SQL queries.

Fascinating. So many users, so many tables, so many queries that it was infeasible to do standard data discovery. Very cool technology solution to a common problem. Wonder what other cool applications there are for this tool…
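To make the idea concrete, here is a toy sketch of "scrape the queries and see which columns get joined." To be clear, this is not how Queryparser works: it is a regex-based illustration, and the function names, table names, and sample queries are all made up for the example. A real tool needs an actual SQL parser to handle subqueries, schema-qualified names, and dialect quirks.

```python
import re
from collections import Counter

# Keywords that may follow a table name and must not be mistaken for an alias.
_KEYWORDS = r"(?:on|join|inner|left|right|full|cross|where|group|order|limit|using)\b"

# "FROM trips t" / "JOIN riders AS r": captures the table and an optional alias.
TABLE_ALIAS = re.compile(
    r"\b(?:from|join)\s+(\w+)(?:\s+(?:as\s+)?(?!" + _KEYWORDS + r")(\w+))?",
    re.IGNORECASE,
)

# "t.rider_id = r.id": an equality between two qualified columns.
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def joined_columns(sql):
    """Return the (table, column) pairs equated in one query (hypothetical helper)."""
    aliases = {}
    for table, alias in TABLE_ALIAS.findall(sql):
        aliases[table.lower()] = table.lower()
        if alias:
            aliases[alias.lower()] = table.lower()

    pairs = []
    for lt, lc, rt, rc in JOIN_PREDICATE.findall(sql):
        left = (aliases.get(lt.lower(), lt.lower()), lc.lower())
        right = (aliases.get(rt.lower(), rt.lower()), rc.lower())
        # Sort so that a = b and b = a count as the same relationship.
        pairs.append(tuple(sorted([left, right])))
    return pairs


def count_join_pairs(queries):
    """Count in how many queries each column pair is joined (hypothetical helper)."""
    counts = Counter()
    for query in queries:
        counts.update(set(joined_columns(query)))
    return counts


# Made-up query log for illustration only.
sample_queries = [
    "SELECT * FROM trips t JOIN riders r ON t.rider_id = r.id",
    "select count(*) from trips join riders on trips.rider_id = riders.id "
    "where riders.city_id = 5",
    "SELECT * FROM riders AS r JOIN cities c ON r.city_id = c.id",
]

for pair, n in count_join_pairs(sample_queries).most_common():
    print(n, pair)
```

Column pairs that show up together across many independent queries are good candidates for a foreign-key relationship, which is roughly the crowdsourcing signal the Uber post describes.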


Big, fast, easy data with KSQL

Care about the future of real-time analytics? This is a must-read.

Months ago I linked to the launch post for KSQL, Kafka’s real-time, stream-analyzing SQL dialect. I think the technology has the potential to do meaningfully new things. This post goes much deeper on use cases and example code.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123