Five Data Eng Projects. OpenAI's Twists and Turns. The Next Decade in AI. [DSR #219]
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
There’s been a lot of activity in the data engineering world lately, and a ton of really interesting projects and ideas have come on the scene in the past few years. This post is an introduction to (just) five that I think a data engineer who wants to stay current needs to know about.
The author is super-smart Dmitriy Ryaboy, currently VP Eng @ Zymergen. You’ll likely know about several of these projects, but the list is well-curated and Dmitriy’s writeup of each tool is insightful.
The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism.
Whoa. This is a long read, and it’s a fascinating one. I’ve been following (and linking to) OpenAI since its inception, and it’s been clear that both the internal operations and the external perception have shifted over time. I hadn’t really dug in to try to figure out what was driving that shift, but fortunately someone much more qualified did that for me.
This piece weaves a compelling narrative that includes things like the lab’s decrease in transparency, increased reliance on external funding, and changes in governance and leadership. Given the centrality of OpenAI in the ecosystem today, this is a good read if you want more context behind the stream of press releases.
…the proposal of this paper that we must refocus, working towards developing a framework for building systems that can routinely acquire, represent, and manipulate abstract knowledge, using that knowledge in the service of building, updating, and reasoning over complex, internal models of the external world.
If OpenAI tends to slant towards throwing more TPUs at the problem, Gary Marcus is on the other end of the spectrum.
In some sense what I will be counseling is a return to three concerns of classical artificial intelligence—knowledge, internal models, and reasoning—but with the hope of addressing them in new ways, with a modern palette of techniques.
While I certainly don’t possess expertise at the level of either of these parties, it seems fairly obvious to me that developing a causal model is a necessary prerequisite to more general-purpose intelligence. Judea Pearl’s work on this topic resonates. In this paper, Marcus presents a bunch of great examples where the modern state of the art fails to produce sensible outputs directly as a result of this lack of causal reasoning.
We’re excited to take the wrapping off of Materialize today.
I first covered Materialize back in December after meeting their CEO, being really impressed, and digging in further. The product just launched into Beta; this is the announcement post. I still believe this could be a very big deal.
We discovered an issue with how our primary model was making state-by-state and district-by-district forecasts. Specifically, the model was not properly calculating the demographic regressions that we use as a complement to the polls.
This article isn’t interesting because of the specific issue that FiveThirtyEight identified or how it was impacting their results. In fact the article itself is…kind of boring. So why link to it?
I think it’s fascinating that this article was published at all. This is a well-known journalistic organization publishing a correction based on the functioning of a predictive model, and going deep (in public!) about what exactly the issue was and how it was caused. I personally haven’t seen anything like this before. It’s a testament to the unique culture at FiveThirtyEight and the uniquely data-driven bent of their readers that they have a space where this type of article can find its way to the front page.
Predictive models underlie much of our understanding of the world today, and govern our lives in increasingly important ways. This type of transparency around results (and incorrect results) isn’t just a technology challenge (model explainability), it’s a cultural one. This type of transparency should be recognized and rewarded.
We took our first step toward the adoption of Apache Arrow with the release of our latest JDBC and Python clients. Fetching result sets over these clients now leverages the Arrow columnar format to avoid the overhead previously associated with serializing and deserializing Snowflake data structures which are also in columnar format.
The Snowflake product team is jumping on the Arrow train! This is super-cool; it resulted in an up-to-10x performance improvement in some benchmarking the team did. This will likely impact data science use cases that get data out of Snowflake more so than BI use cases—grabbing a full Pandas dataframe is way more data-intensive than grabbing the aggregates that power a chart.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123