Org Structure @ Stitch Fix. Data Discovery @ Spotify. Druid, ClickHouse, & Pinot. A16Z on AI Startups. [DSR #220]
This week's best data science articles
Holy shit. This is the most useful resource on the topic of organizational design in data science that I’ve ever read. The core idea is that data science should be a top-level function in the organization, have accountability for business outcomes, and then should have autonomy to pursue those outcomes. With those three things in place, data science can transition from a service organization (fulfilling requests from other internal teams) into a department that adds unique value that wouldn’t have come from simply fulfilling a Jira ticket.
I’ll leave my commentary there because I’d much rather you read the post. Really, really fantastic.
Wow. There has been so much activity in the data discovery / knowledge management / data catalog space within BigTech in the recent past. LinkedIn, Airbnb, Lyft, and WeWork have all made meaningful contributions here, and there are many different internal tools (some open source and some not) floating around.
I continue to care a lot about this because I think it’s the Next Big Problem in data. Data warehouses and data ingestion tooling are mature, data transformation in the warehouse environment is increasingly mature, and now users are beginning to create massive numbers of datasets using this new environment they’ve been given. In any organization of sufficient size, curation and discovery become an issue very quickly.
I’m looking forward to spending more time on this problem in the coming months.
This is an incredibly in-depth post describing three new-ish OLAP SQL engines that have achieved notable traction and performance for certain use cases. While none of them are poised to take over as the next generic SQL OLAP engine tomorrow, they have impressive performance characteristics.
The subject systems run queries faster than the Big Data processing systems from the SQL-on-Hadoop family: Hive, Impala, Presto and Spark, even when the latter access the data stored in columnar format, such as Parquet or Kudu. This is because ClickHouse, Druid and Pinot
- Have their own format for storing data with indexes, and tightly integrated with their query processing engines. SQL-on-Hadoop systems are generally agnostic of the data format and therefore less “intrusive” in Big Data backends.
- Have data distributed relatively “statically” between the nodes, and the distributed query execution takes advantage of this knowledge. On the flip side, ClickHouse, Druid and Pinot don’t support queries that require movements of large amounts of data between the nodes, e.g. joins between two large tables.
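To make the second point concrete, here is a minimal Python sketch of why static data placement helps. It is an illustration only, not how ClickHouse, Druid, or Pinot are implemented; the `NODES` layout, the sample rows, and the `local_aggregate` and `scatter_gather_sum` names are all hypothetical. Each node aggregates only the rows it already holds, and a coordinator merges the small partial results, so no raw rows move between nodes.

```python
from collections import Counter

# Hypothetical cluster: each node permanently holds a fixed slice of the rows.
# Every row is a (country, value) pair; the data is made up for illustration.
NODES = {
    "node-a": [("us", 3), ("de", 1), ("us", 2)],
    "node-b": [("de", 4), ("fr", 5)],
    "node-c": [("us", 1), ("fr", 2), ("fr", 1)],
}


def local_aggregate(rows):
    """Sum the value per country using only the rows stored on one node."""
    partial = Counter()
    for country, value in rows:
        partial[country] += value
    return partial


def scatter_gather_sum(nodes):
    """Collect each node's partial sums and merge them on the coordinator."""
    total = Counter()
    for rows in nodes.values():
        # Counter.update adds the partial sums to the running totals.
        total.update(local_aggregate(rows))
    return dict(total)


result = scatter_gather_sum(NODES)
```

A join between two large tables does not fit this pattern, because matching rows may sit on different nodes and would have to be shipped across the network before they could be combined.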
Very worth following.
Ok just: 🤣😂🤣😂
This author, while I think his take may overstate the point just a bit, provides a compelling and hilarious summary of A16Z’s recent post on how AI startups need to be evaluated differently than their pure-SaaS cousins. Here’s my favorite line:
Those who use the latest DL woo on the huge data sets they require will have huge compute bills unless they buy their own hardware.
Using “latest DL woo” to summarize some pretty foundational breakthroughs in computing really tickles me. The whole article is equally irreverent. Another great sentence:
The VC backed startup might be betting on their “special tool” as its moaty IP.
“Moaty” as an adjective is fantastic.
How hard work and a bit of luck got me into the field and up the ladder.
Great story. Here’s his summary:
The post ended with a reflection on three keys for success as a data scientist—based on my experience—namely: (i) continuous self-learning, (ii) get shit done, and (iii) emphatic communication.
I could not agree more.
Heh. A/B testing is hard. Etsy recently realized that an improved recommendation algorithm actually hurt overall site-wide conversion rate and GMV! The cause turned out to be that the improvement cannibalized users’ engagement with search. This (rather dense) post goes through how they established the magnitude of this effect.
For me, this is another nail in the coffin of the “let’s just throw [A/B testing tool of choice] on the site and run some experiments!” approach to experimentation. Unless you’re ready to think deeply about your experimental strategy and spend real time analyzing the results, tooling isn’t going to produce sustainable outcomes on its own.
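As a rough illustration of the kind of analysis that tooling alone will not do for you, here is a minimal Python sketch of a two-sided two-proportion z-test on site-wide conversion. It is not the method used in the Etsy post; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a is the control group, conv_b / n_b is the treatment group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the new recommendations raise clicks on the module,
# but the metric tested here is conversion across the whole site.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=450, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A negative `z` here means the treatment group converted less often site-wide, which is the kind of result a module-level metric would hide.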
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.