Getting hired. Data warehouse benchmarks & upgrades. dbt. Learning math. [DSR #108]

Tristan Handy

Oct 22, 2017

Fishtown Analytics is hiring! If you know anyone starting out their career in data, point them our way.

Enjoy!

- Tristan

What, exactly, is dbt?

The dbt community is growing quickly (109 companies as of today!), so I took some time to write up a post targeted at new users. If you’re not familiar with dbt, I hope this post will provide a good intro.

dbt is the T in ETL. It doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT.

blog.fishtownanalytics.com
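
To make the "transform after load" idea concrete, here is a minimal sketch in Python. It is not dbt itself and does not use dbt's interface: an in-memory SQLite database stands in for the warehouse, and the table names (raw_orders, customer_revenue) and sample rows are hypothetical. The point is only that the raw data is already loaded, and the transformation is a SELECT executed inside the database and materialized as a new table.

```python
import sqlite3

# Assumption: SQLite stands in for a real warehouse (Redshift, Snowflake, BigQuery).
conn = sqlite3.connect(":memory:")

# "E" and "L" have already happened: raw, untransformed rows sit in the warehouse.
# The table and its contents are hypothetical sample data.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", 1250, "completed"),
        (2, "acme", 800, "cancelled"),
        (3, "globex", 4300, "completed"),
    ],
)

# "T": the transformation is plain SQL run where the data lives,
# and its result is stored as a new table for analysts to query.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_orders
    WHERE status = 'completed'
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, revenue_dollars FROM customer_revenue ORDER BY customer"
).fetchall()
print(rows)  # [('acme', 12.5), ('globex', 43.0)]
```

No data leaves the database during the transform step, which is the practical difference between ELT and a traditional ETL pipeline that reshapes data before loading it.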

Amazon Redshift’s Hardware Upgrade Improves Query Speed by up to 5x

Redshift upgraded their hardware, resulting in massive speed increases for customers across the board.

If you’re a Redshift user, you should be upgrading to these new node types ASAP. Here’s the AWS announcement with instructions.

webflow-blog.periscopedata.com

Data Warehouse Benchmark: Amazon Redshift vs Snowflake vs Google BigQuery

This is an atypically thoughtful benchmark from Fivetran founder George Fraser. He found that the performance of each warehouse under most common scenarios is very similar:

These three warehouses all have excellent price and performance. We shouldn’t be surprised that they are similar: the basic techniques for making a fast columnar data warehouse have been well-known since the C-Store paper was published in 2005. These three data warehouses undoubtedly use the standard performance tricks: columnar storage, cost-based query planning, pipelined execution, and just-in-time compilation. We should be skeptical of any benchmark claiming that one of these warehouses is more than 2x faster than another.

This is insightful. This technology is approaching maturity, and as such, we should be less concerned with performance and more focused on user experience (query dialect, maintenance, ecosystem, etc).

blog.fivetran.com

Data Scientists in Software Teams: State of the Art and Challenges

Data scientists are becoming popular within software teams, e.g., Facebook, LinkedIn and Microsoft are creating a new career path for data scientists. In this paper, we present a large-scale survey with 793 professional data scientists at Microsoft to understand their educational background, problem topics that they work on, tool usages, and activities.

Great paper. The most interesting part to me was the nine clusters of data scientists the researchers identified. Worth thinking about where you fit in.

web.cs.ucla.edu

Learning Maths for Machine Learning and Deep Learning

The article presents two books that make Calculus and Linear Algebra accessible.

This author’s story is so common: learned math, had direct application for it, forgot it. While you can copy-paste your way through some data science without understanding the math you’re relying on, solidifying your math fundamentals is critical to taking the next step.

medium.com

Landing A Data Science Gig In New York City

An extremely deep, specific post on how the author landed a data science job in NYC, including recommendations on how you can do the same.

asharma567.github.io

Announcing AVA: A Finely Labeled Video Dataset for Human Action Understanding

Just released from Google Research:

In order to facilitate further research into human action recognition, we have released AVA…a new dataset that provides multiple action labels for each person in extended video sequences.

It is data, not algorithms, that has historically been the limiting reagent in ML progress.

research.googleblog.com

Data viz of the week

Impressive growth of wind and solar. Click through for an interactive version.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.

fishtownanalytics.com

Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.

www.stitchdata.com
