Data Science Roundup #47 - Text analysis, tabular design, and a massive Airflow Tutorial!
Hi! Welcome to the redesigned Data Science Roundup! I would love to hear your feedback on the new format; I’ve switched to using a product called Revue, which seems pretty awesome so far.
- Tristan
This week's best data science articles
Text analysis of Trump's tweets confirms he writes only the (angrier) Android half
This is a strong piece of data journalism and includes all of the R code necessary to replicate the results. My favorite line: “I’d rather get inside the head of this anonymous staffer, whose job is to imitate Trump’s unique cadence.” (Can you imagine having that job?) Highly recommended to aspiring data journalists—someone should publish a follow-on article in a month!
We’ve all seen poor visual design of tables: left-aligned numbers? Tons of useless formatting? There’s a lot that goes into making tabular data easy to consume, and with all the attention that goes into data viz today, the UI of tabular data often gets overlooked. No longer.
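To make the alignment point concrete, here is a minimal Python sketch (mine, not from the linked article) that right-aligns numeric columns and left-aligns text. The function names and the sample rows are hypothetical, chosen only for illustration.

```python
def render_cell(value):
    """Turn a raw value into display text (hypothetical helper)."""
    if isinstance(value, int):
        return f"{value:,}"        # thousands separators for whole numbers
    if isinstance(value, float):
        return f"{value:,.2f}"     # fixed two decimals so decimal points line up
    return str(value)


def format_table(headers, rows):
    """Render rows as plain text: numeric columns right-aligned, text left-aligned."""
    text_rows = [[render_cell(v) for v in row] for row in rows]
    columns = range(len(headers))
    # A column counts as numeric only if every value in it is a number.
    numeric = [all(isinstance(row[i], (int, float)) for row in rows) for i in columns]
    widths = [max([len(headers[i])] + [len(r[i]) for r in text_rows]) for i in columns]

    def line(cells):
        padded = [
            c.rjust(w) if is_num else c.ljust(w)
            for c, w, is_num in zip(cells, widths, numeric)
        ]
        return "  ".join(padded).rstrip()

    return "\n".join([line(headers)] + [line(r) for r in text_rows])


# Sample data is made up for the example.
sample = [("Widgets", 1200, 3.5), ("Gadgets", 85, 12.25), ("Gizmos", 43210, 0.75)]
print(format_table(["Product", "Units", "Price"], sample))
```

Right-aligning puts digits of the same magnitude in the same column, so a reader can compare values by scanning down rather than reading each number.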
Using Agile development techniques for data science projects
I’m deeply interested in how to run effective data science projects. I’ve written in the past about the workflow problem that data scientists and analysts have today, and this podcast goes deeper into the project methodology component. The guest recommends an Agile approach, focusing on minimizing the cycle time between questions and answers.
Clustering R packages based on Github Data in Google BigQuery
Still haven’t played with BigQuery? Now’s your chance. This post contains a detailed walkthrough of analyzing data about R, in R, using BigQuery to churn through the massive amounts of raw R code on GitHub. Just be careful to select from the data subset they provide or you’ll find yourself querying more than a terabyte of data and racking up charges fast :)
Building a Data Pipeline with Airflow
Holy crap. I’ve linked to Mark’s stuff before, and this article doesn’t disappoint. In it, he walks through the complete process of setting up Airflow (now an Apache project) using a simple example of grabbing foreign exchange rates from an API, storing them in Postgres, and then caching them in Redis. It’s not simple to get Airflow up and running, but this article gives you everything you need.
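If you want a feel for the shape of that pipeline before installing anything, here is a rough sketch in plain Python. It is not Airflow code and not taken from the article: an in-memory SQLite database stands in for Postgres, a dict stands in for Redis, the API call is a stub with made-up rates, and all names are hypothetical. What it does share with Airflow is the core idea of tasks run in dependency order.

```python
import sqlite3
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fetch_rates(ctx):
    # Stand-in for the API call; these rates are placeholders, not real data.
    ctx["rates"] = {"EUR": 0.9, "GBP": 0.77}


def store_rates(ctx):
    # SQLite stands in for Postgres here (an assumption of this sketch).
    db = ctx["db"]
    db.execute("CREATE TABLE IF NOT EXISTS rates (currency TEXT PRIMARY KEY, rate REAL)")
    db.executemany("INSERT OR REPLACE INTO rates VALUES (?, ?)", ctx["rates"].items())
    db.commit()


def cache_rates(ctx):
    # A plain dict stands in for Redis (also an assumption).
    rows = ctx["db"].execute("SELECT currency, rate FROM rates").fetchall()
    ctx["cache"].update(rows)


TASKS = {"fetch": fetch_rates, "store": store_rates, "cache": cache_rates}
# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {"fetch": set(), "store": {"fetch"}, "cache": {"store"}}


def run_pipeline():
    ctx = {"db": sqlite3.connect(":memory:"), "cache": {}}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx["cache"]


order, cache = run_pipeline()
print(order)
print(cache)
```

A real scheduler adds what this sketch leaves out: scheduling, retries, logging, and running tasks on separate workers.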
Getting into Data Science: A Guide for Students and Parents
There are so many posts about “getting into data science”, but most of them target mid-career folks looking to acquire new skills. This is the first guide I’ve come across that answers the question from the perspective of a student (or the parents of that student). It’s a good start, but there is a lot more thinking that needs to be done in this area: the data scientists of the future will be using these tools and mental models from a young age.
Data viz of the week
Immediate visual impact: Canada is a big fan of US oil.
Thanks to our sponsors!
Fishtown Analytics is a boutique analytics consultancy serving venture-funded startups. We partner with CEOs and senior execs to implement advanced analytics.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123