Creating a Data Roadmap. ML Engineering. Plotly. Reliable Events. Airflow. [DSR #169]

❤️ Want to support this project? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

This week's best data science articles

Creating a Data Road Map

As the new year rolls around, many Data leaders are thinking about (or have already created) 2019 road maps for their team and function. Since Data often works cross-functionally with other teams, it's key that you consider other teams' priorities and objectives when developing your road map. Below is a blueprint you can use to get started.

Annual planning is something that I suck at. When leading a company or team, I like to preserve optionality and allow myself to make planning decisions as information arises. But in many organizational contexts, the ability to set an annual roadmap is what will get you resources and the collaboration required from other teams. So…kind of important.


The Rise of ML Engineering

I fundamentally believe that 2019 will be the year of ML Engineering.

This article is spot on. It also pokes at one of my favorite frustrations in curating this newsletter:

nobody needs yet another library or tutorial to build a 3-layer neural network on MNIST

Yes! Too much attention goes to model-building, which ends up being only about 10-20% of the time spent in the actual job. Instead, it's time for the industry to focus on the things that will return business value: tooling, infrastructure, and best practices for creating ML pipelines.


What They Don’t Teach You in Machine Learning Courses

There is much more that goes into delivering an impactful data science project than just a working model.

Similar topic: model-building isn't the hard part; the hard part is operationalizing data science within an organization. Software engineering projects ran poorly for decades before the approaches and tooling emerged that allowed them to succeed more often than not.


The Next Level of Data Visualization in Python

Now that it’s 2019, it is time to upgrade your Python plotting library for better efficiency, functionality, and aesthetics in your data science visualizations.

Will Koehrsen has a great post on why it’s time to throw away matplotlib in favor of Plotly + cufflinks. I hadn’t heard of cufflinks before; it’s a good find. If you’re still plotting in matplotlib, you really should scan through the examples in this tutorial and consider making the switch.


Understanding Your Users with Consistent and Reliable Event Data

One of the most common challenges companies have with event tracking is ensuring that they’re pushing consistent and reliable data through their event pipeline. Tools like Segment make it very easy to start pushing data, but that simplicity comes with a tradeoff: there are no guardrails to ensure the data you’re pushing conforms to any set of expectations.

The team at Airtasker worked to build their own tooling for enforcing a common set of event definitions and has incorporated it into their development workflow. In this post they explain their solution. It’s a quick read, and highly worthwhile. If you’ve experienced similar issues (and most of us have), it’s worth thinking about incorporating a similar approach at your org.
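To make the guardrail idea concrete, here's a minimal sketch of schema-checking events before they're pushed downstream. The schema format and event names are hypothetical illustrations, not Airtasker's actual tooling:

```python
# Hypothetical event definitions shared across the codebase. Each event name
# maps to the fields it must carry and their expected types.
EVENT_SCHEMAS = {
    "task_posted": {"user_id": int, "category": str},
    "task_completed": {"user_id": int, "task_id": int},
}

def validate_event(name, properties):
    """Return a list of problems; an empty list means the event conforms."""
    schema = EVENT_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown event: {name}"]
    problems = []
    for field, expected_type in schema.items():
        if field not in properties:
            problems.append(f"missing field: {field}")
        elif not isinstance(properties[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

# A conforming event passes; a malformed one is caught before it's sent.
print(validate_event("task_posted", {"user_id": 42, "category": "cleaning"}))  # []
print(validate_event("task_posted", {"user_id": "42"}))  # wrong type + missing field
```

Running a check like this in CI (or in the tracking wrapper itself) is what turns "anything goes" tools like Segment into pipelines with enforced expectations.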


The Apache Software Foundation Announces Apache® Airflow™ as a Top-Level Project

Airflow was in incubation until now; it’s just been upgraded to an Apache TLP (top-level project). This is a mark of the maturity of its consensus-based governance processes.

It’s unlikely that this has any immediate impact for you, but it’s worth noting that one of the main tools in the data engineering ecosystem is now a mature project in the world’s leading open source software foundation. Good news.


Lazydata: Scalable Data Dependencies for Python Projects

Very cool project!

Problem: Keeping all data files in git (e.g. via git-lfs) results in a bloated repository copy that takes ages to pull. Keeping code and data out of sync is a disaster waiting to happen.

Solution: lazydata only stores references to data files in git, and syncs data files on-demand when they are needed.

Why: The semantics of code and data are different: code needs to be versioned so it can be merged, while data just needs to be kept in sync. lazydata achieves exactly this in a minimal way.
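The core idea can be sketched in a few lines (this is a conceptual illustration, not lazydata's actual API): commit a tiny hash reference to git instead of the file itself, and fetch the real file only when it's needed.

```python
import hashlib
import os

def make_reference(path):
    """Hash a data file so only this small record needs to live in git."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"path": path, "sha256": digest}

def is_in_sync(path, reference):
    """True if the local file matches the committed reference."""
    if not os.path.exists(path):
        return False  # this is where an on-demand download would kick in
    return make_reference(path)["sha256"] == reference["sha256"]

# Usage: record a reference for a data file, then check sync status later.
with open("model.bin", "wb") as f:
    f.write(b"weights")
ref = make_reference("model.bin")
print(is_in_sync("model.bin", ref))  # True
```

The repository stays small (it holds only references like `ref`), while the hash check catches the code-and-data-out-of-sync disaster the project description warns about.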


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.


If you don't want these updates anymore, please unsubscribe here.

If you were forwarded this newsletter and you like it, you can subscribe here.

Powered by Revue

915 Spring Garden St., Suite 500, Philadelphia, PA 19123