Kafka @ the NYT. Data Quality and Outlier Detection. Security in the era of Equifax. [DSR #104]

Some foundational stuff this week. I’d love to talk to anyone who is thinking about log-based architectures in BI—just respond to this email.

Enjoy :)

- Tristan

❤️ Want to support us? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

Required Reading

Publishing with Apache Kafka at The New York Times

The biggest architectural shift happening in modern distributed systems is moving to log-based communication between large numbers of microservices.

If the above sentence made your eyes glaze over, you are not alone. Data scientists and analysts typically don’t spend a ton of time thinking about the technical implementation of how their data arrives at their doorstep, but this adoption of log-based communication will have very significant impacts on your work in coming years. You should get ahead of it.

Imagine that, instead of querying a customers table, you query a customers event stream that contains every update ever made to a customer. You’ll find that some queries get more complicated: if you want the current state of a customer, you’ll have to reconstruct it yourself. But you’ll also have more power: you’ll be able to reconstruct that state at any given point in time. You’ll also likely have access to data much closer to real time.
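To make that concrete, here’s a minimal sketch in Python of reconstructing state from an event stream. The event shape (`customer_id`, `at`, `changes`), the sample data, and the `customer_state` helper are all assumptions made up for illustration; they aren’t taken from the NYT article or from Kafka’s API.

```python
from datetime import datetime

# Hypothetical event stream: each event is a partial update to one customer.
events = [
    {"customer_id": 1, "at": datetime(2017, 1, 5), "changes": {"name": "Ada", "plan": "free"}},
    {"customer_id": 1, "at": datetime(2017, 3, 1), "changes": {"plan": "pro"}},
    {"customer_id": 2, "at": datetime(2017, 4, 10), "changes": {"name": "Grace", "plan": "free"}},
    {"customer_id": 1, "at": datetime(2017, 6, 20), "changes": {"plan": "enterprise"}},
]


def customer_state(events, customer_id, as_of=None):
    """Fold one customer's events, oldest first, into a single state dict.

    If as_of is given, events after that timestamp are ignored, which
    yields the customer's state at that point in time.
    """
    state = {}
    for event in sorted(events, key=lambda e: e["at"]):
        if event["customer_id"] != customer_id:
            continue
        if as_of is not None and event["at"] > as_of:
            break  # events are sorted, so everything after this is too late
        state.update(event["changes"])
    return state


# Current state: every event applied.
current = customer_state(events, 1)

# State as of April 1st: the June upgrade has not happened yet.
in_april = customer_state(events, 1, as_of=datetime(2017, 4, 1))
```

The same fold answers both questions; the only difference is where you stop reading the log.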

This article is an excellent intro to log-based architectures, as seen through the experience that the New York Times had in their migration. If this isn’t a topic you’re familiar with, take the time to digest it.


Focus on: Messy Data

Data Quality in the era of AI

Modern systems must be aware of the quality of the incoming data and capable of identifying, reporting and handling erroneous cases accordingly.

Modern data organizations have dramatically increased their ability to ingest and process new data streams, but they frequently have no mechanism to validate that data either at the outset of a project or, just as importantly, on an ongoing basis. This article walks through the why and how of creating high-quality data.

We think a ton about this problem at Fishtown Analytics, and it’s exactly what we’ve built dbt data tests to help with. We’re seeing more companies start to take this issue seriously, but it’s still rare.
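For a rough sense of what ongoing validation can look like, here’s a sketch in plain Python. It is loosely modeled on the kinds of checks dbt ships with (unique, not null, accepted values), but it is not dbt’s implementation: the function names, the sample rows, and the `failures` report are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows standing in for a freshly loaded orders table.
rows = [
    {"order_id": 1, "status": "shipped", "amount": 20.0},
    {"order_id": 2, "status": "pending", "amount": None},
    {"order_id": 2, "status": "refunded", "amount": 15.0},
    {"order_id": 3, "status": "shiped", "amount": 9.5},
]


def not_null(rows, column):
    """Return the rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return the values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def accepted_values(rows, column, allowed):
    """Return the rows whose column value is outside the allowed set."""
    return [r for r in rows if r[column] not in allowed]


# Each check returns the offending records; an empty list means it passed.
failures = {
    "amount_not_null": not_null(rows, "amount"),
    "order_id_unique": unique(rows, "order_id"),
    "status_accepted": accepted_values(
        rows, "status", {"pending", "shipped", "refunded"}
    ),
}
failed = {name: bad for name, bad in failures.items() if bad}
```

Run on every load rather than once at the start of a project, checks like these are what turn "we validated the data" into an ongoing guarantee.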



A Brief Overview of Outlier Detection Techniques

Even once your data obeys all of your technical rules, it’ll still have outliers. The post gives my favorite definition yet of what exactly an outlier is:

[An outlier is an] observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism.

The reason I like this definition is that it focuses on the data creation process, not the data point itself. Outliers are not data points that you don’t like (ones that disconfirm your hypothesis); they’re data points whose deviation suggests they were captured erroneously. We see this frequently in web analytics: because the data collection environment is so heterogeneous, some outliers show a time on site of over a decade.

If you’re going to provide high-quality analytics to your end consumers on an ongoing basis, outlier detection and correction need to be a part of your pipeline.
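As one illustration of a pipeline step like this, here’s a sketch that flags suspicious time-on-site values using a modified z-score built on the median absolute deviation. This is my choice of technique for the example, not necessarily one the linked post recommends; the 0.6745 constant and the 3.5 cutoff are a common rule of thumb, and the session data and `find_outliers` helper are made up.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD)
    instead of the mean and standard deviation, so a single extreme
    value cannot hide itself by inflating the spread.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values are identical; the score is undefined,
        # so this sketch flags nothing rather than dividing by zero.
        return []
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical time-on-site values in seconds. The last one is
# over a decade long, which no real session can be.
session_seconds = [30, 45, 60, 75, 90, 120, 150, 400_000_000]

outliers = find_outliers(session_seconds)
```

Flagging is only half the job: whether you drop, cap, or quarantine the flagged rows is a decision that depends on what produced them.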


Everything Else

Data Security for Data Scientists

Another day, another breach. The Equifax credit data breach is just the latest in a series of stories about major organizations’ data being exposed.

Stellar read. Goes through many concrete recommendations for how to improve the security of your workflow.


How to Find the Best Data Jobs

Today, we’re announcing a new resource for the data science community to help wistful data talent and managers pining for their next superstar. Mode’s new Data Jobs Board lists great jobs for analysts, data scientists and data engineers at exciting companies doing cutting-edge work.


The Ten Fallacies of Data Science

There exists a hidden gap between the more idealized view of the world given to data-science students and recent hires, and the issues they often face getting to grips with real-world data science problems in industry.


Query the planet: Geospatial big data analytics at Uber

In this article, we discuss our engineering effort to optimize geospatial queries in Presto.


Introducing: Unity Machine Learning Agents

It is critical to our mission to enable machine learning researchers with the most powerful training scenarios, and for us to give back to the gaming community by enabling them to utilize the latest machine learning technologies. As the first step in this endeavor, we are excited to introduce Unity Machine Learning Agents.


Data viz of the week

Title: "Modern Slavery is Disturbingly Common"


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.



915 Spring Garden St., Suite 500, Philadelphia, PA 19123