Data Cleaning. Evil Data Scientists. Intake Forms. TensorBase is Fast. Apache Arrow. [DSR #235]
Quick note that we recently announced our 2020 virtual user conference, Coalesce. If you’re a dbt user (or are dbt-curious!), we’d love to have you join us in December! And, if you’re interested in speaking please check out the call for speakers.
- Tristan
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
My first two posts here are both from the same author, Randy Au, whose newsletter I just discovered. I actually didn’t realize they were both Randy’s until after writing up both posts, which left me even more impressed. You should probably subscribe!
Data Cleaning IS Analysis, Not Grunt Work
The act of cleaning data is the act of preferentially transforming data so that your chosen analysis algorithm produces interpretable results. That is also the act of data analysis.
This is a long but wonderful post. It is deeply expressive of a core belief of mine: data analysis and data transformation/preparation/cleaning are fundamentally inseparable. Attempting to separate them is (IMO) one of the biggest problems in many data teams, and unifying them is one of the biggest opportunities.
This core belief (based on deep personal experience as a data analyst) is what motivated me to want to build dbt back in 2016.
What if you were an evil data scientist?
This is one of the most unusual posts I’ve come across in the 5+ years I’ve curated the Roundup. I’ll let the author introduce it:
I work in data, and one thing that I think everyone in data knows but never really thinks to explain to non-data people is I could make the numbers say almost anything I want them to and the only thing keeping the world safe is my sense of ethics. (twitter)
The author goes through a bunch of different scenarios of how a self-interested data scientist could take fairly small but nefarious actions to forward their own career. My take: I 100% agree that there are things in this vein that are possible and that certainly happen every day.
But what I believe is the much-more-common situation is that motivated reasoning creeps its way into the analytical process. There’s typically no moment in time when someone decides to subvert the truth in their own best interests. Instead, there are a series of small decisions that must be made about how to account for reality in a given model, and it’s not always super-clear what “correct” looks like. Assuming you’re human, you’ll have an incredibly hard time preventing your own personal interests from clouding your judgment, and the compounded effect of many seemingly-inconsequential decisions can end up being quite large. The effect is similar, and the tactics are identical, but evil intent is not required.
Why link this?
If you’re in a large org, it’s not crazy to actually watch for self-serving behaviors like these.
Even if it’s just you, it’s a great reminder of your own power and to seriously check your own motivated reasoning.
This question (from our data team's intake form) has been helpful for clarifying expectations around analysis https://t.co/zikvBSHJhL
An Intake Form for Data Requests
I very much like the idea of an “intake form” that allows the data team to create structure around how it receives requests from the rest of the org, and I think this concise post is very worth reading.
I do get the sense that the author (of whom I think highly) is operating a data-as-a-service org, which relies on customers filing requests and getting analyses back. I generally see modern data teams migrating towards data-as-a-product, where the handoff is the data model, not the data analysis. All that to say, you might want to consider how you would adapt this tactic based on how your team interacts with its customers.
TensorBase: a modern engineering effort for taming data deluges
Very interesting, though very early:
TensorBase is a modern engineering effort for building a high performance and cost-effective data warehouse in an open source culture.
Most notably, it’s being benchmarked on the NYC Taxi dataset doing an aggregation 6x faster than ClickHouse (aggregating 1.46b rows in 118ms!). Obviously, whenever you see a headline performance number that feels too good to be true, skepticism is warranted, but it’s certainly made me put TensorBase on my watch list. I’ll be following its progress closely.
Outer Join | Remote jobs in data science
Outer Join is the premier job board for remote work in data science, analytics, and engineering.
!! Super-cool. Lots of opportunities posted on there at high-quality companies. From what I can tell, this is quite new.
Apache Arrow: The Hidden Champion of Data Analytics
Arrow is used by open-source projects like Apache Parquet, Apache Spark, pandas, and many commercial or closed-source services. It provides the following functionality: a) in-memory computing, b) a standardized columnar storage format, c) an IPC and RPC framework for data exchange between processes and nodes respectively.
I’ve been following Apache Arrow for a long time (2016!) and have been / continue to be bullish on it. This post is a great intro if you’re not familiar, but it also has some excellent performance data in it that was new to me.
IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark
…heh, wow.
Thanks to our sponsors!
dbt: Your Entire Analytics Engineering Workflow
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123