Meltano's Singer SDK. Why Choose Open? Developing Junior Talent. Idempotence. The Grind. [DSR #249]


This week's best data science articles

Ok I am really just enjoying this “just because you’re vaccinated” meme:

Vicki Boykis


Being vaccinated doesn't mean you can stop writing unit tests

11:48 AM - 9 Apr 2021

JD Long


I don't know who needs to hear this, but just because you're vaccinated doesn't mean you can do this... #python #covid

5:18 PM - 9 Apr 2021

Tristan Handy


Being vaccinated doesn't mean you can skip building staging models.

6:38 AM - 10 Apr 2021

Vicki, did you start this meme? It really tickled me :) I hope we can keep it going.

Meltano Launches v0.1.0 of the Singer Tap SDK - Meltano

Today the Meltano team is excited to announce a milestone for the Singer community: the v0.1.0 launch of our Singer Tap Software Development Kit! The SDK is a framework that makes it easier than ever to build high quality data extractors, aka taps. With the SDK, tap developers can take full advantage of the Singer spec without being an expert on it, while enabling them to focus on the code unique to the API or database they are extracting data from.

This is very cool. I’m very excited that Meltano is stepping up its support for the Singer ecosystem. Congrats to Taylor Murphy for his new leadership position on the team!
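
For anyone who hasn't looked at the Singer spec directly, here is a rough, hand-rolled sketch of what a tap emits: newline-delimited JSON messages of type SCHEMA, RECORD, and STATE. This is not the Meltano SDK's API. The `users` stream, the `run_tap` and `emit` helpers, and the bookmark layout inside the STATE value are all hypothetical, chosen only for illustration.

```python
import io
import json
import sys


def emit(message, out):
    # Singer messages are JSON objects written one per line.
    out.write(json.dumps(message) + "\n")


def run_tap(records, out=sys.stdout):
    """Emit a SCHEMA message, one RECORD per row, then a STATE message.

    `records` is a list of dicts for a hypothetical `users` stream.
    """
    emit(
        {
            "type": "SCHEMA",
            "stream": "users",
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "string"},
                },
            },
            "key_properties": ["id"],
        },
        out,
    )
    last_id = None
    for record in records:
        emit({"type": "RECORD", "stream": "users", "record": record}, out)
        last_id = record["id"]
    # The contents of a STATE value are up to the tap; this layout is an assumption.
    emit({"type": "STATE", "value": {"bookmarks": {"users": {"last_id": last_id}}}}, out)


if __name__ == "__main__":
    run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
```

Getting these messages, plus bookmarking and schema handling, right by hand is the kind of boilerplate a framework can take off a tap developer's plate.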


Snowflake: Choosing Open Wisely

I originally added this link because I felt an inherent desire to roast it. And it is roast-able…but I’ll restrain myself for a moment.

I do think that the post makes good points: data systems where the integration point is Files On Disk are constrained in ways that data systems where the integration point is an API are not. Integrating with an API can help with governance / security, transaction consistency, versioning… these are exactly the things that Snowflake fundamentally does very well. I’m very sympathetic to the systems design thinking here and believe that data engineers are often overly fond of operating in a files-on-disk world. Software engineers generally prefer to ascend layers of abstraction to achieve leverage; shouldn’t we want that too?

There are multiple arguments you could level at the post, though. Here’s the thing that bothers me most: you could achieve all of these architectural benefits while still open sourcing the core technology. And in so doing, you’d create a tremendous asset for all of humanity, something that can never be taken away: a contribution to the knowledge loop.

Open source isn’t an architectural choice. You can achieve whatever architecture you want, whatever release schedule you want (etc) however the source code is licensed. Open source doesn’t prevent you from pursuing a modern cloud managed service delivery model.

The reason you keep code proprietary is so that you have a monopoly on it and can extract rents from that monopoly. That’s fine—it’s called capitalism. (We write a lot of proprietary code too!) But let’s call a spade a spade.


Building Powerful Data Teams: On Investing in Junior Talent

Last issue I linked to a post about apprenticeships, and Claire brought Brittany Bennett’s new (and very related) post to my attention this past week (thanks!!). It’s so good!

  • Fundamentally, doing analytics / analytics engineering work doesn’t require decades of experience. It’s a great path to operating with leverage quickly, which means that (if you run a team) you should really be working on building pathways in for junior talent.

  • It feels like an open question to me what the appropriate leverage ratios are. On teams of software engineers there are accepted ratios of junior to mid to senior. What are those for data teams?

  • Success has much more to do with factors like psychological safety, team motivation, and freedom to operate. This significantly impacts how teams should be managed.

Read it…really.


Speaking of Brittany Bennett…here’s my favorite tweet of the week. What’s your response?

brittany bennett | 500+ connections


what does it mean to truly be a great data professional? what is greatness to you?

7:42 PM - 9 Apr 2021

Embrace the Grind

Sometimes, programming feels like magic: you chant some arcane incantation and a fleet of robots do your bidding. But sometimes, magic is mundane. If you’re willing to embrace the grind, you can pull off the impossible.

Wow I desperately love this post. Anyone who has built a career in data has (for the past five-ish years since data has been “hot”) been asked by people getting into the field what their “secret” was. How do I do what you’ve done? This is the best possible version of the answer I’ve always given, said far better than I’ve ever managed.



Idempotence Now Prevents Pain Later

Idempotence is the property of a piece of software that, when run one or more times, has only the effect of being run once.

Why does idempotence matter?

This is a question that data analysts on their way to becoming analytics engineers often do not grok immediately. Often you have to suffer the pain of operating a non-idempotent process in production prior to understanding just how incredibly critical this property is, and why you should refuse to design data systems that lack this property.

This post is an incredibly simple, concise answer to this question using a concrete example. It’s a great resource; share with folks on your team.
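
To make the idea concrete, here is a minimal sketch of my own (not the example from the linked post) using Python's built-in sqlite3. The `daily_orders` table and the `load_append` / `load_replace` helpers are hypothetical. The first loader blindly appends, so a retry duplicates rows; the second clears the day's rows and reinserts them in one transaction, so running it twice leaves the same result as running it once.

```python
import sqlite3

INSERT_SQL = "INSERT INTO daily_orders (day, order_id, amount) VALUES (?, ?, ?)"


def load_append(conn, day, rows):
    """Non-idempotent: every run adds the same rows again."""
    conn.executemany(INSERT_SQL, [(day, oid, amt) for oid, amt in rows])
    conn.commit()


def load_replace(conn, day, rows):
    """Idempotent: delete the day's rows, then insert, in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_orders WHERE day = ?", (day,))
        conn.executemany(INSERT_SQL, [(day, oid, amt) for oid, amt in rows])


def demo():
    """Run each loader twice (simulating a retry) and count the resulting rows."""
    rows = [(1, 10.0), (2, 25.5)]
    counts = {}
    for name, loader in (("append", load_append), ("replace", load_replace)):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE daily_orders (day TEXT, order_id INTEGER, amount REAL)"
        )
        loader(conn, "2021-04-09", rows)
        loader(conn, "2021-04-09", rows)  # the retry
        counts[name] = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
        conn.close()
    return counts


if __name__ == "__main__":
    print(demo())  # {'append': 4, 'replace': 2}
```

The append version quietly doubles the day's orders on a rerun, which is exactly the kind of pain that tends to surface in production after a failed job gets retried.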


Open sourcing Querybook, Pinterest’s collaborative big data hub

This is super-neat. A SQL notebook! Querybook has three types of cells: text, SQL, and chart. I really love seeing innovation in this space. Speaking of which…check out Hex.

Notebooks had previously been popular primarily in data science, but are slowly making their way through to all user personas on the data team. There are real advantages to conducting analysis in this way. I much prefer notebooks for exploratory analytics.


Thanks to our sponsor!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
