Meltano's Singer SDK. Why Choose Open? Developing Junior Talent. Idempotence. The Grind. [DSR #249]
This week's best data science articles
Ok I am really just enjoying this “just because you’re vaccinated” meme:
Being vaccinated doesn't mean you can stop writing unit tests.
Being vaccinated doesn't mean you can skip building staging models.
Vicki, did you start this meme? It really tickled me :) I hope we can keep it going.
Today the Meltano team is excited to announce a milestone for the Singer community: the v0.1.0 launch of our Singer Tap Software Development Kit! The SDK is a framework that makes it easier than ever to build high quality data extractors, aka taps. With the SDK, tap developers can take full advantage of the Singer spec without being an expert on it, while enabling them to focus on the code unique to the API or database they are extracting data from.
This is very cool. I’m very excited that Meltano is stepping up its support for the Singer ecosystem. Congrats to Taylor Murphy for his new leadership position on the team!
I originally added this link because I felt an inherent desire to roast it. And it is roast-able…but I’ll restrain myself for a moment.
I do think that the post makes good points: data systems where the integration point is Files On Disk are constrained in ways that data systems where the integration point is an API are not. Integrating with an API can help with governance / security, transaction consistency, versioning… these are exactly the things that Snowflake fundamentally does very well. I’m very sympathetic to the systems design thinking here and believe that data engineers are often overly fond of operating in a files-on-disk world. Software engineers generally prefer to ascend layers of abstraction to achieve leverage; shouldn’t we want that too?
There are multiple arguments you could level at the post, though. Here’s the thing that bothers me most: you could achieve all of these architectural benefits while still open sourcing the core technology. And in so doing, you’d create a tremendous asset for all of humanity, something that can never be taken away: a contribution to the knowledge loop.
Open source isn’t an architectural choice. You can achieve whatever architecture you want, whatever release schedule you want (etc) however the source code is licensed. Open source doesn’t prevent you from pursuing a modern cloud managed service delivery model.
The reason you keep code proprietary is so that you have a monopoly on it and can extract rents from that monopoly. That’s fine—it’s called capitalism. (We write a lot of proprietary code too!) But let’s call a spade a spade.
Last issue I linked to a post about apprenticeships, and Claire brought Brittany Bennett’s new (and very related) post to my attention this past week (thanks!!). It’s so good!
Fundamentally, doing analytics / analytics engineering work doesn’t require decades of experience. It’s a great path to operating with leverage quickly, which means that (if you run a team) you should really be working on building pathways in for junior talent.
It feels like an open question to me what the appropriate leverage ratios are. On teams of software engineers there are accepted ratios of junior to mid to senior. What are those for data teams?
Success has much more to do with factors like psychological safety, team motivation, and freedom to operate. This significantly impacts how teams should be managed.
Speaking of Brittany Bennett…here’s my favorite tweet of the week. What’s your response?
what does it mean to truly be a great data professional? what is greatness to you?
Sometimes, programming feels like magic: you chant some arcane incantation and a fleet of robots do your bidding. But sometimes, magic is mundane. If you’re willing to embrace the grind, you can pull off the impossible.
Wow I desperately love this post. Anyone who has built a career in data has (for the past five-ish years since data has been “hot”) been asked by people getting into the field what their “secret” was. How do I do what you’ve done? This is the best possible version of the answer I’ve always given, said far better than I’ve ever managed.
Idempotence is the property of a piece of software that, when run one or more times, has the same effect as being run once.
Why does idempotence matter?
This is a question that data analysts on their way to becoming analytics engineers often do not grok immediately. Often you have to suffer the pain of operating a non-idempotent process in production prior to understanding just how incredibly critical this property is, and why you should refuse to design data systems that lack this property.
This post is an incredibly simple, concise answer to this question using a concrete example. It’s a great resource; share with folks on your team.
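To make the property concrete, here is a minimal sketch of my own (not taken from the linked post) comparing two ways of loading a day of data. The table name daily_orders, the function names, and the sample rows are all hypothetical, and I'm using an in-memory SQLite database only so the example runs anywhere. The first loader appends blindly, so a retry doubles the data; the second deletes the day's partition and reinserts it in one transaction, so running it twice leaves the same result as running it once.

```python
import sqlite3


def make_conn():
    """Create an in-memory database with a hypothetical daily_orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_orders (order_date TEXT, order_id INTEGER, amount REAL)"
    )
    return conn


def load_append(conn, day, rows):
    """Non-idempotent load: every run appends the same rows again."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO daily_orders (order_date, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


def load_idempotent(conn, day, rows):
    """Idempotent load: replace the whole partition for `day` in one transaction."""
    with conn:
        # Clear out anything a previous (possibly partial) run left behind.
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (order_date, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


def row_count(conn):
    """Return the number of rows currently in daily_orders."""
    return conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]


if __name__ == "__main__":
    rows = [(1, 10.0), (2, 25.5)]  # illustrative (order_id, amount) pairs

    appended = make_conn()
    load_append(appended, "2021-01-01", rows)
    load_append(appended, "2021-01-01", rows)  # simulate a retry
    print("append, run twice:", row_count(appended))

    replaced = make_conn()
    load_idempotent(replaced, "2021-01-01", rows)
    load_idempotent(replaced, "2021-01-01", rows)  # simulate a retry
    print("delete + insert, run twice:", row_count(replaced))
```

The design choice worth noticing is that the idempotent version owns a well-defined slice of the table (here, one date) and rewrites that slice completely. That is what makes a retry after a failure safe: you never have to work out how far the previous run got.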
This is super-neat. A SQL notebook! Querybook has three types of cells: text, SQL, and chart. I really love seeing innovation in this space. Speaking of which…check out Hex.
Notebooks had previously been popular primarily in data science, but are slowly making their way through to all user personas on the data team. There are real advantages to conducting analysis in this way. I much prefer notebooks for exploratory analytics.
Thanks to our sponsor!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.