Git for Data. AI Overhang? Metadata @ Shopify. Data Clinics. Tools I'm Watching. [DSR #232]
This week's best data science articles
File this under “things written by Michael Kaminsky that I couldn’t have possibly said better myself.” (This has become a large folder over the years.)
Here’s his conclusion:
I’m broadly sympathetic to the goals that people who are working on “git for data” projects have. However, I continue to believe that it’s important to keep code separate from data and that if your data system is deterministic and append-only, then you can achieve all of your goals by using version-control for your code and then selectively applying transformations to subsets of the data to re-create the data state at any time. The motto remains: Keep version control for your code, and keep a log for your data.
I have always been just slightly confused as to the “git for data” concept but never dug in deeply. It just doesn’t mirror the experience I have of doing large-scale production data work! That is not at all to say that it doesn’t have an important role to play in certain ML workflows, but I think it’s important to have a clear understanding of where it’s relevant and where Kaminsky’s approach (which is what I’ve always practiced) is preferable.
Very open to being told I’m wrong here! Just hit reply.
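To make the motto concrete, here is a minimal Python sketch of "version control for your code, a log for your data." It is an illustration under assumptions, not Kaminsky's implementation: the `Event` schema, the `transform_v1` function, and the `state_as_of` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One immutable row in an append-only log (hypothetical schema)."""
    seq: int      # position in the log; rows are only ever appended
    account: str
    amount: int


def transform_v1(events: Iterable[Event]) -> Dict[str, int]:
    """Deterministic transformation: net balance per account.

    This function is the part that would live in version control.
    """
    balances: Dict[str, int] = {}
    for e in events:
        balances[e.account] = balances.get(e.account, 0) + e.amount
    return balances


def state_as_of(log: List[Event], seq: int) -> Dict[str, int]:
    """Re-create the derived state as it stood after log entry `seq`."""
    return transform_v1(e for e in log if e.seq <= seq)


log = [
    Event(1, "a", 10),
    Event(2, "b", 5),
    Event(3, "a", -4),
]

print(state_as_of(log, 2))  # state before the third event arrived
print(state_as_of(log, 3))  # current state
```

Because the log is never mutated and the transformation is deterministic, any past state is a function of two things: a position in the log and a commit of the code. Checking out an older version of the transformation and replaying the same subset of the log would reproduce the older output, with no separate versioning of the data itself.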
A follow-up to my recent GPT-3 posts: this article is fascinating, and fundamentally quite practical:
I am worried we’re in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
It’s focused on a thought experiment: “What if, by scaling up existing GPT-3-style NLP by a factor of 1000, we could achieve human-level performance on a wide range of tasks?”
Another hyperscale tech company, another home-grown metadata product. I’ve written a lot about these over the years, and Shopify’s Artifact seems to be very solid, although I’m not sure that it’s incredibly different from others like Amundsen and DataHub. A couple of things I liked about it:
The team aggressively gauged success using customer surveys, and made a meaningful impact on data workflows. Really impressive, and indicative of what this category of tooling can achieve.
The post actually outlines how the team thought about the build/buy decision. This is the first time I’ve seen a team reflect on this, and one of their reasons really resonated with me:
At Shopify, we have a wide range of data assets, each requiring its own set of metadata, processes, and user interaction. The tooling available in the market doesn’t offer support for this type of variety without heavy customization work.
Essentially: this product needs to integrate with everything in order to be useful, and given the complexity of the data infrastructure at any sufficiently large organization, off-the-shelf tooling will almost never manage that without significant internal engineering work.
Very cool, very practical…I want to steal this.
Data Clinics, the time our team puts aside daily for working with stakeholders on any walk-in requests, offer the best ROI for the data team’s time. (…) Before we had Data Clinics, our work was all over the place: one day would be entirely focussed on ad-hoc, and the next entirely on longer-term projects. (…) We needed a way to prioritize both sides: ad-hoc and planned.
DC_THURS : dbt w/ Drew Banin
Data Council was kind enough to host my cofounder Drew on a recent episode of their show, DC_THURS, where he talks analytics engineering, data quality, and the future of dbt. Good stuff.
Sarah and Joyce at Projects to Know wrote a better description of this than I could:
In the past few years, policymakers have released new regulations like GDPR and CCPA, which impact how companies handle sensitive user data. Likewise, companies have designed new security strategies to protect their users. However, the technical implementation of these complex policies can be very challenging. To address this problem, some developers use web frameworks like Hails and Jacqueline to declaratively specify and automate data-dependent policies for information flow control (IFC). Unfortunately, these IFC frameworks may impose performance costs and do not catch errors early. In response to these limitations, Polikarpova et al. present Liquid Information Flow TYpes (LIFTY), a DSL for writing secure data centric applications, which encodes static IFC into an expressive yet decidable type system. With LIFTY, developers declare sources of sensitive data and specify policies. The language then statically and automatically verifies that the application conforms to the security policies.
I find this interesting because most companies I speak to are attempting to apply policies once the data leaves the local application and begins to be piped throughout the larger data engineering stack. This takes the viewpoint that data policies should be enacted and enforced by the data-generating application. There is certainly a lot to like about this approach.
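The idea of declaring policy at the data source can be sketched in a few lines of Python. To be clear about what this is and is not: LIFTY verifies policies statically through its type system, whereas the sketch below checks them at runtime, which is closer in spirit to the dynamic frameworks the summary mentions. The `Sensitive` wrapper, `load_email` function, and `reveal` method are hypothetical names invented for illustration and are not part of LIFTY.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Sensitive(Generic[T]):
    """A value bundled with the policy that governs who may see it."""
    value: T
    may_view: Callable[[str], bool]  # policy: is this viewer allowed?
    redacted: T                      # what everyone else sees instead

    def reveal(self, viewer: str) -> T:
        """Return the real value only if the policy admits the viewer."""
        return self.value if self.may_view(viewer) else self.redacted


def load_email(owner: str) -> Sensitive[str]:
    """Hypothetical data source: the policy is declared where the data is born."""
    return Sensitive(
        value=f"{owner}@example.com",
        may_view=lambda viewer: viewer == owner,
        redacted="<hidden>",
    )


email = load_email("alice")
print(email.reveal("alice"))  # the owner sees the address
print(email.reveal("bob"))    # anyone else sees the redacted form
```

The point of the sketch is the placement of the policy: it travels with the value from the moment the application creates it, rather than being bolted on after the data has left the application. A static system in the style the paper describes would aim to reject a program that leaks the raw value before it ever runs, instead of redacting at runtime as this one does.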
New Tools I'm Watching
I’m always monitoring the data tooling landscape, and I figured I’d start sharing some of the more interesting products I come across. No guarantees that I’ll have any to share every week; it’ll be as I come across them. Have any to share? Feel free to email me, but I’m only going to share tools that I find interesting and that are new to me.
Iteratively: Tracking plans! Most likely, the quality of your event data is low; Iteratively is helping you solve that. This is a real problem.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.