The Analytics Engineer. Code > GUIs. Dask. Feature Stores. BERT and Sentence Structure. [DSR #228]
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
Analytics Engineer: this term has started showing up in blog posts and job listings. It all happened quickly; just a couple of years ago, it wasn’t a thing our friends in the data ecosystem talked about. So how did it start trending, what is it exactly, and is it here to stay? We decided to take a closer look, and here’s what we found out.
👍👍 Awesome stuff from Data Council. It’s nice to see this role gaining recognition.
There’s a wave of energy and investment going into “no-code” tools. The thesis is that working with code is intimidating and difficult, so removing it is the best way to empower the non-technical masses.
(…) The real issue is with the arcane, overly complex workflows required to do anything productive with code — the lived reality is that “code” itself is the least intimidating part of coding.
I could not agree more with this, and I think it’s something that both product designers and investors frequently get wrong. Writing software code is like using human language—the combinatorial power of expressing ideas in language (whatever language) means that if you have the right primitives you can express literally anything. If, however, you’re locked into a GUI, you can only do something if the people who built that UI anticipated your need.
Want to enable more people to work with data? Remove technical hurdles, not code itself. Fantastic post.
Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand labeled for this latent structure.
However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. (…) we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.
This is super super cool. We’ve started to understand why BERT works as well as it does, and it turns out it’s parsing sentence structure in much the same way we do, without ever having been explicitly taught how to do so.
If you’ve been waiting for a resource to dig into the inner workings of BERT and transformer models, this is a good one.
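The core result is easy to state concretely: a learned linear map B is applied to a model’s contextual embeddings, and squared distances in the projected space approximate parse-tree distances between words. Here’s a toy sketch of that probe idea (not the paper’s code; the dimensions, numbers, and identity B are made up for illustration — real probes are trained on treebanks against real BERT embeddings):

```python
# Structural-probe sketch: distance(i, j) = ||B(h_i - h_j)||^2, which a
# trained probe makes approximate the number of edges between word i and
# word j in the parse tree. Pure Python, toy 2-d "embeddings".

def project(B, v):
    """Apply the linear map B (a list of rows) to vector v."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def probe_distance(B, h_i, h_j):
    """Squared norm of B applied to the embedding difference."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    proj = project(B, diff)
    return sum(x * x for x in proj)

# Identity probe on hand-picked 2-d vectors, just to show the arithmetic.
B = [[1.0, 0.0], [0.0, 1.0]]
h_the, h_cat = [0.0, 0.0], [1.0, 1.0]
print(probe_distance(B, h_the, h_cat))  # 2.0
```

The surprising empirical finding is that for BERT-style models a single linear B suffices to recover tree distances well, which is what lets the authors approximately reconstruct whole parse trees from the embeddings alone.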
Hah. What a page! I covered Michelangelo here when it came out but honestly had no idea what an explosion of tooling we’d seen in the space in the past two years. This page is a collection of every known article and talk on the topic.
If you’re not familiar with what a feature store is, you should scroll down to the “feature store concepts” section and read that.
You’re likely familiar with Dask, the parallelization framework for Python. If you’re like me, you haven’t had to use it yourself and so haven’t had the opportunity to go deep. This podcast is an easy way to get an overview; it gets satisfyingly technical. Here’s my favorite chunk of the transcript:
(Q) In Dask, if I want to instantiate a really, really big distributed array, what kinds of work are you doing in Dask to instantiate that array?
(A) (…) we’ve got these thousand machines, each holding maybe 10 NumPy arrays, and now we need to map out, for each particular NumPy array, where it fits in the broader picture. Maybe this is the NumPy array that corresponds to the temperature over France, for example. On this other computer is a NumPy array corresponding to the block of temperature over Italy. We know that if we want to look at the Italy–France connection, we need those two machines to talk to each other.
Dask is really a system that’s watching all those machines, tracking all those Python objects, and, as necessary, telling those machines what to do: “Okay, it’s now time for the machine holding France to compute its sum. It’s now time for the machine holding Italy to transfer that array over to the machine holding France so that we can do some interaction.”
There are two problems here. One is figuring out a plan of which arrays need to talk to each other; the other is executing that plan, which means a lot of talking to all the machines to make sure they’re doing the right thing and, if one machine goes down, making sure the work that was on it gets redone.
Cool! Dask literally layers on top of NumPy and adds cluster support. Got it.
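The scheduling idea from the transcript can be sketched without Dask at all (this is not Dask’s API — just the France/Italy example as a dict of blocks, where each block could in principle live on a different machine and the “scheduler” decides what each holder computes):

```python
# A "distributed" array as named blocks. Dask's job is knowing which
# block covers which region and orchestrating per-block work plus the
# cross-machine combination; here both steps just run locally.

blocks = {
    "france": [14.2, 15.1, 13.8],  # temperatures over France (made up)
    "italy":  [18.4, 19.0, 17.6],  # temperatures over Italy (made up)
}

# Step 1: "it's now time for the machine holding France to compute its sum"
partial_sums = {region: sum(values) for region, values in blocks.items()}

# Step 2: ship the partials to one place and combine them.
total = sum(partial_sums.values())
print(round(total, 1))  # 98.1
```

With actual Dask you’d express the same computation as one chunked array (e.g. `dask.array` with a `chunks=` argument) and call `.sum().compute()`, and the scheduler would generate and execute this block-wise plan for you across the cluster.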
Profiles of three databases: TileDB, Materialize, and Prisma (not really a database?). Good stuff. I strongly agree with the author’s perspective:
I’m a fan of “specialty” DBs that home in on a specific set of data types and problems. The great thing about traditional RDBMSes is that they’re versatile enough to cover an extremely wide array of use cases (no pun intended), but sometimes you have “last mile” edge cases that are both (a) beyond the capabilities of “kitchen sink” systems and also (b) at the core of your business. I expect to see the emergence of more systems like this as database use cases become ever more specialized and new problem domains emerge.
I believe that data science workloads will, as they become increasingly well-defined and increasingly productionized, lead to the Cambrian explosion of databases that the author describes. The progressive separation of application logic and data processing that occurred over the course of 20+ years leading up to the RDBMS has a clear analog in our current era.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123