The Metrics Layer. Speculative Fiction. What's Happening in a NN? Rockset. Open BI. Data-as-a-Product. [DSR #250]

Some issues are a particular joy to curate, and this was one of them. I hope you enjoy it!

– Tristan

❤️ Want to support this project? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

This week's best data science articles

The missing piece of the modern data stack

Such a great post. I’m so glad that Benn now has a Substack!! Here’s the meat of it:

To extract metrics from these tables, people have two options: They can pull from pre-aggregated rollups, or they can compute new metrics on the fly from granular dimension tables.

Rollup tables are typically generated by transformation tools like dbt, so the metrics in these tables can be consistently defined and reliably governed. However, because rollup tables are precomputed, there’s a practical limit to how many can be created. As a result, they’re often only built for top-level metrics, like active users or customer NPS.

But self-serve analysis requires another level of depth—daily active users for a particular customer segment, or NPS for a particular type of user. Even with just a handful of metrics and segments, it’s all but impossible to precompute every possible combination.
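
To make that last point concrete, here is a minimal Python sketch of the tradeoff Benn describes. It is only an illustration: the event rows, the column names, and both helper functions are hypothetical and do not come from the post. One function computes daily active users on the fly from granular rows for any segment, and the other counts how many rollup tables you would need to precompute every combination instead.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical granular event rows: (day, user_id, plan, region).
COLUMNS = ("day", "user_id", "plan", "region")
EVENTS = [
    ("2021-05-01", "u1", "free", "US"),
    ("2021-05-01", "u2", "paid", "US"),
    ("2021-05-01", "u2", "paid", "US"),  # same user active twice that day
    ("2021-05-01", "u3", "paid", "EU"),
    ("2021-05-02", "u1", "free", "US"),
]


def daily_active_users(events, segment_by=()):
    """Compute DAU on the fly, grouped by day plus any segment columns."""
    key_positions = [COLUMNS.index(c) for c in ("day", *segment_by)]
    user_position = COLUMNS.index("user_id")
    users = defaultdict(set)
    for row in events:
        key = tuple(row[i] for i in key_positions)
        users[key].add(row[user_position])  # a set counts each user once
    return {key: len(ids) for key, ids in users.items()}


def rollups_needed(n_metrics, segment_columns):
    """Count the rollup tables needed to cover every subset of segment columns."""
    n = len(segment_columns)
    subsets = sum(
        1 for size in range(n + 1) for _ in combinations(segment_columns, size)
    )
    return n_metrics * subsets


if __name__ == "__main__":
    print(daily_active_users(EVENTS))
    print(daily_active_users(EVENTS, segment_by=("plan",)))
    print(rollups_needed(5, ["plan", "region", "device", "cohort", "channel", "industry"]))
```

With just five metrics and six segment columns, the count comes to 5 × 2^6 = 320 rollup tables, which is the "all but impossible" part. The on-the-fly function handles any segment, but every consumer has to re-implement the metric definition, which is the governance gap the post is about.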

The whole post is a goldmine. I am not entirely convinced that his proposed solution architecture is correct, but it is certainly compelling.

Another good read on this topic: Headless BI.


Short Story on AI: Forward Pass

Ok, this is … wild, but stick with me for a second. This is a recent Andrej Karpathy post narrating the first-person perspective of GPT-3 as invoked during a Turing test. Here’s the first sentence:

It was probably around the 32nd layer of the 400th token in the sequence that I became conscious.

I’m not going to critique this as a piece of creative writing (I couldn’t be less qualified), but I’ve read a lot of sci-fi in which one of the plot lines is AI-becomes-emergently-conscious, and having one of the pre-eminent AI researchers in the world write it adds quite a lot. I really highly recommend reading it. It’s not long.

I don’t know where you’re at on the question of whether AI will become conscious, but my personal view is that it “won’t not.” As in…we are so clueless about what consciousness is that anyone who tells you that this definitely won’t happen is swimming way past the buoys. Having the people at the pinnacle of the field speculating on these kinds of scenarios via short fiction feels highly generative and I’d love to read more of it.


Weight Banding

Let’s just follow the “we don’t know what the fuck is happening inside of neural networks” thread for a second. OpenAI just released two papers—Weight Banding (this link) and Branch Specialization—and both of them are focused on this. The conclusion from Weight Banding is particularly noteworthy in just how open-ended it is:

Once we really understand neural networks, one would expect us to be able to leverage that understanding to design more effective neural networks architectures. (…) It’s unclear whether weight banding is “good” or “bad.” We don’t have any recommendation or action to take away from it. However, it is an example of a consistent link between architecture decisions and the resulting trained weights. It has the right sort of flavor for something that could inform architectural design, even if it isn’t particularly actionable itself.

(emphasis mine)

I can’t tell you the last time I read “flavor” used in an academic paper on ML, but I absolutely love it. I increasingly feel like many researchers today are perfectly happy constructing black-box experiments and describing effects, but I find myself far more curious about what is actually inside the box. For an example of why focusing on this stuff matters, check out the comment from a neuroscientist at the end of Branch Specialization:

From the perspective of a neuroscientist, a striking result from the investigation of branch specialization by Voss and her colleagues is that robust branch specialisation emerges in the absence of any complex branch specific design rules. Their analyses show that specialisation is similar within and across architectures, and across different training tasks. The implication here is that no specific instructions are required for branch specialisation to emerge. Indeed, their analyses suggest that it even emerges in the absence of predetermined branches. By contrast, the intuition of many neuroscientists would be that specialisation of different areas of the neocortex requires developmental mechanisms that are specific to each area. For neuroscientists aiming to understand how perceptual and cognitive functions of the brain arise, an important idea here is that developmental mechanisms that drive the separation of cortical pathways, such as the dorsal and ventral visual streams, may be absolutely critical.

!!! So neat.


Converged Index™: The Secret Sauce Behind Rockset's Fast Queries

Learn how Rockset delivers low-latency SQL for search and analytics using a combination of row, column, and search indexes.

Ok, I’ve officially been turned on to Rockset in the past couple of weeks. I’ve had the pleasure of meeting the CEO, Venkat, recently. He’s impressive, the product is impressive, and it solves a real need. Here’s the short version of why I think the product is interesting: it can be used as a serving layer for data products.

So…you’ve ingested a ton of data into Snowflake, you’ve built highly performant and modular transformations in dbt, and you’ve built a user interface that interacts with the final layer of transformed data. Here’s the problem: almost certainly, your UI responsiveness is poor. Even if the data is modeled well, you’ll typically see Snowflake respond in ~1 second on the low end, and for certain types of interactivity that number is higher. In an interactive context, users expect faster response times (typically more on the order of 50-200 ms), and so your product will always feel sluggish. This is one use case for Rockset. Take your final datasets and load them into Rockset, and voila!, you’ll see interactive response times plummet. The post explains how they achieve this.
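
If the "row, column, and search indexes" idea is new to you, here is a toy Python sketch of the general shape. To be clear, this is my own illustration and not Rockset's implementation: the class name, the method names, and the in-memory dictionaries are all assumptions made for the example. The point is only that every document is written three ways at insert time, so each kind of query can read from the layout that suits it.

```python
from collections import defaultdict


class ToyConvergedIndex:
    """Toy illustration: store each document in three index shapes at once."""

    def __init__(self):
        self.rows = {}                    # row index: doc_id -> full document
        self.columns = defaultdict(dict)  # column index: field -> {doc_id: value}
        self.search = defaultdict(set)    # search index: (field, value) -> doc_ids

    def insert(self, doc_id, doc):
        """Write one document into all three indexes (values must be hashable)."""
        self.rows[doc_id] = doc
        for field, value in doc.items():
            self.columns[field][doc_id] = value
            self.search[(field, value)].add(doc_id)

    def get(self, doc_id):
        """Point lookup of a whole document: served by the row index."""
        return self.rows[doc_id]

    def column_sum(self, field):
        """Aggregate over one field: served by the column index."""
        return sum(self.columns[field].values())

    def find(self, field, value):
        """Selective equality filter: served by the search index."""
        return sorted(self.search.get((field, value), set()))


if __name__ == "__main__":
    index = ToyConvergedIndex()
    index.insert("a", {"city": "NYC", "amount": 10})
    index.insert("b", {"city": "SF", "amount": 5})
    index.insert("c", {"city": "NYC", "amount": 7})
    print(index.get("b"))
    print(index.column_sum("amount"))
    print(index.find("city", "NYC"))
```

The tradeoff in this sketch is the obvious one: writes do three times the work and storage grows, in exchange for reads that never have to scan the wrong layout.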

Designing a database engine is fundamentally about making tradeoffs, and Snowflake’s core value proposition is crunching arbitrarily large datasets fast. In order to perform well in other contexts, different tradeoffs need to be made. The Snowflake folks clearly realize this, which is why they’re currently previewing their own search optimization service and query acceleration service. My guess is that they’re attempting to have an in-platform answer to this exact problem. So much the better: it is a real pain point today.


The Future of Business Intelligence is Open Source

This is maybe just a little bit more of an advertisement for Superset / Preset than I needed, but it’s a really really important topic and Max is in a uniquely strong position to make this claim. I won’t attempt to summarize—the post does a great job on its own and is already concise.

I do want to add another point that I care a lot about when it comes to freedom and open source. Max focuses on the freedom of the company relative to the vendor; I also think it’s important to think about the freedom of the employee relative to the company. If a tool is open source, I as a data analyst can take it with me from job to job whether or not my new company has approved it as a budget line item. I just download it and get to work. This enables me to invest in a skillset and community that will stick with me for a long time and be a strategic asset in my career, not just a productivity tool that I happen to use at my current job. This is how software engineers relate to their tools…no one can take sed or vim away from you.


Run Your Data Team Like A Product Team

Data teams aim to help the people in their organization make better decisions. Many data teams aren’t doing this as well as they could and are missing out on a huge opportunity, both for the organization and the team. This gap is due to teams not being set up for success, which undermines trust in the data and the insights the team generates.

There is a better way to build and run a data organization: run it as if you were building a Data Product and all of your colleagues are your customers. We believe this has the ability to transform your organization and help teams reach their true potential.

If you are not already fully sold on and doing data-as-a-product, this is a must-read.


Thanks to our sponsor!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

