Everpresent metadata. Thin semantic layers. Accessible data.
Strategy in the data platform layer. Data mesh. And lots lots more.
This week I want to step away from the “thinkpiece” section of the Roundup and focus exclusively on sharing some great posts. There’s just too much good stuff getting written right now and I’m excited to read and talk about it!
Enjoy the issue!
- Tristan
Josh Wills started an epic battle of vendor one-upmanship in this hilarious-to-me Twitter thread. 100% agree with the original point: metadata should just be…everywhere.
This is a good hook to Sarah’s most recent newsletter, where she talks about how to select a data catalog. It’s a fantastic resource exploring this category, sharing information on both the staid vendors (Informatica, IBM) and the recent explosion of newer ones.
IMO, like Josh says, the core of the “catalog” is actually the integrations. Can you ingest and embed metadata from and to all of the relevant places? I do not want to enter descriptions into a catalog basically…ever. Does your catalog support integrating with dbt’s metadata so that you can grab descriptions and more? And I don’t want to go somewhere else to learn about the freshness of a particular pipeline. Does your catalog surface metadata into your BI tool of choice? Many tools now do a good job on both fronts.
One of my favorite recent posts is by Max @ Preset and discusses different approaches to building dashboards. Some BI tools work off of a query (à la Mode), some work off of a semantic layer (à la Looker). Max, and Superset/Preset, advocate for a third approach: building dashboards on top of “datasets”.
I think the term dataset is, in this context, a bit confusing because Max means a lot more than simply tables/views in a database:
…raw data tables alone aren't enough and it's critical to incorporate some ideas from the semantic layer to get the best of both worlds. Datasets should contain extra semantics, like:
clear labels and rich descriptions for the dataset itself
clear labels and descriptions for the columns
metrics as aggregate SQL expressions
Total Population: SUM(population)
Sessions per User: COUNT(DISTINCT session_id) / COUNT(DISTINCT user_id)
calculated dimensions computed row-by-row at runtime
definitions for which columns can be aggregated and filtered on
for time series columns, information on timezone, time granularity, etc
for numerical columns, information on units and preferred formatting
What I think the article is really advocating for is a semantic layer without joins…! Tables plus metadata to make those tables consumable in a BI layer. Which, notably, is exactly how dbt’s metrics are built :D
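To make the "tables plus metadata" idea concrete, here is a minimal sketch in Python. It is not how Superset, Preset, or dbt actually implement datasets or metrics; the `Dataset` class, its fields, and the `sessions` table are all hypothetical names invented for illustration. The only thing taken from the quoted post is the shape of the idea: one table, human-readable labels, and metrics stored as aggregate SQL expressions, with no joins anywhere.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One table plus the metadata a BI tool needs. Deliberately no joins.

    Hypothetical structure for illustration, not a real Superset or dbt API.
    """
    table: str
    description: str
    dimensions: dict[str, str] = field(default_factory=dict)  # column -> label
    metrics: dict[str, str] = field(default_factory=dict)     # label -> aggregate SQL

    def query(self, metric: str, by: str) -> str:
        """Compile one metric, grouped by one dimension, into a SQL string."""
        if metric not in self.metrics:
            raise KeyError(f"unknown metric: {metric}")
        if by not in self.dimensions:
            raise KeyError(f"unknown dimension: {by}")
        return (
            f"SELECT {by}, {self.metrics[metric]} AS value "
            f"FROM {self.table} GROUP BY {by} ORDER BY {by}"
        )


# The metric expression is the one quoted in the post above.
sessions = Dataset(
    table="sessions",
    description="One row per session event (made-up example table)",
    dimensions={"country": "Country"},
    metrics={
        "Sessions per User": "COUNT(DISTINCT session_id) / COUNT(DISTINCT user_id)",
    },
)


def run_example() -> list[tuple]:
    """Run the compiled query against a tiny in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions (country TEXT, user_id TEXT, session_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?, ?)",
        [
            ("US", "u1", "s1"), ("US", "u1", "s2"),
            ("US", "u2", "s3"), ("US", "u2", "s4"),
            ("DE", "u3", "s5"),
        ],
    )
    # Note: SQLite divides integers as integers, so the ratio is truncated.
    rows = conn.execute(sessions.query("Sessions per User", by="country")).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(sessions.query("Sessions per User", by="country"))
    print(run_example())
```

The point of the sketch is how little is in it: the metric lives next to the table as a string of SQL, and the "semantic layer" is just enough metadata to turn a metric name and a dimension into a GROUP BY query.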
I think the conversation we’re starting to have as an industry is “How thick should my semantic layer be?” As the entire ecosystem charges headlong towards the semantic-layer-informed version of the world, this is a critical question. Do we want to attempt to define all of our business logic in the semantic layer, just like we did back in 2015 in LookML? Do we want to try to smash as much as we can into precalculated columns in dbt models?
Neither of these extremes is a good answer. Try to do too much in the semantic layer and you lose all of the assertiveness (testing, CI/CD, lineage, etc.) of dbt. Try to do too much in dbt and you generate a bunch of junk models attempting to anticipate users’ interactive needs.
If you’ve spent enough time building in the dbt+LookML stack, you’ll have developed a good sense for what belongs in which layer. Gold-quality dim and fact tables go in dbt, business metrics go in the semantic layer.
Auren Hoffman (who is shockingly good at Twitter but also is the founder of SafeGraph) believes that:
It’s Our Moral Obligation to Make Data More Accessible
It is not often that posts about data ring out in the register of a moral polemic. And at first this post felt…strange, out of place. But as I read further, I found myself swept up. Here’s one argument that really stuck in my craw:
The IRS has income data on hundreds of millions of people over decades – including the incomes of people’s parents and grandparents. It is one of the largest and most comprehensive longitudinal studies in history. (…)
However, only a select few researchers have access to the data. Raj Chetty is famous. He’s a Professor of Economics at Harvard. He won the John Bates Clark Medal. His studies have been cited by thousands of articles. He’s amazing. He is one of roughly four researchers that has access to the IRS data.
By analyzing the tax returns, Chetty and his colleague were able to publish many monumental longitudinal studies. One example is where Chetty and his colleagues analyzed upward mobility across generations throughout the U.S. They found that upward mobility was heavily influenced by where one grew up. His finding: upward mobility exists – it’s just not evenly distributed.
Auren’s argument here is that, using modern differential privacy techniques, it’s possible to make IRS data available to anyone interested in doing this research and not just to four card-carrying members of Big Academia. I’m a fan of institutions of higher education, but I’m not a fan of top-down hierarchical control of who can have access to the truth.
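For readers who haven't run into differential privacy before, here is a rough sketch of one textbook building block, the Laplace mechanism applied to a count query. This is an illustration only, not anything Auren's post specifies and not how the IRS or any real system does it; the function name `private_count`, the sample incomes, and the choice of `epsilon` are all assumptions made up for the example. The idea is that the analyst gets the true count plus random noise scaled to `1 / epsilon`, so no single person's presence in the data can be confidently inferred from the answer.

```python
import random


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values matching predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1 / epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    # Made-up incomes, purely for illustration.
    incomes = [28_000, 41_000, 55_000, 73_000, 120_000, 310_000]
    noisy = private_count(incomes, lambda x: x > 60_000, epsilon=0.5)
    print(f"noisy count of incomes above 60k: {noisy:.2f}")
```

Smaller `epsilon` means more noise and stronger privacy. A real deployment would also need to track a privacy budget across every query an analyst runs, which this sketch ignores.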
Here’s another one from Auren for good measure:
Here’s a quick hit on a new way to visualize what is happening in different types of joins. Time to replace those tired old Venn diagrams in your training decks?
Dan Cahana, a partner at GGV Capital, talks about strategy in the data ecosystem. There’s a lot of good in there, and this paragraph made me 👀
companies like Monte Carlo*, which deals primarily with metadata, or dbt Labs, which builds a data model, play at a layer of abstraction above the underlying data, which gives them an opportunity to build broader platforms that treat the warehouse as a (sophisticated) primitive.
The entire post is largely an extension of Benn’s original post on the Snowflake app store strategy. It’s one of the more effective “where is this all going” posts I’ve read of late.
Monzo wrote about their ML stack! It’s a solid “how we built this” post; the reason I link to it here is that I’ve had so many conversations recently about integration between the analytics and ML parts of the data tech ecosystem and I think Monzo is a great example of putting these two things together.
(…) we write dbt models to transform any data and prepare the input for the batch prediction job, and then we write a Python job that pulls in the data, loads the model, and spits out the predictions. The dbt models and the batch job are orchestrated together using Airflow, which is run by our Data Platform Engineering team. Once the required dbt models have finished running, the job is submitted to the AI Platform.
This requires alignment between two different teams on common tooling, but the power of an integrated analytics-and-ML system is 🔥
I don’t often cover Data Mesh, honestly, because I’ve been waiting for it to feel like a more well-defined thing. This post is one of the more level-headed, practical discussions in the space; it focuses on what data engineering looks like in a data mesh world. Here is my absolute favorite part:
It is clear that we are entering an era of distributed analytical data systems. This happened before with the operational(OLTP) systems — many organizations have transitioned from monolithic applications to microservices. While it is still in the early stages for data, distributed approaches offers opportunities to fix some of the problems we discussed earlier.
For example, Data Mesh proposes separation of responsibilities for engineers building the analytical data systems. In Data Mesh, there are domain teams who are responsible for creating domain oriented products called data products and a platform team who focuses on creating the technical enables for the domain teams.
This approach is similar to what has been proven successful in Microservices implementations. There also are typically two kinds teams — the microservices or domain teams who has the responsibility on building the services, and the domain specific parts of the user interface and the platform teams which provides service capabilities to create and and compose these services and user interfaces.
This approach i.e. creating a self service platform which hides the technical and functional complexity and provides a clean abstraction to its consumers is very common in distributed systems because it enables autonomy but also ensures conformance to critical aspects.
I think there are real reasons to believe that this distributed architecture could start maturing in the coming years. Just had a really exciting conversation with folks at the company about this and a lot of us are pretty jazzed.
Emily Thompson invited her colleague Tara Robertson on to her newsletter to talk about eliminating bias in your hiring process:
In this post, we want to contribute to the evolving discussion by highlighting three areas that are worth putting some intentional focus towards to hire Data Scientists more inclusively:
Writing a focused job description that matches your team’s needs
Being strategic in sourcing to go beyond your network
Designing a structured interview process so that you can be consistent in evaluating candidates.
We talk a lot about hiring and team building here, and there are a lot of hiring managers that read this newsletter. This stuff is just critically important in building a high-performing, diverse team.
The most recent post from David J is fantastic. In it, he attempts to answer “why are companies struggling with data?” This inability to derive value from data in spite of good technology is something I’ve observed repeatedly and originally really struggled to understand. I think the post does such a great job of breaking this dynamic down for those of us who may not experience these problems in our day-to-day. Here’s a particularly poignant bit:
Many of the most vocal members of the data community at large, work at organisations that have achieved a great culture. Most organisations don't have a great culture!
Many have low talent density which is protected by favouritism and cliques. Many have a lack of openness and trust in the organisation, perpetuated by political and self-interested staff in leadership. Many have poor leadership that can't lead or manage their staff well. Many are just mediocre in culture... not particularly bad and not great; I believe org culture is roughly normally distributed in terms of standard. Many are not focused on culture at all, and are purely focused on revenue and profit.
While we have companies and organisations in the world, the standard of culture and technical ability in them will remain normally distributed. The vast majority of data practitioners won't work in companies at the better end of the distribution. We need to be realistic about this as a data community, and enable ways for more companies to win with data. We especially need to enable the majority of companies in the middle of the distribution. Many data practitioners from the innovators part of the distribution are founding companies for this purpose.
Love it.