4 Comments
Sep 16·edited Sep 16

Hi Tristan. Interesting post and I am happy to see practitioners trying to tackle the full lifecycle of analytics.

There are many good parts to this, and for people new to the space, your piece summarizes many important aspects of the modern data stack. Since it is already a good preliminary summary, I will instead add some constructive criticism on a handful of areas where I think more explanation could be useful.

In the `Intro`, you mention 4 types of truth claims: descriptive, causal, predictive and prescriptive. What I think is important to note, and to spend more time on at some point (though today it evades much of the modern data stack), is the correlative nature of the last 3. Causation (“if this occurs, then that will occur”), prediction (“given this, we will observe that”) and prescription (“if I do this, then that will occur”) are all correlative in nature. That is, in contrast to univariate descriptive statistics (“reporting”), each looks at the relationship between two or more variables. We know the answers to such questions lie in statistics/data science/machine learning/causal inference, but your piece barely mentions these.

When I think of “analytics”, I think of developing an understanding of the data. Certainly, this includes univariate, descriptive reporting: business intelligence, dashboards. This is where dbt shines. But I also think of “if-then” causality. “Why did this go up?” “Will that go up if we do this?” Many, many analytical questions are attempts to infer causality, in other words, how a system works: which levers affect which outputs. (In practice, this is quite difficult, so we often settle for correlation, which is what the vast majority of statistics addresses.) As data practitioners, we are stuck with the data, but the analytical nature of our job is to statistically infer the “data generating process” (DGP) - that is, the shape of the system itself.
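To make the contrast concrete, here is a minimal Python sketch of the difference between a descriptive summary and an estimate of the DGP. Everything in it is an assumption for illustration: the linear process, the `true_slope` of 2.0, and the helper names `simulate` and `ols_slope` are all hypothetical. Because the simulated process has no confounding, the fitted slope happens to recover the lever; with real data that is exactly the part that is hard.

```python
import random
import statistics


def simulate(n=1000, true_slope=2.0, seed=0):
    """Hypothetical data generating process: y = true_slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Least-squares slope of y on x, computed from first principles."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


xs, ys = simulate()

# Descriptive ("reporting"): a single-variable summary of what happened.
mean_y = statistics.fmean(ys)

# Relational: an estimate of how the output moves when the lever moves.
slope = ols_slope(xs, ys)

print(f"mean of y: {mean_y:.2f}")
print(f"estimated slope of y on x: {slope:.2f}")
```

The mean answers "what is the level of y?", while the slope is a (here, unusually clean) guess at the shape of the system that produced y.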

As a result, I think it’s a stretch to reach for the mantle of “analytics” when, so far, we’re really just talking about reporting. Admittedly, displaying data is, in its own right, very difficult to do correctly/reliably/durably/comprehensibly/“auditably”/performantly/resiliently, and I think you emphasize more or less the right points here.

In your `Requirements of a mature analytics workflow`, some additional requirements - which maybe you are placing into existing categories - could include: Durability (what if someone deletes the staging table in Snowflake?), Semantics (Governance generally includes this, though your description mostly focuses instead on regulatory compliance), Discoverability (also under Governance), and Monitoring/Observability (perhaps this falls into Reliability, though monitoring is such an important space - DataDog, Splunk, New Relic, etc. - that I believe the keyword deserves a mention).

In the `Stakeholders of the ADLC`, I mostly agree with these, though I do think there is a “researcher” persona which is different from the others. Perhaps you are placing it under the analyst - which is reasonable - though in practice the skills of a researcher (statistics, econometrics, data science) barely overlap with those of the typical data/business analyst, who instead needs business context, dashboard skills, PowerPoint/Excel, and maybe a bit of pandas. For the 3 roles as you have them, I personally differentiate them by what they are trying to accomplish: to assemble the data, to understand it, and finally, to act on it.

In the `Hats, not badges` section, the phrase that came to mind was “full-stack analytics” - i.e. we ought not to be strictly confined to our title’s narrow domain.

In the `ADLC model` section, I don’t necessarily agree that “analytical systems are software systems.” PowerPoint, for example, is an integral part of how we communicate our analyses to stakeholders. It is absolutely the case that an “analysis” today is memorialized in a 10-slide deck, and frankly, I think this is the best output for it. Perhaps it is debatable whether that should be the case, but I think it is unobjectionable that most analyses today are conveyed in presentation format. In fact, the distinction I find between a dashboard (live) and a PowerPoint (static) matters precisely because a PowerPoint is static: we are taking a stance (snapshot) at a particular point in time (of the data) and making a recommendation (today) because of it. Looking at a live dashboard does not achieve this; the narrative nature of a PowerPoint does.

Exploratory data analysis, in my experience, is also not really a software system. I fully anticipate that 90% of our research is throwaway work. So it is with all science. Of course, the last 10% we do indeed memorialize as code and check into VCS, but the other 90% is not, and it does not adhere to most software principles. I often differentiate between “research code” and “software code”; research code exists because it serves a useful, albeit transitory, purpose. I would say that “reporting systems” or “BI systems” are indeed software systems, but extending that claim to analytical systems more generally seems like a bit of a reach.

In the `Discover and Analyze` section, you discuss `Discover` in detail (mostly a lot of governance items) but I feel like the `Analyze` section is lacking.

Overall, I think it is a good introductory post, although I imagine a comprehensive explanation of the “Analytics Development Lifecycle” would likely necessitate more of a book-length review. I look forward to reading more!


Hey Tristan, great post as always.

I've read through the entire ADLC white-paper and it feels like you missed the most important point.

Coming from an engineering background, the reduction from software to data engineering is obvious (DevOps => DataOps). BUT, there's a very important distinction - the actual data.

Besides managing and deploying code (SQL/Python/...), you also need to run the data.

This creates a host of new challenges - orchestration, data quality (no clear stack trace), data version management, data discovery, data observability etc.
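As a rough illustration of the "no clear stack trace" point, here is a minimal Python sketch in which the code runs cleanly while the data is wrong. The rows, the column names `order_id` and `amount`, and the helper `check_rows` are all hypothetical, made up for this example rather than taken from any particular tool.

```python
def check_rows(rows, key="order_id", required=("order_id", "amount")):
    """Return a list of human-readable data quality problems (empty if none)."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Flag required columns that are missing or null.
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: {col} is null")
        # Flag repeated values of the key column.
        k = row.get(key)
        if k is not None:
            if k in seen:
                problems.append(f"row {i}: duplicate {key}={k}")
            seen.add(k)
    return problems


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},   # null amount
    {"order_id": 1, "amount": 5.0},    # duplicate key
]

# The "pipeline" itself never raises: the null is silently skipped and the
# duplicate is silently counted, so there is no stack trace to follow.
total = sum(r["amount"] for r in rows if r["amount"] is not None)

# The problems only surface when the data itself is checked.
problems = check_rows(rows)
for p in problems:
    print(p)
```

The code deploys and executes fine either way; only an explicit check on the data reveals that the total is built on a null and a duplicate.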

Feels like it should get more emphasis throughout the essay.

author

Hi! Love this comment.

I *intentionally* left this out. In the data industry, we are *too obsessed* with these topics, and we totally lose the idea of a workflow. The conversation ends up being "what tool are you using for observability!?" which is ultimately not productive.

Yes, sure, there are jobs-to-be-done here that are specifically related to data. And I call that out in the whitepaper. But the SDLC doesn't talk about "building mobile apps" or "building services" either. And we all agree that it shouldn't. IMO the ADLC is a _workflow_ and it is independent of the specific JTBD--it applies to _all_ of them.

Appreciate the comment.


I get where you're coming from, but IMO, everything about running data - who's executing it, where, how we bounce back when things go sideways - that's all part of the ADLC workflow.

Sure, it's tempting to turn this into a tools conversation (and fun, let's be honest), but there's a reason we've got data-specific software - because the lifecycle contains more steps.
