Analytics is a Profession.

Doctors, lawyers, engineers, accountants, and...data analysts?

Hi! Hope you’re having a great weekend :D

Last week you heard, for the first time, from my brand new partner in crime, Anna Filippova, with Quasi-mystical arts of data & the modern data experience. I’m really so excited to have Anna join me here on a regular basis—Anna’s both a practitioner of analytics engineering and a student of open source communities, and her perspective dovetails with and enriches my own so well. We’ll be alternating weeks from here on out, and may change a bit of the Substack plumbing moving forward to facilitate this. We’ll share more in coming weeks.

More quick announcements:

🤝 Coalesce!

Haven’t signed up for Coalesce 2021? Now is a good time. My favorite recent talk added to the agenda: Alex Viana, VP of Data @ HealthJoy, is talking about rapid prototyping:

One of the main ways that engineering and product teams mitigate these risks are by building prototypes. But in order to do this, data teams need to let go of their natural affinity for highly accurate data and instead become more comfortable validating ideas using approximate or hypothetical data.

🎧 New Podcast Episode!

Caitlin Colgrove (CTO @ Hex) joined Julia and me to discuss interactive notebooks and Google Docs as a mental model for production analytics exploration. Listen here or on your podcast app of choice.

That’s it, enjoy the issue :)

- Tristan


On Telling Compelling Stories

(and why we so often don’t)

My favorite Twitter thread for the week, of course, started with @sethrosen:

Of course, serious data professionals TJ Murphy and Emilie Schario decided that they needed to ruin the fun:

It turns out that Paige Berry, who is on Emilie’s team at Netlify, recently wrote an in-depth piece on exactly this topic for Locally Optimistic (which is just so convenient that it almost makes me think that Seth’s initial tweet was just a guerrilla marketing campaign for it):

After some trial and error, the Netlify Data Team learned to do some things that work well for sharing an insight and engaging our colleagues. We lead with a handful of bullet points highlighting the key takeaways, include a high-quality visualization, and encourage further exploration and discussion.

This recommendation is not rocket science—it really isn’t. If you were majoring in journalism you would start learning to create compelling summaries that encouraged readers to click in the first semester of your freshman year. But does your organization do this well? I work with some truly world-class data professionals and we don’t do this particularly well. My belief is that this is the norm today. Paige’s nudge, then, is clearly needed even if it seems like it shouldn’t be.

I’m often a tools person, seeing the problems of the data world through the lens of what is broken or missing from our ecosystem of tooling. But this isn’t a tooling thing…you could do great storytelling with a private GitHub repo and an email account.

No…we can’t blame our tools here. This one is on us. I think the reasons for this state of the world are instructive and worth exploring.

Zoom level.

Journalists generally stay zoomed out reasonably far from an issue. They’re always constructing the story in their heads and attempting to figure out where the data points they’re gathering fit into that story. This consistent zoom level throughout the process leads to a consistent focus on and refinement of a narrative.

Data professionals have to constantly change their zoom levels. You start zoomed way out, focused on the business problem to be solved. You then zoom in and in and in, going from analyses to datasets to individual anomalous records. Back out to integrate this information into a more cohesive model, then back in to make and test changes.

This constant alteration of zoom levels, from viewing grains of sand to viewing the entire globe, is disorienting. It takes a ton of experience to navigate these zoom levels appropriately without getting lost in the transitions. And critically, it makes it super-hard to construct a compelling narrative to be consumed by others at your organization—after having been at the grains-of-sand zoom level it becomes very hard to tell the zoomed-out story that will connect with others.

Not my job.

The question “What is the appropriate interface between data professionals and business stakeholders?” is hotly debated today. Depending on whom you ask, the answer could be:

  • Data people deliver datasets for business stakeholders to self-serve on

  • Data people deliver interactive data products for business stakeholders to self-serve on

  • Data people deliver analytical assets that contain the answer to a specific business question

  • Data people deliver explanatory narratives

  • Data people deliver business recommendations

As you go further down this list, each modality requires more judgment, context, and experience.

My opinion? I think experienced data professionals should be able to operate in each one of these modalities and moreover should be able to choose which modality is appropriate for a given task. But that choice in and of itself takes experience, and as a result the higher-order thinking often just doesn’t happen.

When we fail to craft compelling narratives, it’s often because we didn’t even realize we should be doing so.

The curse of the generalist.

We are each scientists and journalists and strategists all at the same time. No one expects preeminent physicists to also write the Popular Science article that makes their work exciting for the general populace. No one expects them to create the commercial strategy that turns their discovery into a product. We acknowledge that these skillsets are each deep fields of expertise in their own right and that humans will generally need to specialize in one of them.

Lucky you, you get to do all three! Not only do you have to discover the “laws of physics” that govern your organization, you have to write about them in a way that makes them accessible to the rest of your organization, and you have to be able to recommend strategic responses to these discoveries!

This is a tall order. If this is how we’re constructing the role of the data scientist/analyst, we should expect that different humans will have different strengths and that the only good answer will be teams whose diversity helps round out the inevitably-lopsided skills of the component humans. A well-rounded and high-functioning team is then an absolute requirement for both the success of the organization and the long-term success of the individual humans that make it up. Expecting individual humans to be world class at every part of the job is simply not reasonable or realistic.

So: when you hire onto your data team, are you interviewing for narrative ability? Do you train on it? Is it a part of your peer review process? If the answers are no/no/no, there’s little reason to expect that your data team would be able to write about their work particularly effectively.

Analytics is a…profession.

All of the above reasons why effective storytelling rarely happens in data can be boiled down to a single, simple thing: analytics is really damn hard to do well. Which doesn’t feel constructive! That feels like me complaining. But that’s not it at all—jobs that are hard tend to follow one of three paths:

  1. Specialize.
    Sometimes it’s possible to decompose a job into many component parts, assign those each to different specialists, and create an assembly line model of production.

  2. Automate
    Sometimes you can externalize the knowledge of a small group of experts into some artifact of technology, which then reduces the expertise needed by individual practitioners.

  3. Professionalize
    If the job is irreducibly hard (like medicine or law or engineering), create the structures (training / credentialing / etc.) that support humans doing that job well even though it is hard.

I think we have to admit that creating and disseminating knowledge in organizations is really, irreducibly hard. I think the answer is actually #3, and I think we’re going through that process as an industry today.

The fact that data work is hard should be the starting point from which we build everything else about the field. The minute you start from a place of “this is hard,” you start making different assumptions about everything.

This is one of the first things I wrote about when starting (then) Fishtown Analytics.

I believe that analytics is a flow-state activity. It rewards concentration, attention to detail, and deep experience. I want to build a company that can train and reward this type of work.

I still 100% agree with this statement, although I want to push it further. There is no bootcamp for analytics that can give you the experience / context / priors / mindset / breadth of skills required to be a great data analyst in ten weeks. You simply have to do the work, alongside great people from whom you can learn, for a really long time. Like…a decade or more.

This shouldn’t be surprising—this is generally what is required to be truly great at anything worth pursuing in life. But to date, the industry hasn’t encouraged us to invest that time and energy. It has given us training programs and degrees that are disconnected from actual practice. It has generally given us shitty tools that don’t allow us to attain higher and higher leverage in our work. It has generally topped out our salaries after a promotion or two. It has generally encouraged us to “get into management” after spending a couple of years as an IC.

All of that is wrong and damaging. If data is irreducibly hard, we need to build amazing training programs, empower practitioners with great tools, pay them really well, and create long and fulfilling career paths that one can spend 30+ years in as an individual contributor.

All of this should sound familiar—it’s how we do it in the professions.

From the rest of the internet…

⚔️ Annika Lewis has a great teardown of the Snowflake / Databricks dynamic.

🧊 If the Snowflake / Databricks story is one of the integration of compute and storage, the launch of Tabular (from the makers of Apache Iceberg) aims to disaggregate these two layers, allowing you to layer multiple compute engines on top of a single storage engine. 💥

💯 Vincent Warmerdam suggests that label accuracy is often a bigger problem than model accuracy:

(…) maybe we should spend less time tuning parameters and instead spend it trying to get a more meaningful dataset.

📊 Alessio Fanelli of 645 Ventures writes one of the better histories of BI that I’ve ever read in his investment announcement for Cube.js. Surprisingly meaty / worthwhile post for an investment announcement ;)

Claire Carroll writes a great explainer on one of the most boringly-named concepts in the world, the OLAP cube. If you spend any time building derived tables, thinking through these problems is mandatory.