Editor’s note - before you dive into today’s regularly scheduled Roundup, we have an ask:
If you haven’t already, please do take a couple of minutes to fill out the State of Analytics Engineering report survey - the survey closes at the end of the month. Once we crunch the numbers, you’ll be able to see the aggregated data in our 2024 version of the report.
Intro
Mike Powell fell to his knees in dismay: he had just fouled up the biggest opportunity of his career. He was locked in a long jump competition against the titan of the track, Carl Lewis, an athlete so dominant that he had a 65-meet winning streak and hadn’t been defeated in over 10 years.1 On his 4th attempt, the roar of the crowd let Powell know that his jump was big. But then the red flag from the official made one thing clear: he had fouled at the take-off board and his mark would not count.
How do you recover when your golden opportunity seems to slip through your fingers like sand?
In long jump, as in data, accuracy matters. Now back to the story.
With his attempts running out, Powell began his well-practiced run-up and gained momentum as he approached the take-off board. He hit the perfect placement and flew through the air. The moment he splashed into the sand at the 9-meter mark, the crowd knew he had done something special — when the official mark was announced, he had leaped further than any human in history, surpassing Bob Beamon’s seemingly unbreakable record and defeating Carl Lewis in the process. And his record has stood the test of time: it remains unbroken 33 years later.
It took me a while to make a connection between data work and my personal experiences as a long jump coach. But it dawned on me recently that there were more similarities than I’d thought. After all, we love to learn from other disciplines here on the Roundup. Most of the time it’s software engineering, but we’ll take inspiration wherever we can find it. When it comes to fouls in data and in jumping, one thing is definitely the same: they’re bound to happen from time to time, but you’d rather they happen in practice than in competition.
There are several parallels between going for the gold in the long jump and building trusted data sets using software-inspired CI/CD practices:
Look before you leap: Before the competition starts, long jumpers meticulously practice every step of their run-up. In our world, we can use unit tests to give us “practice runs” through all kinds of scenarios in our data.
Toe the line: In long jumping, an eagle-eyed official watches each jump to make sure it is fair. In our work, we use data tests at transformation time to make sure we spot any fouls long before our stakeholders would notice.
Put your best foot forward: Each long jumper gets six attempts, with only the best one counting. We can use patterns like blue-green deployments to ensure we’re putting only the best, most accurate datasets in front of our data consumers.
Let’s walk through each of these and then talk about a generalizable framework for creating trusted datasets.
Look before you leap: practice run-ups and using unit tests
If you watch a long jumper closely, you may notice a meticulously-placed mark on the runway where they begin their run-up. The best place for the athlete to start from will depend on the conditions of the day (is it windy?), how the athlete is feeling (”feeling fast today!”), and how well they are executing their running steps (over-striding or under-striding). When they do their practice run-ups, the athlete and coach will gauge if the athlete’s starting point is dialed-in or needs some adjustments.
They will do multiple practice run-ups before the competition starts rather than winging it (“YOLO!”). Better to discover any risks before doing it live, right?
In analytics engineering, software-inspired unit tests serve as these practice run-ups. They ensure each transformation performs as expected before integrating it into the larger project. Just as a long jumper will use practice run-ups to make sure they are ready to compete, developers can use unit tests beforehand to make sure their code is ready for a variety of scenarios when it’s “go time” for building production data.
Why and how of unit testing
Daniel Terhorst-North believes “the purpose of testing is to increase confidence for stakeholders through evidence.”2
This is a “why” for testing in general that we can apply to unit tests specifically, but what about “how”?
At its simplest, here’s how to approach unit testing:
identify a potential situation that a stakeholder isn’t confident about
produce the evidence that the transformation works as expected
Automated unit testing frameworks3 can produce this evidence – they just need to know the potential situation and the expected behavior.
These are called test fixtures (AKA “mocks”),4 which are split into two groups:
Given input(s) (“the situation”)
Expected output (“the expected behavior”)
From there, the automated unit testing framework can produce the evidence you are looking for to be confident before production, just as a coach and athlete are looking for confidence they are ready to compete.
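To make that concrete, here’s a minimal, framework-agnostic sketch of those fixtures in plain SQL (the CTE, table, and column names are illustrative, not tied to any particular project or tool): the “given” inputs are hard-coded rows, the transformation under test is a simple aggregation, and the final query returns rows only where actual and expected disagree.

```sql
-- A minimal unit test sketch in plain SQL (illustrative names throughout).
with given_orders as (                       -- the "situation": mocked input rows
    select 1 as customer_id, 50.0 as amount
    union all
    select 1, 25.0
    union all
    select 2, cast(null as decimal)          -- edge case: an order with no amount
),

actual as (                                  -- the transformation under test
    select customer_id, coalesce(sum(amount), 0.0) as lifetime_value
    from given_orders
    group by customer_id
),

expected as (                                -- the "expected behavior"
    select 1 as customer_id, 75.0 as lifetime_value
    union all
    select 2, 0.0
)

-- Symmetric difference: any row here is evidence of a mismatch.
(select * from actual except select * from expected)
union all
(select * from expected except select * from actual);
```

An empty result is the evidence you were looking for; any returned row points to the exact scenario that didn’t behave as expected.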
Toe the Line: avoiding fouls and using data tests
When an athlete launches into the air in the long jump, whether their distance can count or not hinges on a single, critical assessment: their foot placement at takeoff. This assessment is binary — it’s either a fair jump or a foul. An eagle-eyed official sitting right at the take-off board will make the call by raising a white or red flag. The combination of a clear boundary line and a skilled official preserves the fairness of the competition.
Data tests are akin to the official’s call on an athlete's foot placement at takeoff. The result of each test is like a white or red flag — is the code performing as expected, or is there an error, a “foul”, that needs correction?
How of data testing
Data tests are all about defining a bright line between good and bad, fair and foul, passing or failing.
It's like saying, "Let's start by assuming all the data is good unless we have a tangible reason to think it’s not."5
The crux is determining each of the scenarios that are unacceptable to stakeholders.
If it is unacceptable for the customer_id column to have any null values, then the applicable data test would merely check if that column has any null values.
In that way, each data test just defines the criteria that matches a particular unacceptable scenario.
If any data is found that matches one of these criteria, then we have tangible reason to think the data isn’t good. Otherwise, we can keep the positive assumption that everything is okay.6
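As a concrete sketch, the null-customer_id rule above could be written as a query that selects exactly the rows matching the unacceptable scenario (the schema and table names here are placeholders, not from the article):

```sql
-- Hypothetical data test for the rule "customer_id must never be null".
-- Zero rows returned is a white flag; any row returned is a red flag and
-- tangible evidence that the data isn't good.
select *
from analytics.customers       -- placeholder schema and table
where customer_id is null;
```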
In the case of Mike Powell’s oh-so-close 4th attempt, mere centimeters separated foul from fair, but the official made the correct call because they could clearly see his toe over the line.
Put your best foot forward: blue-green deployments
Mike Powell actually fouled his 6th and final attempt. But thankfully for him, it’s not your most recent attempt that counts in long jump, just the best one.
Wouldn’t it be great if your data worked the same way?
In her presentation, “Whoops, the numbers are wrong!”, Michelle Ufford shared a pattern to do just that. She described keeping data in a staging area until the data has been checked for all the unacceptable criteria. As long as it’s all good, then it can be released from the staging area so that consumers can access it.
Many of you will recognize this pattern but call it different names. Some in the software engineering world call it blue-green deployment7, while others call it red-black. For the data world, Michelle called it “WAP” and commented: “It stands for Write-Audit-Publish. That’s what happens when you let engineers name things.”
Since blue-green, red-black, WAP, etc. are all essentially the same thing with different names, I feel at liberty to take my own shot at naming this pattern for the data space by calling it SWAP:
Stage Writes
Audit (with data tests)
Publish
The rules of long jump are like a blue-green deployment: when Powell made his massive leap, the official keeping track simply swapped out his previous best mark for his improved one. But when he fouled his final jump, he still got to keep his best one. Likewise, a previous version of the data set can be swapped out for the newly built one after it passes all its data tests. But if any data tests don’t pass for the new version, then it is prevented from being promoted.
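For a rough idea of what SWAP can look like in practice, here is a hedged sketch in plain SQL. It assumes a warehouse that supports schema renames, and all schema and table names are placeholders; real implementations (dbt jobs, orchestrators, Snowflake-style swaps, etc.) differ in the details.

```sql
-- A sketch of the SWAP pattern; exact syntax and atomicity guarantees
-- vary by warehouse, and all names here are placeholders.

-- 1. Staged Writes: build the new version where consumers can't see it.
create schema if not exists analytics_staging;
create table analytics_staging.customers as
    select * from transformed.customers;     -- placeholder transformation

-- 2. Audit: run the data tests against the staged tables (for example,
--    the not-null check from earlier). Only proceed if every audit
--    query returns zero rows.

-- 3. Publish: swap the audited schema into the place consumers query.
alter schema analytics rename to analytics_previous;
alter schema analytics_staging rename to analytics;
drop schema analytics_previous cascade;
```

If any audit query returns rows, the publish step never runs, so consumers keep seeing the previous version that already passed its tests, just like Powell keeping his best mark after a fouled final jump.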
Putting it all together: SWIP-SWAP
To wrap it up, we saw some of the ways in which Mike Powell set the long jump world record and how they compare to practices in the analytics engineering realm.
Bringing the triad of unit testing, data testing, and blue-green deployments together gives us the following name8:
SWIP - Software-Inspired Practices (unit testing)
SWAP - Staged Writes, Audit (via data testing), Publish
This is similar to the Continuous Integration / Continuous Deployment (CI/CD) patterns from software – each applies at a different time in the life cycle. SWIP applies during hands-on development and CI processes before merging code changes. Then SWAP applies when re-building production data sets.
Just like Mike Powell achieved his “tip-top-hop” and won the gold, analytics engineers can use a “SWIP-SWAP” pattern to build top-notch data sets that can go the distance.
Philly fliers: Mike Powell was born in Philadelphia. Carl Lewis grew up in suburban Philly and is a devout Eagles fan.
Who counts as a stakeholder? Anyone who participates in making the data set or using it! This would include the analytics dev team, the developer of a BI dashboard that depends on it, the person using the dashboard to make business decisions, etc.
dbt v1.8 will include a native unit testing framework
The first property that Kent Beck describes for good unit tests is that they are “isolated”, which is accomplished via test fixtures for the inputs and outputs. Interestingly, two of Kent’s other properties explicitly mention “confidence”, so that seems to be a common theme.
This positive assumption is called the “null hypothesis”.
Even a single failing data test is all you need to reject the null hypothesis. Great data tests avoid type I errors (false positives) by focusing on scenarios that are known to be bad, not just potential anomalies that could go either way. Data tests ≠ anomaly detection!
Daniel Terhorst-North was one of the originators of blue-green deployments and Behavior-driven development (BDD).
In software, there is always "code" and "data". Unit tests test the code, data tests test the data.
I still don't think we have a good framework in the modern data stack to test SQL code. Some libraries are out there in their infancy, but today I would guess most orgs simply do not unit test their SQL code. It is too difficult.
Data tests, of course, are much better supported. Nevertheless, nearly everyone defaults to the same tests of "not null" and "unique", which are frankly table stakes and not worth repeating. What about snapshotting data and testing for large (unexpected) changes in the distribution of data over time? Business teams always want to know "what changed and why", which is simply infeasible if you don't snapshot (CDC, SCD) data. This type of thinking should really become a first-class citizen in data testing.
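As one hedged sketch of what such a distribution check could look like, assuming a snapshot table populated by an earlier job (all names here are illustrative):

```sql
-- Illustrative distribution check: compare today's row counts per segment
-- against a previously snapshotted baseline and flag large swings.
with today as (
    select segment, count(*) as n
    from analytics.orders                        -- placeholder table
    group by segment
),

baseline as (
    select segment, n
    from analytics.orders_rowcount_snapshot      -- captured by an earlier job
)

select
    t.segment,
    b.n as baseline_rows,
    t.n as current_rows
from today as t
join baseline as b using (segment)
where abs(t.n - b.n) > 0.2 * b.n;                -- >20% swing looks suspicious
```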
Also it is hard to talk about data quality without discussing versioning. Definitions (and thus SQL logic) change - how do we version our data and our data models? How do we sunset old data and notify clients/dependencies? dbt has some interesting new developments here, but versioning is hard and best practices here should be at the forefront of ensuring data quality over time.