
In software there is always "code" and "data". Unit tests test the code; data tests test the data.

I still don't think we have a good framework in the modern data stack for unit testing SQL code. A few libraries exist, but they are in their infancy, and I would guess most orgs today simply do not unit test their SQL code: it is too difficult.
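The closest workable approach I've seen is to pull the SQL out and run it against hand-written fixtures in an in-memory engine. A minimal sketch using DuckDB and a pytest-style test (the "model", table, and column names are all made up):

```python
import duckdb

# Hypothetical "model" under test: revenue per customer, excluding refunds.
REVENUE_SQL = """
    SELECT customer_id, SUM(amount) AS revenue
    FROM orders
    WHERE status != 'refunded'
    GROUP BY customer_id
"""

def test_refunded_orders_are_excluded():
    con = duckdb.connect()  # in-memory database, nothing touches the warehouse
    # Arrange: a tiny hand-written input table containing the edge case.
    con.execute("""
        CREATE TABLE orders AS
        SELECT * FROM (VALUES
            (1, 'a', 100.0, 'paid'),
            (2, 'a',  50.0, 'refunded'),
            (3, 'b',  25.0, 'paid')
        ) AS t(order_id, customer_id, amount, status)
    """)
    # Act: run the SQL under test.
    rows = dict(con.execute(REVENUE_SQL).fetchall())
    # Assert: refunded orders do not count toward revenue.
    assert rows == {'a': 100.0, 'b': 25.0}
```

The point is less the tooling than the shape of the test: known inputs, the SQL logic, and an exact expected output, runnable locally in milliseconds.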

Data tests, of course, are much better supported. Nevertheless, nearly everyone defaults to the same "not null" and "unique" tests, which are frankly table stakes and not worth repeating. What about snapshotting data and testing for large (unexpected) changes in its distribution over time? Business teams always want to know "what changed and why", which is simply infeasible if you don't snapshot data (via CDC or slowly changing dimensions). This kind of thinking should become a first-class citizen in data testing.
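To make that concrete, here is a rough sketch of the kind of drift check I mean, assuming two snapshot tables already exist (the table names, column, and 10% threshold are all made up):

```python
import duckdb

def check_category_drift(con, prev_table, curr_table, column, threshold=0.10):
    """Return categories whose share of rows shifted by more than `threshold`."""
    rows = con.execute(f"""
        WITH prev AS (
            SELECT {column} AS k,
                   COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS share
            FROM {prev_table} GROUP BY 1
        ),
        curr AS (
            SELECT {column} AS k,
                   COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS share
            FROM {curr_table} GROUP BY 1
        )
        SELECT COALESCE(prev.k, curr.k) AS k,
               COALESCE(prev.share, 0) AS prev_share,
               COALESCE(curr.share, 0) AS curr_share
        FROM prev FULL OUTER JOIN curr ON prev.k = curr.k
    """).fetchall()
    # Flag any category whose share of rows moved by more than the threshold.
    return [(k, p, c) for k, p, c in rows if abs(c - p) > threshold]
```

Run something like this after every snapshot load and alert on a non-empty result, and the "what changed and why" question has an answer before the business asks it.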

It is also hard to talk about data quality without discussing versioning. Definitions (and thus SQL logic) change: how do we version our data and our data models? How do we sunset old data and notify downstream clients and dependencies? dbt has some interesting new developments here, but versioning is hard, and best practices should be at the forefront of ensuring data quality over time.
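As a rough illustration of the versioned-model pattern (not dbt's actual mechanism; all names here are made up), the old and new definitions can stay live side by side, with an unversioned alias pointing at the latest one until the old version's announced sunset date:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE raw_customers(customer_id INT, full_name TEXT, country_code TEXT)")

# v1: the definition existing clients already depend on.
con.execute("""
    CREATE VIEW dim_customers_v1 AS
    SELECT customer_id, full_name FROM raw_customers
""")

# v2: the new definition after a logic change (adds country_code).
con.execute("""
    CREATE VIEW dim_customers_v2 AS
    SELECT customer_id, full_name, country_code FROM raw_customers
""")

# The unversioned name always points at the latest version; v1 stays
# queryable until its announced sunset date, then gets dropped.
con.execute("CREATE VIEW dim_customers AS SELECT * FROM dim_customers_v2")
```

The hard part, as noted, is not the mechanics but the contract: announcing the sunset, notifying dependencies, and actually removing v1 on schedule.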
