
I had the unique pleasure of spending the first 10 months in a new Data Analyst role working on inter-temporal entity resolution!

The company was Ritual: a Toronto-based order-ahead-and-pickup lunch app. When I joined, every new data hire would be asked whether they had any good solutions for maintaining an accurate and current data set of all the restaurants in every major city in the world. *ahem*

It was a proverbial "Sword in the Stone" problem: the company was like a village waiting for a hero to swoop in and pull the sword out. I certainly was NOT that lone hero, but I had some ideas and was put on a pod with a couple of very talented people to try to collectively haul out the sword.

Our solution was built on a Python library called dedupe (https://github.com/dedupeio/dedupe), which I got turned onto while investigating ways to constrain the combinatorial explosion of pairwise comparisons.
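For the curious, the core workflow looked something like this. This is a from-memory sketch against the library's older dict-style field spec; the field names and toy records are illustrative, not our actual schema:

```python
import dedupe

# Toy records keyed by ID; in practice these came from several
# restaurant data sources, with plenty of missing fields.
data = {
    1: {"name": "Joe's Pizza", "address": "123 Queen St W", "city": "toronto"},
    2: {"name": "Joes Pizza", "address": "123 Queen Street West", "city": "toronto"},
    3: {"name": "Burrito Bandidos", "address": "120 Peter St", "city": "toronto"},
}

# Which fields to compare, and how. "has missing" tells the model
# a field can legitimately be absent.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String", "has missing": True},
    {"field": "city", "type": "Exact"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Active learning: dedupe shows you candidate pairs in the console
# and asks whether they are the same restaurant.
dedupe.console_label(deduper)
deduper.train()

# partition() applies learned blocking rules so it never has to score
# all n*(n-1)/2 pairs, then clusters the records it thinks match.
clusters = deduper.partition(data, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```

That blocking step was the whole appeal: instead of scoring every pair, dedupe learns predicates that only ever compare records likely to match.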

But guess what? Even with a serious amount of model training from us and workers on Mechanical Turk, it still didn't work that well. We still had duplicate restaurants and closed restaurants in our sales pipelines and our city launch priority estimates.

The best solution we landed on was to pay for the single most comprehensive data source, rather than trying to blend five different, less-complete sources.

Reflecting now, it was an amazing problem to work on, most of all because it taught me that there's sometimes a limit to what an engineered solution can do: sometimes you just have to go about it another way.

Final thought: I’ll never forget talking about the problem with a good friend, on a sunny afternoon in a park, about a month into working on it.

I explained the intricacies of the problem: the incomplete and missing fields, the O(n^2) complexity of comparing every record against every other (at even 100k restaurants, that's roughly five billion pairs), etc.

Finally, he looked at me and said: "Have you tried fuzzy matching on name? Maybe that would work?"

I had to leave the park early.
