I had the unique pleasure of spending the first 10 months of a new Data Analyst role working on inter-temporal entity resolution!
The company was Ritual: a Toronto-based order-ahead-and-pickup-lunch app. When I joined the company, every new data hire would be asked whether they had any good ideas for maintaining an accurate, current data set of all the restaurants in every major city in the world. *ahem*
It was a proverbial "Sword in the Stone" problem: the company was like a village waiting for a hero to swoop in and pull the sword out. I certainly was NOT that lone hero, but I had some ideas and was put on a pod with a couple of very talented people to try to collectively haul out the sword.
Our solution was built on a Python library called dedupe (from the Dedupe.io folks): https://github.com/dedupeio/dedupe, which I got turned onto while investigating ways to constrain the explosive complexity of pairwise comparisons.
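At a high level, the workflow looked something like the sketch below. This is a minimal sketch against the dedupe 2.x API with toy records and made-up field names, not our actual pipeline:

```python
import dedupe

# Toy records keyed by ID; in practice these came from several blended sources.
# The field names are illustrative, not Ritual's real schema.
records = {
    1: {"name": "Joe's Pizza", "address": "123 Queen St W", "city": "Toronto"},
    2: {"name": "Joes Pizza", "address": "123 Queen Street West", "city": "Toronto"},
    3: {"name": "Burrito Bandidos", "address": "120 Peter St", "city": "Toronto"},
}

# Declare which fields the model should compare.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "city", "type": "Exact"},
]

deduper = dedupe.Dedupe(fields)

# Sample candidate pairs and label a handful interactively; dedupe learns both
# a classifier and blocking rules from the labels, and the blocking rules are
# what keep the comparisons from exploding to all O(n^2) pairs.
deduper.prepare_training(records)
dedupe.console_label(deduper)
deduper.train()

# Cluster records whose pairwise scores clear the threshold.
for cluster_ids, scores in deduper.partition(records, threshold=0.5):
    print(cluster_ids, scores)
```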
But guess what? Even with a serious amount of model training from us and folks on Mechanical Turk, it still didn’t work that well. We still had duplicate restaurants and closed restaurants showing up in our sales pipelines and our city-launch priority estimates.
The best solution we got to was to pay for the most comprehensive data source, rather than trying to blend 5 different, less-complete sources.
Reflecting now, it was an amazing problem to work on, but most of all because it taught me that sometimes there’s a limit to what an engineered solution can do: sometimes you just have to go about it another way.
Final thought: I’ll never forget talking about the problem with a good friend, on a sunny afternoon in a park, about a month into working on it.
I explained the intricacies of the problem: the incomplete and missing fields, the O(n^2) complexity, etc.
Finally, he looked at me and said: "Have you tried fuzzy matching on name? Maybe that would work."
I had to leave the park early.
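For what it's worth, name-only fuzzy matching falls over fast on restaurant data: chains produce perfect name matches that aren't the same restaurant, while genuine duplicates can be listed under quite different names across sources. A toy illustration using just the standard library (the names are made up):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude fuzzy match on name alone."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two different locations of the same chain: a perfect score,
# but they are NOT the same restaurant.
print(name_similarity("Subway", "Subway"))  # 1.0

# The same restaurant listed differently in two sources: scores low,
# even though it IS the same place.
print(name_similarity("Tim Hortons #1234", "Tims - Queen & Spadina"))
```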
I got some other folks responding directly, saying the same thing about firmographic information: maybe some entity resolution is best solved algorithmically and some is best solved via data providers... But even when you get a clean data source, isn't there still matching to be done that requires an algorithmic approach? (I haven't actually had to deal with this personally.)
There certainly is a place for the algorithmic approach! In the anecdote I shared, we were simply lucky to find a vendor with a more complete set of the data we needed.
I would say that is the best first option for these problems: try to acquire a better data set, or fix the underlying gap (e.g. customer tracking that differs between internal systems/services should be unified by the engineering team, rather than blended using an entity resolution model by the analytics team).
If entity resolution matching can't be avoided, then it's really a matter of trade-offs. Specifically, precision vs. recall: https://en.wikipedia.org/wiki/Precision_and_recall.
I think that in most cases, businesses need to trade recall for precision: it's better for a model to only match entities it's VERY certain about, and match fewer of them, than to match more entities but with more mistakes (false positives).
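To make that concrete, here's a tiny sketch with made-up scores and labels showing how raising the match threshold buys precision at the cost of recall:

```python
# Hypothetical candidate pairs with model scores and (unknown in production) true labels.
candidates = [
    {"score": 0.95, "is_true_match": True},
    {"score": 0.80, "is_true_match": True},
    {"score": 0.75, "is_true_match": False},
    {"score": 0.60, "is_true_match": True},
    {"score": 0.55, "is_true_match": False},
]

def precision_recall(threshold: float) -> tuple[float, float]:
    predicted = [c for c in candidates if c["score"] >= threshold]
    true_positives = sum(c["is_true_match"] for c in predicted)
    missed_matches = sum(c["is_true_match"] for c in candidates if c["score"] < threshold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / (true_positives + missed_matches)
    return precision, recall

# A higher threshold means fewer, cleaner matches: precision rises, recall falls.
for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```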
In my experience, a key part of complementing the entity resolution algorithm HAS to be a human-centric feedback/oversight system. This means designing an interface (even something as basic as a Google Sheet) that is populated with proposed matches the algorithm is uncertain about and monitored by a human.
Ideally, when a human reviews those proposed matches (yes, no, unsure), the decisions are also fed back into the model as labelled data. Dedupe.io had all the functionality that let us build a system like that at Ritual!
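Roughly, that loop could look like the sketch below: pull out the pairs the model is least certain about, send them to a reviewer (a CSV standing in for the Google Sheet), and feed the decisions back as labels. The uncertainty band, file name, and record fields are made up; `mark_pairs` and `train` are real dedupe methods:

```python
import csv

# Assume `scored_pairs` came from an earlier scoring step:
# a list of ((record_a, record_b), score) tuples. The band below is arbitrary.
UNCERTAIN_LOW, UNCERTAIN_HIGH = 0.4, 0.8

def export_for_review(scored_pairs, path="pairs_to_review.csv"):
    """Write the uncertain pairs somewhere a human can review them."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name_a", "name_b", "score", "decision"])  # decision: yes / no / unsure
        for (a, b), score in scored_pairs:
            if UNCERTAIN_LOW <= score <= UNCERTAIN_HIGH:
                writer.writerow([a["name"], b["name"], f"{score:.2f}", ""])

def feed_back_labels(deduper, reviewed_pairs):
    """Turn reviewer decisions into labelled training pairs for dedupe.

    `reviewed_pairs` is a list of (record_a, record_b, decision) tuples,
    where each record is the full field dict the model was trained on.
    """
    labels = {"match": [], "distinct": []}
    for a, b, decision in reviewed_pairs:
        if decision == "yes":
            labels["match"].append((a, b))
        elif decision == "no":
            labels["distinct"].append((a, b))
        # "unsure" rows are skipped rather than guessed at
    deduper.mark_pairs(labels)
    deduper.train()
```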
But at the end of the day, the more accurate a business requires the matches to be, the more it will have to rely on human oversight to review samples that don't meet a threshold of certainty.
That's really how I see these models being valuable: not as something fully automatic, but as a complement that highlights the most uncertain edge cases for human review.
I agree here: fully autonomous entity resolution is still some time off, and no AI can predict whether a restaurant has closed or shut down. Confidence scores on the matches, as you rightly mentioned, are key to reducing human effort to a great degree. Until there is context and/or semantic understanding of the records/entities, this is as far as we can get. I would love to see semantic intelligence built up over time... that should be cool!
Totally! Not to be a sideline product manager, but I could see a lot of value in offering a really robust feedback interface as part of a paid/hosted offering of your awesome project!
Being able to give business folks a turnkey solution for reviewing low-confidence matches would probably be compelling for a lot of teams working on these problems.
That is a great suggestion, Teghan; it can only come from experience in this area! May I request some time from you to show you Zingg and get your feedback? Even 15-30 minutes would be super helpful. DMing you on LinkedIn.