Part 4 Matching Structured Records: Why is it so difficult?
If you are not familiar with the game whack-a-mole, you use a mallet to whack each mole as it pops up, but as soon as you knock one down another appears, in a never-ending struggle. I use this term often in matching, especially when fine-tuning match issues. You can never get a perfect match implementation because there are always counterexamples, so you make changes only where they are a net positive. It's very common to spot a match issue and want to jump in and fix it, but every fix affects other record pairs, and you need to think about those too.
Here's a simple example. Anonymous values are very common in any solution. They are words that should be removed before comparing two attributes. "Incorporated" is often an anonymous value because people commonly drop it when listing a company name: Apple Incorporated or simply Apple. When you compare Apple with Apple Incorporated, you generally get penalized for the extra word that doesn't match. So the fix seems obvious: anonymize the word and you end up comparing Apple with Apple, a perfect match.
But there's always a counterexample, and in this case it is that company names are not unique. There can be many companies named Apple in different industries, or a small business could be named Apple. You might have started with two records, Apple LLC and Apple Incorporated, but once you anonymize both LLC and Incorporated you end up comparing Apple to Apple: a perfect match on paper, even though they may be two different companies. My best advice is to always look for the counterexamples whenever you make a change and try to assess whether the change is a net positive.
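The anonymous-value trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list, the `name_score` function and the use of token overlap (Jaccard similarity) as the comparison are all hypothetical choices, not how any particular matching product scores names.

```python
import re

# Hypothetical list of anonymous values; a real solution would tune this per data set.
ANONYMOUS_VALUES = {"incorporated", "inc", "llc"}


def tokens(name, anonymize=True):
    """Lower-case the name, split it into words, optionally drop anonymous values."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    if anonymize:
        words = [w for w in words if w not in ANONYMOUS_VALUES]
    return set(words)


def name_score(a, b, anonymize=True):
    """Token overlap (Jaccard similarity) between two names, from 0.0 to 1.0."""
    ta, tb = tokens(a, anonymize), tokens(b, anonymize)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# The extra word is penalized when nothing is anonymized.
print(name_score("Apple", "Apple Incorporated", anonymize=False))  # 0.5
# Anonymizing "Incorporated" fixes that pair...
print(name_score("Apple", "Apple Incorporated"))  # 1.0
# ...but two possibly different companies now also look identical.
print(name_score("Apple LLC", "Apple Incorporated"))  # 1.0
```

The last line is the mole popping back up: the same change that repairs the first pair makes Apple LLC and Apple Incorporated indistinguishable on name alone, so the other attributes have to carry the decision.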
The last technology to discuss is machine learning, and within that category specifically supervised learning. Supervised learning is the branch of machine learning where you first teach the machine how to do its job, by providing it with scenarios that have known outcomes as answers so that it can learn. The other branch, unsupervised learning, attempts to learn directly from the data without being given answers to specific scenarios.
Machine learning is extremely well suited to matching because it is essentially a pattern recognition solution. It is used for facial recognition, photo identification, fingerprint analysis and so on. Applying it to matching is a natural path and one that we've found very successful. The advantage machine learning has over a probabilistic approach is that it is not linear, meaning that it doesn't rely on a single score threshold. It can learn to distinguish records by evaluating their different attributes, much the same way a human would look at the match.
Simplified Example:
Pair 1: John Smith 1 Main Street Austin 555-1212 44-555-6666
        John Smith 2 Elm Street Dallas 444-1212 44-555-6666
Pair 2: Jim Smith 1 Main Street Austin 555-1212
        John Smith 1 Main Street Austin 444-1212 44-555-6666
In this example both pair 1 and pair 2 received the same score from the probabilistic match process, and that score falls below the automatic-match threshold. Neither pair is treated as a match, even though pair 1 appears to be a match and pair 2 does not. A human scans each attribute and makes an assessment along these lines: "In pair 1 the names and SSNs match exactly, and even though the addresses and phones are different, it's likely the person moved, so this is a match." But for pair 2: "The name is not a perfect match, the phones are different and the SSN is missing on one record, so this is not a match; it appears that two people, Jim and John, perhaps live at the same apartment complex." Machine learning can learn to assess those records the same way a human does, and can reach equivalent accuracy.
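To make the contrast concrete, here is a minimal Python sketch of the two pairs above. Everything numeric in it is an assumption made for illustration: the attribute weights and threshold are hypothetical values picked so that the two pairs tie, the handful of labelled training vectors is invented, and a one-nearest-neighbour lookup stands in for whatever supervised model a real solution would train.

```python
ATTRS = ("first", "last", "street", "city", "phone", "ssn")

# Hypothetical weights and threshold, chosen so the two example pairs tie.
WEIGHTS = {"first": 1, "last": 1, "street": 2, "city": 1, "phone": 1, "ssn": 2}
AUTO_MATCH_THRESHOLD = 5


def compare(rec_a, rec_b):
    """Per attribute: 1.0 = agree, 0.0 = disagree, 0.5 = missing on either side."""
    vec = []
    for attr in ATTRS:
        a, b = rec_a.get(attr), rec_b.get(attr)
        if a is None or b is None:
            vec.append(0.5)
        else:
            vec.append(1.0 if a == b else 0.0)
    return tuple(vec)


def linear_score(vec):
    """Simplified additive score: sum the weights of the attributes that agree."""
    return sum(WEIGHTS[attr] for attr, v in zip(ATTRS, vec) if v == 1.0)


# Invented, hand-labelled comparison vectors (the "scenarios with known outcomes").
TRAINING = [
    ((1, 1, 1, 1, 1, 1), "match"),
    ((1, 1, 0, 0, 1, 1), "match"),        # moved, kept the phone number
    ((1, 1, 0, 1, 0, 1), "match"),
    ((1, 1, 1, 1, 0, 0.5), "match"),
    ((0, 1, 1, 1, 1, 0.5), "non-match"),  # relative at the same address
    ((0, 1, 1, 1, 0, 0), "non-match"),
    ((0, 0, 1, 1, 0, 0.5), "non-match"),
]


def predict(vec):
    """One-nearest-neighbour: return the label of the closest training vector."""
    def distance(example):
        return sum((x - y) ** 2 for x, y in zip(example[0], vec))
    return min(TRAINING, key=distance)[1]


john_austin = dict(first="John", last="Smith", street="1 Main Street",
                   city="Austin", phone="555-1212", ssn="44-555-6666")
john_dallas = dict(first="John", last="Smith", street="2 Elm Street",
                   city="Dallas", phone="444-1212", ssn="44-555-6666")
jim_austin = dict(first="Jim", last="Smith", street="1 Main Street",
                  city="Austin", phone="555-1212", ssn=None)
john_austin_2 = dict(first="John", last="Smith", street="1 Main Street",
                     city="Austin", phone="444-1212", ssn="44-555-6666")

vec1 = compare(john_austin, john_dallas)
vec2 = compare(jim_austin, john_austin_2)

for label, vec in (("Pair 1", vec1), ("Pair 2", vec2)):
    score = linear_score(vec)
    print(label, "score:", score,
          "auto-match:", score >= AUTO_MATCH_THRESHOLD,
          "learned:", predict(vec))
```

With these assumed weights both pairs score 4 and miss the threshold of 5, so the score alone cannot tell them apart. The nearest-neighbour lookup labels pair 1 a match and pair 2 a non-match because it reacts to which attributes agree, not only to how much agreement there is in total.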
I hope this series was helpful. More topics to come...