Many believe that all match solutions produce similar results. Unfortunately, that is far from the case. The most difficult part of assessing a match solution is that you don't know which records should have been matched.
Every solution will produce some number of matches, and because it does, it's assumed to have found all of them. Most solutions miss matches, and some can miss up to 30%.
Here's an example of how match accuracy progresses. Start by creating a match solution that simply matches records on exact string equality.
A simple approach like this will have some success. However, it will find only 1-5% of matches, depending on data quality.
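As a minimal sketch of this starting point, the snippet below pairs records only when the chosen fields are identical strings. The records, field names, and the `exact_matches` helper are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical records; field names and values are illustrative only.
records_a = [
    {"id": "a1", "name": "Robert Smith", "dob": "2-5-1982"},
    {"id": "a2", "name": "Ann Lee", "dob": "7-11-1990"},
]
records_b = [
    {"id": "b1", "name": "Robert Smith", "dob": "2-5-1982"},  # identical
    {"id": "b2", "name": "Robert Smith", "dob": "2-4-1982"},  # one digit off
    {"id": "b3", "name": "ann lee", "dob": "7-11-1990"},      # case differs
]

def exact_matches(left, right, fields=("name", "dob")):
    """Pair records whose listed fields are identical strings."""
    # Index the right-hand records by the tuple of compared field values.
    index = {}
    for rec in right:
        key = tuple(rec[f] for f in fields)
        index.setdefault(key, []).append(rec["id"])
    # Look up each left-hand record by the same key.
    pairs = []
    for rec in left:
        key = tuple(rec[f] for f in fields)
        for other_id in index.get(key, []):
            pairs.append((rec["id"], other_id))
    return pairs

matches = exact_matches(records_a, records_b)
print(matches)
```

Only the pair that is identical character for character is found. The record with one digit off and the record that differs only in letter case are both missed.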
Then you find missed matches. One example is a single-digit error in the date of birth (2-5-1982 vs. 2-4-1982). Adjustments are made to accommodate this. Then a two-digit error is discovered as a missed match, and another adjustment is made.
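One way such an adjustment might look is to compare dates with an edit distance instead of strict equality. The sketch below uses the optimal string alignment distance, which counts insertions, deletions, substitutions, and swaps of adjacent characters. The choice of this distance, the `dob_close` helper, and the `max_edits` threshold are assumptions for illustration, not a description of any particular product.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance.

    Insertions, deletions, substitutions, and swaps of two adjacent
    characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent swap, e.g. "82" vs "28".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

def dob_close(a: str, b: str, max_edits: int = 1) -> bool:
    """Hypothetical rule: accept dates within max_edits of each other."""
    return osa_distance(a, b) <= max_edits

print(dob_close("2-5-1982", "2-4-1982"))               # one digit differs
print(dob_close("2-5-1982", "2-5-1928"))               # two adjacent digits swapped
print(dob_close("2-5-1982", "3-4-1982"))               # two digits differ
print(dob_close("2-5-1982", "3-4-1982", max_edits=2))  # loosened rule
```

Each loosening of the rule recovers more missed matches, but the same threshold also accepts any other date that is one edit away, so every adjustment has to be weighed against the false matches it admits.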
Then, more differences are discovered.
Misspellings, nicknames, initials, phonetic differences, addresses, phone numbers...
More adjustments need to be made to accommodate these differences.
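Several of the differences listed above can be reduced by normalizing values before they are compared. The following is a rough sketch that covers only phone formatting, nicknames, and initials. The nickname table is a deliberately tiny, hypothetical stand-in, and all function names are invented for this example.

```python
import re

# Hypothetical, deliberately tiny nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william", "liz": "elizabeth"}

def normalize_phone(raw: str) -> str:
    """Keep digits only, so differently punctuated numbers compare equal."""
    return re.sub(r"\D", "", raw)

def normalize_first_name(raw: str) -> str:
    """Lowercase, strip non-letters, and map known nicknames to a formal name."""
    name = re.sub(r"[^a-z]", "", raw.lower())
    return NICKNAMES.get(name, name)

def first_names_compatible(a: str, b: str) -> bool:
    """True if the names agree after normalization, or one is a matching initial."""
    na, nb = normalize_first_name(a), normalize_first_name(b)
    if not na or not nb:
        return False  # a missing name is not evidence of agreement
    if na == nb:
        return True
    # Treat a lone initial as compatible with any name starting with that letter.
    if len(na) == 1 or len(nb) == 1:
        return na[0] == nb[0]
    return False

print(normalize_phone("(555) 010-4477") == normalize_phone("555.010.4477"))
print(first_names_compatible("Bob", "Robert"))
print(first_names_compatible("R.", "Robert"))
print(first_names_compatible("Bob", "William"))
```

Each rule here handles one kind of difference. Misspellings, phonetic variants, and address formats would each need their own handling, which is how the list of adjustments keeps growing.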
In a standard record set consisting of 5-7 attributes (name, address, phone, etc.), there will be hundreds of thousands of different potential outcomes. The more accurate the solution, the more of these patterns it will need to incorporate.
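A back-of-the-envelope calculation shows how quickly the number of patterns grows. Assume, purely for illustration, that comparing one attribute can land in one of six outcomes. The six levels below are an assumption, not a figure from this article. The number of distinct patterns is then the number of levels raised to the number of attributes.

```python
import math

# Assumed outcome levels for one attribute comparison (illustrative only).
LEVELS = [
    "exact",
    "transposed",
    "misspelled",
    "phonetic or nickname",
    "different",
    "missing",
]

def pattern_count(num_attributes: int, levels_per_attribute: int = len(LEVELS)) -> int:
    """Number of distinct comparison patterns across all attributes."""
    return math.prod([levels_per_attribute] * num_attributes)

for n in (5, 6, 7):
    print(n, pattern_count(n))
# 5 7776
# 6 46656
# 7 279936
```

Under these assumed six levels, seven attributes already yield 279,936 patterns, and a finer breakdown of outcomes per attribute pushes the count higher still.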
The challenge is that the effort involved in creating a highly accurate solution is not linear. As you invest more time accommodating the differences, each additional gain in accuracy takes more effort than the last (as seen below). Many in-house efforts stop significantly short of the accuracy that can be achieved but, as stated earlier, it's difficult to know what you're missing.
Probably the most difficult aspect of building a match solution is balancing all of the attributes against one another.
Matching is not as simple as checking whether two names are pretty close or whether the SSN has a one-place transposition. All attributes must be taken into consideration together.
For example: if two records have exactly the same name and address but completely different SSNs, should they be considered a match?
Now, what if the SSN is only one transposition off?
What if the SSNs are exactly the same, but the first names are different?
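To make the balancing problem concrete, here is a minimal sketch of a simple additive scoring scheme applied to the three questions above. Every weight and threshold is hypothetical and chosen only for illustration. The third scenario also assumes the addresses agree, which the question itself does not specify.

```python
# Hypothetical weights: points awarded per attribute at each agreement level.
WEIGHTS = {
    "name":    {"exact": 4, "close": 2, "different": -3},
    "address": {"exact": 3, "close": 1, "different": -1},
    "ssn":     {"exact": 6, "close": 3, "different": -5},
}
MATCH_AT = 8   # hypothetical threshold for an automatic match
REVIEW_AT = 4  # hypothetical threshold for manual review

def score(comparison: dict) -> int:
    """Sum the weight of each attribute's agreement level."""
    return sum(WEIGHTS[attr][level] for attr, level in comparison.items())

def decide(comparison: dict) -> str:
    """Map a total score to a decision using the assumed thresholds."""
    total = score(comparison)
    if total >= MATCH_AT:
        return "match"
    if total >= REVIEW_AT:
        return "review"
    return "non-match"

scenarios = {
    "same name and address, different SSN": {"name": "exact", "address": "exact", "ssn": "different"},
    "same name and address, SSN one transposition off": {"name": "exact", "address": "exact", "ssn": "close"},
    "same SSN and address, different first name": {"name": "different", "address": "exact", "ssn": "exact"},
}

for label, comparison in scenarios.items():
    print(f"{label}: score={score(comparison)}, decision={decide(comparison)}")
```

With these particular numbers the three scenarios come out as non-match, match, and review. Shifting a single weight or threshold changes those answers, which is exactly why balancing the attributes is so hard to get right by hand.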
These are just a few examples of the hundreds of thousands of differences that can occur when matching data. Resolving them correctly is the central goal of matching, regardless of the technology applied. It then becomes a question of which technical approach works best, that is, which one addresses the most cases.
Having worked with all three approaches (rules-based, probabilistic, and machine learning), we can say that machine learning addresses more of these hundreds of thousands of cases than the other technologies, and by a significant margin.