Part 2 Matching Structured Records: Why is it so difficult?
Updated: Jun 15, 2019
When you visually compare two records and assess whether you believe they are a match, you compare each attribute individually and then weigh all the attributes together. In the John Smith example, you first compare names and note that they match exactly, then note that the city, state, and zip match but the street line is missing, and so on. Your outcome is influenced not only by each individual attribute but by how they all participate. The best match technology attempts to calculate the outcome the same way, but what makes this very difficult is the number of different scenarios that can be encountered. Starting with person name alone, here are 8 of the many ways you could find two names being compared:
v1: john robert smith vs john robert smith
v2: john smith vs john robert smith
v3: johnathan smith vs john smith
v4: john smiht vs john smith
v5: john bob smith vs john robert smith
v6: robert smith vs john robert smith
v7: smith vs john smith
v8: <blank> vs john smith
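To make the scenarios concrete, here is a minimal sketch that classifies a pair of names into a few of the categories above. The nickname table, similarity threshold, and category names are illustrative assumptions, not any real MDM product's algorithm:

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table (assumed, not exhaustive).
NICKNAMES = {"bob": "robert", "john": "johnathan"}

def canon(token):
    # Map a nickname to its formal form if we know one.
    return NICKNAMES.get(token, token)

def classify(a, b):
    """Roughly bucket a name pair into one of the scenarios above."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return "missing"                      # v8: one side is blank
    if ta == tb:
        return "exact"                        # v1: identical names
    if [canon(t) for t in ta] == [canon(t) for t in tb]:
        return "nickname"                     # v3, v5: bob vs robert
    if set(ta) < set(tb) or set(tb) < set(ta):
        return "subset"                       # v2, v6, v7: tokens missing
    if SequenceMatcher(None, a, b).ratio() > 0.85:
        return "typo"                         # v4: smiht vs smith
    return "different"
```

Even this toy version shows the problem: every new scenario you recognize (phonetic variants, initials, transposed first/last names) multiplies the combinations a real engine has to weigh.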
Let's assume there are only 8 ways to compare a person name (there are a lot more), and that you have 5 attributes, each of which can also be compared in 8 ways. That gives 8 x 8 x 8 x 8 x 8 = 32,768 distinct ways you could encounter records being compared. In good technology like IBM MDM Standard Edition, the 5-attribute scenario can easily reach into the millions, because greater precision in its comparisons produces more possibilities.
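The arithmetic above can be checked directly; the outcome labels below are illustrative placeholders, not real engine categories:

```python
from itertools import product

# 8 hypothetical comparison outcomes per attribute.
outcomes = ["exact", "subset", "nickname", "typo",
            "phonetic", "initial", "partial", "missing"]
attributes = 5

# Every combination of per-attribute outcomes is a distinct scenario.
scenarios = len(outcomes) ** attributes
print(scenarios)  # 8^5 = 32768

# Equivalently, enumerate them one by one:
assert sum(1 for _ in product(outcomes, repeat=attributes)) == 32768
```

Raise either number slightly, say 20 outcomes across 6 attributes, and you are already at 64 million combinations, which is why precise engines reach into the millions of scenarios.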
Importance of Custom Match
Most enterprise MDM platforms customize match behavior for each individual client. So why is this important? I've had many people ask why this matters and why you cannot just use the same match approach for client 1 as for client 2. The simple answer is that you can, if both clients have the same data models, the same expectations around match, and relatively the same data.
The reason this is usually not the case is that data models don't align between customers, and match expectations vary much more than people realize. As one example, several years ago I was implementing IBM MDM for two hospitals in different parts of the country. Both clients had very similar data models (name, address, SSN, DOB, etc.), but when we were building the match algorithm, they differed significantly in how important each attribute was in their environment. Client 1 valued matching SSNs as a fairly significant indicator that two records match. Client 2 put little value on matching SSNs because their environment had a lot of SSN overuse: in many cases fathers would put their own SSN down for their sons (and mothers for their daughters), and they also had a higher incidence of non-US-citizen activity where SSNs were commonly shared. Even though both hospitals had similar data models, they differed significantly in what each considered a match, hence the need for a custom-fit solution.

What is the difference between a custom-fit model and a generic/shared model? Accuracy, viewed as records that were matched but should not have been, and unmatched records that should have matched. Some solutions do not need great accuracy, and a shared match approach is "good enough" to get the job done. Many B2C companies do not employ expensive MDM solutions because they can get 60-70% match accuracy through cheaper tools, and a 30-40% bad match rate is good enough when they are doing things like mailers: sending 30-40% duplicate mailers is worth the expense compared to buying a six-figure MDM solution. This is not the case for healthcare, or for many commercial businesses like banks, where inaccuracy comes at a much steeper cost.
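The hospital example can be sketched as a weighted score where only the weights differ per client. The attribute names, weights, threshold, and records below are all illustrative assumptions, not IBM MDM's actual scoring:

```python
# Hypothetical per-client weights: hospital_1 trusts SSN heavily,
# hospital_2 discounts it because of SSN overuse.
CLIENT_WEIGHTS = {
    "hospital_1": {"name": 3.0, "dob": 2.0, "address": 1.5, "ssn": 5.0},
    "hospital_2": {"name": 3.0, "dob": 2.0, "address": 1.5, "ssn": 0.5},
}

def score(rec_a, rec_b, client, threshold=6.0):
    """Sum the weights of attributes that agree; match if above threshold."""
    weights = CLIENT_WEIGHTS[client]
    total = sum(w for attr, w in weights.items()
                if rec_a.get(attr) and rec_a.get(attr) == rec_b.get(attr))
    return total, total >= threshold

# A father and son sharing one SSN (fabricated sample records).
father = {"name": "james lee", "dob": "1960-04-02",
          "address": "12 oak st", "ssn": "123-45-6789"}
son = {"name": "james lee jr", "dob": "1988-07-19",
       "address": "12 oak st", "ssn": "123-45-6789"}
```

With identical records and logic, hospital_1's weights wrongly call the father/son pair a match, while hospital_2's weights keep it below threshold. That is the whole argument for custom match in miniature: the model is shared, the tuning is not.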