• Ken Hubacher

Part 3 Matching Structured Records: Why is it so difficult?

Updated: Jun 15, 2019

There are two main approaches to structured record matching: deterministic and probabilistic.


Deterministic matching is a rule-based approach, and it is typically how people new to matching implement the process. A rule looks something like this: "if name and ssn are exact matches, then this is a match".

I have seen this approach used by several clients prior to implementing an enterprise match solution. On the surface it seems logical, but recall from part 2 that the number of different combinations multiplies very quickly, which means you must create enough rules to cover the different scenarios (in that example, 8x8x8x8x8 = 32,768 combinations). You wouldn't need 32,768 rules, but you would need a considerable number to cover the variations.
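To make the rule growth concrete, here is a minimal sketch of a deterministic matcher in Python. The record layout (dicts with "name" and "ssn" keys), the function names, and the sample records are all hypothetical and are not taken from any client implementation. It starts with the naive exact-match rule and then adds a second rule to cover formatting variations.

```python
# Hypothetical record layout: plain dicts with "name" and "ssn" keys.

def normalize(value):
    """Lowercase the value and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def rule_exact(a, b):
    """The naive rule: name and SSN must be exact string matches."""
    return a["name"] == b["name"] and a["ssn"] == b["ssn"]

def rule_normalized(a, b):
    """A second rule, added to tolerate case and punctuation differences."""
    return (normalize(a["name"]) == normalize(b["name"])
            and normalize(a["ssn"]) == normalize(b["ssn"]))

# Every new kind of variation tends to require another entry in this list.
RULES = [rule_exact, rule_normalized]

def is_match(a, b):
    """A pair matches if any rule in the rule set accepts it."""
    return any(rule(a, b) for rule in RULES)

r1 = {"name": "John Smith", "ssn": "123-45-6789"}
r2 = {"name": "john smith", "ssn": "123456789"}    # formatting differs
r3 = {"name": "Jon Smith", "ssn": "123-45-6789"}   # name is misspelled

print(rule_exact(r1, r2))  # False: the exact rule misses a formatting variation
print(is_match(r1, r2))    # True: the second rule covers it
print(is_match(r1, r3))    # False: a misspelling would need yet another rule
```

Two rules already fail on a one-letter misspelling, which is the kind of gap that keeps the rule list growing.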

Match Accuracy Spectrum

At this point, it's important to note that match accuracy sits on a spectrum. This is not well understood, and it leads to further confusion. You can find quick success when creating a match solution, for example by finding records that have exact string matches between them. But this will typically produce match rates in the 15-25% range, which is very low. Every incremental step toward better match accuracy takes considerably more work than the one before, and as you approach the upper end of accuracy you can spend weeks just getting another 0.5%.

Even among enterprise platforms, match accuracy can vary by 10-20% depending on how well each platform does its job. A difference of just 10% across 1 million records means up to 100,000 records that are not matched correctly.

So when someone builds a solution, it's easy to feel close to the upper end of accuracy while actually being far short of it. But how is accuracy measured? Traditionally you use a data set with known, agreed-upon outcomes and run it through the platform to count how many pairs it correctly matched and how many it correctly left unmatched.
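As a rough sketch of that measurement, the Python below compares the pairs a matcher reported against a small set of agreed-upon true matches. The pair data is invented for illustration, and the terms precision and recall are standard evaluation measures that I am introducing here; the process described above does not prescribe specific metrics.

```python
def evaluate(predicted, truth):
    """Compare predicted match pairs against known true match pairs.

    Both arguments are sets of frozenset({id_a, id_b}), so the order of
    the two record ids inside a pair does not matter.
    """
    true_pos = len(predicted & truth)    # correctly matched
    false_pos = len(predicted - truth)   # matched, but should not have been
    false_neg = len(truth - predicted)   # true matches that were missed
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }

def pair(a, b):
    """Build an order-independent pair of record ids."""
    return frozenset((a, b))

# Invented example: four known matches, three pairs reported by the matcher.
truth = {pair(1, 2), pair(3, 4), pair(5, 6), pair(7, 8)}
predicted = {pair(2, 1), pair(3, 4), pair(9, 10)}

result = evaluate(predicted, truth)
print(result)
```

Here the matcher found two of the four known matches and reported one pair that is not a match, so a high count of reported matches alone says little about how close you are to the upper end of accuracy.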

Probabilistic Approach

Probabilistic matching is the other common approach vendors use in their match technology. This is a statistical approach in which weights are applied to the different attributes, and a score is derived from how many attributes match between two records and how well they match. The score is used as a confidence level: a pair that scores above a certain threshold is deemed a match. The three main disadvantages of a probabilistic approach are its linear nature, its poor handling of mismatched data sets, and its difficulty to implement.
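The scoring idea can be sketched in a few lines of Python. The weights, the threshold, and the sample records below are illustrative assumptions, not values from any vendor product, and the sketch only awards a weight on exact agreement, whereas a real implementation would also grade how well two values agree.

```python
# Illustrative weights and threshold; real values come from statistical tuning.
WEIGHTS = {"name": 3.0, "address": 2.0, "phone": 2.0, "dob": 2.5, "ssn": 5.0}
THRESHOLD = 8.0

def score_pair(a, b, weights=WEIGHTS):
    """Sum the weights of every attribute on which the two records agree."""
    total = 0.0
    for attr, weight in weights.items():
        va, vb = a.get(attr), b.get(attr)
        if va is None or vb is None:
            continue  # attribute missing on one side contributes nothing
        if va == vb:
            total += weight
    return total

def is_match(a, b, threshold=THRESHOLD):
    """A pair is deemed a match when its score reaches the threshold."""
    return score_pair(a, b) >= threshold

a = {"name": "John Smith", "address": "1 Main St", "phone": "555-0100",
     "dob": "1980-01-01", "ssn": "123-45-6789"}
b = {"name": "John Smith", "address": "2 Oak Ave", "phone": "555-0100",
     "dob": "1980-01-01", "ssn": "123-45-6789"}
c = {"name": "John Smith", "address": "9 Elm Rd", "phone": "555-0100",
     "dob": "1975-06-30", "ssn": "987-65-4321"}

print(score_pair(a, b), is_match(a, b))  # 12.5 True
print(score_pair(a, c), is_match(a, c))  # 5.0 False
```

Note that everything the pair agreed or disagreed on is collapsed into one number before the threshold is applied.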

A probabilistic approach tends to be the more accurate of the two when comparing vendors and their use of these technologies. It is also the harder of the two to implement, and such projects can take months or even years depending on the complexity.

Because it produces a single score for each pair, it is difficult to discern differences between individual attributes the way a human would. You choose a threshold score at which you feel comfortable that anything scoring higher is a match. At the same time, you know you are leaving potential matches below the threshold, either because you are not confident in them or because testing has shown a mix of matches and non-matches there, which prevents you from lowering the threshold. As much as an additional 10-20% of matches can sit in this gray area and be left unmatched.

Lastly, this approach does not handle misaligned data sets very well. The happy-path scenario is that all of your sources have exactly the same set of attributes used to match (e.g. name, address, phone, dob, ssn). But if different sources carry different attributes, this approach becomes harder to use.

For example:

Source1: Name Address Phone DOB SSN

Source2: Name Address DOB

Source3: Name Phone SSN

What makes this difficult is that the maximum possible score for a pair of records differs between source1 and source2, source2 and source3, and source1 and source3, because each pairing has more or fewer attributes available to compare. A single threshold won't work well, so you need multiple thresholds to account for the problem.
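A short Python sketch shows how far the ceilings can drift apart for the three sources in the example. The attribute weights are the same kind of illustrative assumption as before, and deriving each threshold as a fixed fraction of the pairing's maximum score is just one possible way to set multiple thresholds, not a recommendation from any particular platform.

```python
from itertools import combinations

# Illustrative weights (assumed, not taken from any product).
WEIGHTS = {"name": 3.0, "address": 2.0, "phone": 2.0, "dob": 2.5, "ssn": 5.0}

# Attributes available at each source, as in the example above.
SOURCES = {
    "source1": {"name", "address", "phone", "dob", "ssn"},
    "source2": {"name", "address", "dob"},
    "source3": {"name", "phone", "ssn"},
}

def max_score(src_a, src_b):
    """Highest score a pair can reach: the weights of the shared attributes."""
    shared = SOURCES[src_a] & SOURCES[src_b]
    return sum(WEIGHTS[attr] for attr in shared)

def pair_threshold(src_a, src_b, fraction=0.8):
    """Hypothetical per-pairing threshold: a fixed fraction of the maximum."""
    return fraction * max_score(src_a, src_b)

for src_a, src_b in combinations(SOURCES, 2):
    print(src_a, src_b, max_score(src_a, src_b), pair_threshold(src_a, src_b))
```

With these weights the ceilings are 7.5 for source1 and source2, 10.0 for source1 and source3, and only 3.0 for source2 and source3, which share nothing but name. A single global threshold of 8.0 could never be reached by two of the three pairings.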


©2010 by EntityWise.
