Part 1 Matching Structured Records: Why is it so difficult?
Updated: Jun 15, 2019
When I first interviewed at Initiate Systems (now IBM MDM Standard Edition), a pure-play MDM vendor, over 12 years ago, I distinctly remember thinking, "How difficult can matching structured records be?" The answer is "very difficult," but even that realization took time.
Most people approach this topic the same way: they assume it is a fairly easy problem to solve. To this day, friends and colleagues still make statements like "just match on SSN and you're done".
Matching is a journey that starts with some basic knowledge and expands as you gain experience. That depth keeps growing over the years as you work with different data sets and different problems across different clients. After 12+ years, which include time inside an MDM company, more than a dozen implementations across Patient, Provider, Company, B2B Contact, Product, and reference data sets, and building our own match platform, I still learn new aspects all the time.
Match vs Search
Before we move into the basics of structured record matching, it's worthwhile to quickly cover the difference between search and match, two concepts that are often mistakenly used interchangeably.
Search, as in Google or Solr, attempts to find any documents that contain the keywords, no matter where those keywords occur within the document. Match attempts to align the keywords with specific attributes such as company name or address.
To contrast the two with an example: if I search for Austin Technology Inc, I might get documents that contain it as the company name, but I might also get documents that have Austin as the city or Technology in the street name (Technology Blvd). This is valid when I'm attempting to find any document that has any related terms.
But when I'm trying to find the record that best matches that company name, it's more effective to limit the keyword comparisons to only the relevant fields, in this case company name. As a match example, I would specify CompanyName: Austin Technology Inc, and this would consider only those records that have this as part of their company name.
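The difference can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sample records, the field names, and the `search` and `match` helpers are hypothetical, and real search engines and match engines do far more than compare lowercase tokens.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"company_name": "Austin Technology Inc", "street": "100 Congress Ave", "city": "Houston"},
    {"company_name": "Lone Star Supply", "street": "5 Technology Blvd", "city": "Austin"},
    {"company_name": "Acme Corp", "street": "9 Oak St", "city": "Dallas"},
]


def tokens(text):
    """Split text into a set of lowercase keywords."""
    return set(text.lower().split())


def search(query, records):
    """Search: keep records where any keyword appears in any field."""
    wanted = tokens(query)
    return [r for r in records if wanted & tokens(" ".join(r.values()))]


def match(field, query, records):
    """Match: compare the keywords against one named field only."""
    wanted = tokens(query)
    return [r for r in records if wanted <= tokens(r[field])]


search_hits = search("Austin Technology Inc", records)
match_hits = match("company_name", "Austin Technology Inc", records)

print([r["company_name"] for r in search_hits])
print([r["company_name"] for r in match_hits])
```

Here the search also returns Lone Star Supply, because Austin is its city and Technology is part of its street, while the field-scoped match returns only the record whose company name contains the keywords.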
Structured Record Matching
For our purposes, structured records are records that describe a person, company, or product and have segmented attributes that uniquely describe that entity.
John Smith | 1 Main Street, Austin, Tx, 78750 | 555-1212 | 444-55-6666
This record has person name, address, phone, and social security number as attributes that uniquely describe John Smith.
The phrase "uniquely describe" is why most people without experience underestimate the difficulty of matching. Looking at this example, it's easy to believe that matching this record against another "like" record is easy.
But what if there were another record very similar to this one?
J Smith | 2 Elm Street, Dallas, Tx, 76113 | 555-1212 | 444-55-6667
Perhaps John moved, either to Dallas or to Austin. Or perhaps John's brother, Jim, lives in Dallas at 2 Elm Street, and the second record is actually his. But then why are the SSNs almost the same? Could one brother be using the other brother's SSN? Maybe they're twins and their SSNs are one apart.
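To make the ambiguity concrete, here is a minimal sketch that compares the two records above field by field. It assumes Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the field names and the `compare` helper are hypothetical, and this is not how any particular MDM product scores records.

```python
from difflib import SequenceMatcher

# The two example records from the text, split into hypothetical fields.
record_a = {
    "name": "John Smith",
    "address": "1 Main Street, Austin, Tx, 78750",
    "phone": "555-1212",
    "ssn": "444-55-6666",
}
record_b = {
    "name": "J Smith",
    "address": "2 Elm Street, Dallas, Tx, 76113",
    "phone": "555-1212",
    "ssn": "444-55-6667",
}


def similarity(a, b):
    """Return a 0..1 character-level similarity ratio for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(a, b):
    """Score each field of record a against the same field of record b."""
    return {field: round(similarity(a[field], b[field]), 2) for field in a}


scores = compare(record_a, record_b)
exact_ssn_match = record_a["ssn"] == record_b["ssn"]

for field, score in scores.items():
    print(f"{field:8} {score}")
print("exact SSN match:", exact_ssn_match)
```

An exact "just match on SSN" rule rejects this pair outright, even though the phone numbers are identical and the SSNs differ by a single digit. The scores alone cannot say whether this is one person who moved or two brothers, which is exactly the judgment a match process has to make.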
Unique isn't as unique as we think.
Match technology exists only because data is incomplete or inaccurate.
The two main causes of this issue are human error and changing data. Human error is hard to eradicate because there are so many ways a human can interact with records: staff within the company enter data, and so do customers and patients themselves, through websites and handwritten forms. Changing data is natural in any system as people move, change phones, get married or divorced, and change companies, or as companies change locations, merge, or get acquired.