Tuning Entity Resolution Algorithms Too Early — Resist. Resist.

Jeff Jonas
5 min read · Feb 3, 2023

So many configuration options to tinker with! Oh the joy….

When it comes to entity resolution systems, early tinkering with algorithms can be a waste of time.

While this is not true for traditional record matching methods, it is 100% true for entity-centric learning algorithms like those in Senzing® entity resolution.

[NOTE: It is essential to understand the distinction between entity-centric learning and record matching methods. If you aren’t sure about the differences, read this short article and watch the Entity Resolution Explained Step by Step video, paying special attention to the move involving Record 10.]

It’s very common for data scientists to immediately begin tinkering with their entity resolution configuration — on a small data set spanning one or a few data sources. In most cases, this is not only wasted effort but counterproductive. Or, as they say in hockey: Are you skating to where the puck is going?

Why? Configuration changes might appear to improve accuracy in the early days, but may just as likely degrade long-term accuracy. These changes often “fix” issues that become nonexistent as data volume and diversity increase over time.

The temptations to tinker are many.

For example, when the names are different enough to cause this kind of match to be missed:

[Image: False Negative (example of how entity resolution works)]

Imagine the configured name score threshold for a close match is set at 85, yet these names scored a 68.

Why not simply drop the name score threshold from 85 to 68 to pick up this good match? Can you feel the urge, given both records effectively have the same address and similar names?

The urge is strong, so you change the score, only to discover the consequence later, when the new threshold produces this over-match because the names scored an 83:

[Image: False Positive]

If the threshold had stayed at its original 85 value, this false positive would have been avoided.
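To make the tradeoff concrete, here is a minimal sketch in Python. The scores and record pairs are hypothetical stand-ins (Senzing’s actual scorer is not shown here); the point is only that no single global threshold separates the 68-point true match from the 83-point false one:

```python
# Hypothetical name-similarity scores for two record pairs, standing in
# for whatever scorer your entity resolution engine actually uses.
PAIRS = [
    # (pair description, name score, are they truly the same business?)
    ("Laurent's Cafe vs. Laurent's Café & Chocolate Bar", 68, True),
    ("two genuinely different businesses with similar names", 83, False),
]

def evaluate(threshold: int) -> None:
    """Show what a single global name-score threshold does to each pair."""
    print(f"threshold = {threshold}")
    for description, score, truly_same in PAIRS:
        matched = score >= threshold
        if matched and not truly_same:
            outcome = "FALSE POSITIVE (over-match)"
        elif not matched and truly_same:
            outcome = "FALSE NEGATIVE (missed match)"
        else:
            outcome = "correct"
        print(f"  score {score}: matched={matched} -> {outcome}  [{description}]")

evaluate(85)  # original threshold: misses the good match at 68
evaluate(68)  # lowered threshold: picks it up, but over-matches at 83
```

Wherever you place the threshold between these two scores, one of the two pairs comes out wrong.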

Your pursuit of greater accuracy through this type of threshold tinkering usually results in even more tinkering, such as:

Which setting, 85 or 86, causes the greatest benefit or least harm?
Can we factor in the rareness of the name?
What if the names are in French? Or Arabic?
What if the phone and URL are the same?
What if the company has multiple names, spelling variations, or a completely different DBA name?
What if the address is a shopping center address used by a large number of businesses?

When you go down this road you become a builder. A builder of complex rules, ever evolving as more data sources interact. How data source A matches B, C, D. And how data source B matches A, C, D. And so on. Some spend months or years tinkering on such things. Others, who find it so fascinating, will spend a lifetime.
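If you’re wondering how quickly that builder’s burden grows, the arithmetic alone makes the point: with n data sources there are n(n-1)/2 source pairs, each a candidate for its own bespoke matching rules. A quick sketch (this is just counting, not anything Senzing-specific):

```python
# Number of distinct source pairs that could each accumulate bespoke
# matching rules: n choose 2 = n * (n - 1) / 2.
for n in (3, 5, 10, 25, 50):
    pairs = n * (n - 1) // 2
    print(f"{n:3d} data sources -> {pairs:5d} source pairs to hand-tune")
```

Fifty data sources means 1,225 source pairs, each quietly inviting its own rules.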

Honestly, it’s crazy. Worse, it’s a huge waste of resources. Heck, some organizations find themselves with large teams literally spending millions a year on such obsessive analysis and tinkering.

While I could write an entire book on the how and why, the theory and best practices, the pros and cons, the scale issues, blah blah blah — I’ll summarize: Endless tuning will never beat adding more data.

Continuing with the example above: when Record 3 arrives, the previously missed match is instantly fixed:

[Image: Real-time learning fixes Records 1 and 2]

Record 3 for Laurent’s Café & Chocolate Bar becomes the glue that binds the records above into one entity.

At the same time, if the name scoring threshold had been left alone at 85, the dissimilar records would remain unmatched and the false positive would be avoided.
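Here is a toy end-to-end sketch of that glue-record effect, again in Python. The pairwise scores are hypothetical and the union-find is an illustration of transitive resolution, not Senzing’s actual algorithm:

```python
# Toy illustration of the "glue record" effect -- not Senzing's actual
# algorithm. Records 1 and 2 never score above the 85 name threshold
# against each other, but Record 3 matches each of them, so transitive
# closure pulls all three into one entity.
THRESHOLD = 85

# Hypothetical pairwise name scores (the addresses already agree here).
SCORES = {
    (1, 2): 68,   # the original missed match
    (1, 3): 90,   # Record 3 vs. Record 1
    (2, 3): 92,   # Record 3 vs. Record 2
}

parent = {r: r for r in (1, 2, 3)}  # union-find over record ids

def find(r):
    while parent[r] != r:
        r = parent[r]
    return r

def union(a, b):
    parent[find(a)] = find(b)

for (a, b), score in SCORES.items():
    if score >= THRESHOLD:
        union(a, b)

entities = {}
for r in parent:
    entities.setdefault(find(r), []).append(r)
print(list(entities.values()))  # -> [[1, 2, 3]]: one entity, threshold untouched
```

The threshold never moved; more data did the work.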

Tips to Improve Entity Resolution Results

· Avoid the temptation to tinker with Senzing entity resolution tuning in the early days. Instead, resolve the widest and most diverse sets of data you can get your hands on, i.e., expand your observation space.

· Review the quality of your results while pondering what other data sources you could add to further increase accuracy.

Only when you’ve loaded and resolved all available data sources should you even consider the science project of tinkering with how the entity resolution works. Tinkering is rarely worth your time and energy. There are more important things to work on.

This especially holds true when you’re using Senzing entity resolution, because the software has the following capabilities:

- Entity-centric learning, or the ability to learn different representations (natural variations) of names, addresses, and other features. Record matching systems don’t support this type of learning.

- Principle-based entity resolution, which allows the addition of new data sources and features WITHOUT any new training or tuning.

- Sequence neutrality, or continuous, real-time learning, meaning that as new data is received the system immediately reviews and corrects previous entity resolution assertions (see the sketch after this list).

- Standard configuration that works for more than 80% of users, regardless of whether their entities are people, companies, vessels, or planes, or the script is Roman, Cyrillic, Mandarin, or all of the above, all at the same time.
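As promised above, here is a small sketch of what sequence neutrality means in practice. It reuses the hypothetical scores from the earlier sketches and is, again, an illustration rather than Senzing internals: whatever order the three records arrive in, the final entity is the same.

```python
# Toy demonstration of sequence neutrality -- an illustration, not
# Senzing internals. Re-evaluating matches on each arrival lands on the
# same final entities regardless of record order.
from itertools import permutations

THRESHOLD = 85
SCORES = {frozenset({1, 2}): 68, frozenset({1, 3}): 90, frozenset({2, 3}): 92}

def resolve(arrival_order):
    entities = []  # each entity is a set of record ids
    for record in arrival_order:
        # Merge the new record into every entity it matches...
        matched = [e for e in entities
                   if any(SCORES[frozenset({record, r})] >= THRESHOLD for r in e)]
        merged = {record}.union(*matched) if matched else {record}
        # ...which may also glue previously separate entities together.
        entities = [e for e in entities if e not in matched] + [merged]
    return sorted(sorted(e) for e in entities)

results = {tuple(map(tuple, resolve(order))) for order in permutations((1, 2, 3))}
print(results)  # one unique outcome across all six orders: {((1, 2, 3),)}
```

All six arrival orders converge on the same single entity, which is why you can keep adding data sources without worrying about load order.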

Interesting Fact: Because we use principle-based entity resolution, if we hear about a configuration change made by one of our users, we determine whether the same change would benefit everyone. If the answer is yes, we usually change the default configuration in our latest software release, so everyone can benefit. If we don’t think it will benefit everyone, we are generally suspicious that the user is over-tinkering with their configuration to improve results on their current data set, instead of allowing future data to fix things. Not always. But often.

If you want to try the experiment above for yourself, with the same data, you can download the data sets in CSV or in JSON.


Written by Jeff Jonas

Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.
