Sometimes recognizing a junior vs. senior is obvious. Other times not so much.

Ambiguous Conditions in Entity Resolution Systems

Most entity resolution systems don’t handle ambiguous records properly. This tricky and subtle condition creates false positives that are difficult to find.

In entity resolution, we use the term “ambiguous” to mean “multiple good answers.”

The great American boxer George Foreman named all five of his boys George. Imagine having to perform entity resolution on a record containing only his name, home address, and home phone — nothing else. In a typical household, a record containing a name, home address and phone would likely be unique to a single person. In the case of George Foreman, such a record could be any one of six people.

Look at this simple example:

Is Record 3 the junior (Record 2) or the senior (Record 1)?

Most entity resolution algorithms will arbitrarily resolve Record 3 into either Record 1 (the senior, born in 1970) or Record 2 (the junior, born in 1990). For example, imagine this outcome:

Bad decision: Most algorithms make an arbitrary decision, e.g., asserting that Record 3 is the senior (Record 1).

Even upon human inspection this match looks good, doesn’t it? That’s the tricky thing about ambiguous records like Record 3: they can create invisible false positives. They are invisible in that you can’t see the false positive until you become aware of Record 2 (the junior).

The existence of Record 2 (the junior) means Record 3 could possibly be Record 1 (the senior) or Record 2 (the junior).

Better decision: Record 3 is possibly either the senior (Record 1) or the junior (Record 2).
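The difference between the two decisions can be sketched in a few lines of Python. This is a toy illustration only, not how Senzing or any particular engine works: the `Record` fields, the sample name, address, and phone values, the assumption that Record 3 carries no birth year, and the exact-match rule in `is_good_match` are all hypothetical. The point is just that a resolver can count how many good answers exist before committing to one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical schema for illustration; real records carry many more features.
    record_id: int
    name: str
    address: str
    phone: str
    birth_year: Optional[int] = None  # None means the record has no date of birth


def is_good_match(a: Record, b: Record) -> bool:
    """Toy rule: name, address, and phone agree, and birth years do not conflict."""
    same_core = (a.name, a.address, a.phone) == (b.name, b.address, b.phone)
    birth_conflict = (
        a.birth_year is not None
        and b.birth_year is not None
        and a.birth_year != b.birth_year
    )
    return same_core and not birth_conflict


def resolve(incoming: Record, known: List[Record]) -> dict:
    """Return a decision instead of silently picking one of several good answers."""
    candidates = [r.record_id for r in known if is_good_match(incoming, r)]
    if len(candidates) == 1:
        return {"decision": "resolved", "record_ids": candidates}
    if len(candidates) > 1:
        # Multiple good answers: report a possible match to each, resolve to none.
        return {"decision": "ambiguous", "record_ids": candidates}
    return {"decision": "new_entity", "record_ids": []}


# Made-up sample data: the senior, the junior, and a record with no birth year.
senior = Record(1, "Pat Smith", "123 Main St", "555-0100", 1970)
junior = Record(2, "Pat Smith", "123 Main St", "555-0100", 1990)
unknown = Record(3, "Pat Smith", "123 Main St", "555-0100")

result = resolve(unknown, [senior, junior])
print(result)
```

Note that `resolve(unknown, [senior])` returns a confident "resolved" decision. That is the invisible false positive in miniature: the match only looks safe because the junior is not yet known.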

Handling ambiguous records properly is very important, especially in systems that can impinge on someone’s freedom or opportunity, e.g., a government watch list or background check system. Imagine if Record 3 represented derogatory information, e.g., “terrorist” or “criminal record.” Arbitrarily matching this derogatory data to the junior or the senior would result in a 50/50 chance of adversely impacting the wrong person.

If you want to see how your entity resolution engine handles this ambiguous condition compared to Senzing, check out these three records and more in our Synthetic Truth Set.

For a more technical article on this topic, click here.


Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.
