How to Handle Drifting Entity IDs in Entity Resolution Systems

Jeff Jonas
Oct 26, 2022

Would not be too shocked to see one kid pop out and land in the other.

A common question from those new to entity resolution is: Are the entity IDs being created persistent? Meaning, once a record is assigned to an entity ID, will this ID be the same forever? The simple answer is: absolutely not. The entity ID assigned to a record can change. Why? One must be able to change their mind about the past.

In this short article I’ll describe why this is a fact of life and offer several approaches for handling this operationally.

Why Drifting Entity IDs are a Reality

First, why this must be. Invariably entity resolution systems have some errors — false positives or false negatives. Whether these errors are detected by humans (e.g., a consumer complains) or the entity resolution algorithm itself (e.g., a new record reveals there has been a junior/senior confusion in the past) — the records need to be moved from one entity ID to another.

Example 1: A government health department once tried to convince me that their entity IDs never change because their transactions are all assigned to the static national ID number of the patient. I asked how they handle an unconscious patient without an ID (i.e., John Doe). It turns out those types of health care transactions are initially assigned one entity ID and are reassigned to another entity ID later, once the patient's identity becomes known. Between this and other examples, the department admitted that records, in fact, can move between entity IDs.

Example 2: In my video Entity Resolution Explained Step by Step, one can see that without the existence of Record 8, Record 6 is at best a possible match to Entity 1. With the arrival of Record 8, one becomes quite confident Record 6 belongs in Entity 1. This and other common examples — of records jumping around — are explained in this video.

When Record 8 is learned, Record 6 pops out of Entity 4 and lands in Entity 1.

Downstream processes are both the victims and beneficiaries of dynamic entity IDs. If an important decision has already been made, maybe an apology is necessary. If only future decisions will be affected by the correction, lucky you.

In the unfortunate case, a correction must be made about who is who (or who is related to whom) — and this reveals a previous incorrect decision that may require remediation. If a terrorist has been let into the country accidentally, the policy is probably: go find them. If a bank loan was made in error, maybe recourse needs to be explored. If someone was overlooked for a low interest rate credit card promotion, what then? These are all policy considerations.
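Whatever the policy, someone first has to find the decisions that a correction touches. Here is a minimal sketch of one way to do that, assuming each decision was logged with the record it was based on and the entity ID that record had at the time. The `Decision` class, the field names, and the sample data are all hypothetical, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    decision_id: str
    record_id: str              # the source record the decision was based on
    entity_id_at_decision: int  # entity the record belonged to at that time


def decisions_needing_review(decisions, current_entity_of):
    """Return (decision, current_entity_id) pairs for records that have moved."""
    flagged = []
    for decision in decisions:
        current = current_entity_of.get(decision.record_id)
        if current != decision.entity_id_at_decision:
            flagged.append((decision, current))
    return flagged


# Hypothetical data: record "R6" was in entity 4 when a loan was approved
# and has since been moved to entity 1. Record "R2" has not moved.
decision_log = [
    Decision("loan-001", "R6", 4),
    Decision("loan-002", "R2", 1),
]
current_entity_of = {"R6": 1, "R2": 1}

for decision, now in decisions_needing_review(decision_log, current_entity_of):
    print(f"{decision.decision_id}: record {decision.record_id} moved "
          f"from entity {decision.entity_id_at_decision} to entity {now}")
```

This only catches records that moved. A decision can also be affected when other records join or leave the same entity, so a fuller version would compare entity membership as well. What to do with a flagged decision remains a policy question.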

Obviously, the closer to real time that entity IDs are corrected, the more reliable the business decisions. A correction discoverable on January 1st that is held until the March 31st quarterly update leaves a lot of time for bad decisions to be made. High-consequence systems, like fraud or risk-oriented systems, should be as close to real time as possible. With marketing systems, on the other hand, batched corrections typically have milder consequences, e.g., a missed upsell opportunity or an inefficient customer support call. Data science and machine learning efforts are often least impacted because they can be handed a full snapshot at the time needed.

Architectural Options for Managing Dynamic Entity IDs

  • Real-time entity ID lookup (e.g., using an API) is the simplest approach
  • Real-time replication (e.g., monitoring an affected entities queue) for when you absolutely must have the unique ID in your tables for legacy software
  • Batch replication if you must (e.g., exporting a periodic snapshot of the entity map)
  • Or a combination of the above, but always use a real-time lookup whenever possible
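
To make the trade-offs between these options concrete, here is a small in-memory sketch. It assumes a hypothetical `EntityMap` class standing in for the entity resolution system. The method names and the queue of changed record IDs are illustrative simplifications and do not reflect any real product's API.

```python
from collections import deque


class EntityMap:
    """Hypothetical stand-in for the resolver's record -> entity mapping."""

    def __init__(self):
        self._entity_of = {}
        self.affected = deque()  # record IDs whose entity ID has changed

    def assign(self, record_id, entity_id):
        """Record a (re)assignment and queue it for downstream consumers."""
        if self._entity_of.get(record_id) != entity_id:
            self._entity_of[record_id] = entity_id
            self.affected.append(record_id)

    def lookup(self, record_id):
        """Option 1: real-time lookup at the moment of the decision."""
        return self._entity_of[record_id]

    def snapshot(self):
        """Option 3: periodic export of the whole map."""
        return dict(self._entity_of)


def drain_affected(entity_map, local_table):
    """Option 2: apply queued changes to a locally stored copy."""
    updated = 0
    while entity_map.affected:
        record_id = entity_map.affected.popleft()
        local_table[record_id] = entity_map.lookup(record_id)
        updated += 1
    return updated


resolver = EntityMap()
resolver.assign("R6", 4)

legacy_table = {}
drain_affected(resolver, legacy_table)  # replicated copy is now in sync
periodic_export = resolver.snapshot()   # batch copy taken at this moment

resolver.assign("R6", 1)                # the resolver changes its mind

print("lookup:     ", resolver.lookup("R6"))   # current by construction
print("batch copy: ", periodic_export["R6"])   # stale until the next export
print("replica:    ", legacy_table["R6"])      # stale until the queue is drained
drain_affected(resolver, legacy_table)
print("replica now:", legacy_table["R6"])
```

Downstream tables are keyed by record ID rather than entity ID in this sketch, since the record ID is the part that stays stable while the entity ID is free to change.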

Changing entity IDs are a fact of life. Any sane system must be able to change its mind about the past.

Just remember: the longer you wait to get the current state of an entity, the more data quality declines and the more prior decisions need remediation. Just another reason real-time entity resolution, able to change its mind about the past, is ideal.

[I’ve written more generally on this topic in this article entitled “Sequence Neutrality”. In short, did the observations arrive in order: A, B, C? Or did they arrive: C, B, A? In either case, one’s final understanding should be the same.]
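
The order-independence idea can be checked on a toy example. The sketch below assumes a deliberately simple matching rule that I made up for illustration: two records belong together if they share any feature value. The records and feature strings are hypothetical. It only ever merges, so it does not model records popping out of an entity, but it does show the property of interest: every arrival order ends in the same grouping.

```python
from itertools import permutations


def resolve(records):
    """Incrementally group records that share any feature value.

    Returns the final grouping as a set of frozensets of record IDs.
    """
    entities = []  # each entity is (set of record IDs, set of feature values)
    for record_id, features in records:
        ids, feats = {record_id}, set(features)
        untouched = []
        for ent_ids, ent_feats in entities:
            if ent_feats & feats:
                # The new record links to this entity, so absorb it.
                ids |= ent_ids
                feats |= ent_feats
            else:
                untouched.append((ent_ids, ent_feats))
        untouched.append((ids, feats))
        entities = untouched
    return {frozenset(ids) for ids, _ in entities}


# Hypothetical records: A and C have nothing in common until B arrives.
records = [
    ("A", {"phone:555-0100"}),
    ("B", {"phone:555-0100", "email:pat@example.com"}),
    ("C", {"email:pat@example.com"}),
]

# Resolve every possible arrival order and collect the distinct outcomes.
outcomes = {frozenset(resolve(order)) for order in permutations(records)}
print("distinct final groupings:", len(outcomes))
```

If A and C arrive first they sit in separate entities, and B's arrival pulls them together. If B arrives first the merge happens earlier. The intermediate states differ, but the final understanding does not.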

SENZING SPECIFICS

How Does an Entity ID Behave?

Advanced: Real-time Replication and Analytics

Advanced: Replicating the Senzing Results to a Data Warehouse

Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.
