Why Federated Search Bites (for enterprise discovery)

Imagine trying to find a book without the help of a search index.

By “federated search” I mean individually searching each system in an attempt to locate records meeting certain criteria, e.g., records about “Liz Reston.” Federated search is like going to the library for a specific book and, instead of searching for its location in the index, looking for the book in every aisle until you locate it.

Federated search is less accurate and less efficient than using an index, regardless of whether it is performed manually (i.e., a human searches each system) or automated (i.e., a machine searches each system).

To illustrate, imagine searching for the information “Liz Reston, 123 E Court Rd, reston@home.com” when these seven records exist — each one in a different source system:

  1. Liz Reston, 123 E Court Rd, reston@home.com
  2. Elizabeth Reston, reston@home.com, (202) 762–1401
  3. Beth Reston, (202) 762–1401, beth@old-email.com
  4. Beth Smith-Reston, beth@old-email.com, 444 Fourth St
  5. Lizzy Smith, 444 Fourth St, beth@work.com
  6. Bob Reston, reston@home.com
  7. reston@home.com

First, note two immediate problems:

  • Federated search won’t find records 3, 4, and 5 because none of their fields match the search criteria.
  • Federated search will most likely include record 7, which is risky because this record could just as easily belong to Bob (record 6).
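
The two problems above can be reproduced with a minimal sketch. It assumes each source system can only answer exact field-value matches; the RECORDS dictionary, the field names, and federated_search are illustrative stand-ins, not any real system's API.

```python
# Hypothetical stand-ins for the seven source systems, one record each.
RECORDS = {
    1: {"name": "Liz Reston", "address": "123 E Court Rd", "email": "reston@home.com"},
    2: {"name": "Elizabeth Reston", "email": "reston@home.com", "phone": "(202) 762-1401"},
    3: {"name": "Beth Reston", "phone": "(202) 762-1401", "email": "beth@old-email.com"},
    4: {"name": "Beth Smith-Reston", "email": "beth@old-email.com", "address": "444 Fourth St"},
    5: {"name": "Lizzy Smith", "address": "444 Fourth St", "email": "beth@work.com"},
    6: {"name": "Bob Reston", "email": "reston@home.com"},
    7: {"email": "reston@home.com"},
}


def federated_search(query, systems):
    """Visit every system; keep any record that shares an exact field value."""
    hits = {}
    for record_id, record in systems.items():
        matched = sorted(f for f, v in query.items() if record.get(f) == v)
        if matched:
            hits[record_id] = matched
    return hits


QUERY = {"name": "Liz Reston", "address": "123 E Court Rd", "email": "reston@home.com"}
HITS = federated_search(QUERY, RECORDS)

for record_id, fields in HITS.items():
    print(record_id, fields)
```

Under these assumptions the search returns records 1, 2, 6, and 7 and never sees records 3, 4, and 5. Records 6 and 7 both come back on the email alone, and nothing in the result says whose record 7 is.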

Federated search, whether conducted manually or implemented with automation, is deficient.

Deficiencies of Manual Federated Search

  • Volume: searching every system, every time is time-consuming and challenging, especially when there are dozens or hundreds of different systems (e.g., will the person searching remember to search the payroll database?).
  • Variation: the person searching is unlikely to remember to search for every possible variation (e.g., Elizabeth, Beth, Liz or the many spellings of Muhammed including Mhd).
  • Variability: the person searching is unlikely to try dates of birth with month and day transposed (a common data quality problem) or to account for natural variability in addresses, such as 123 E Court Rd vs. 123 East Court Road.
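
To give a feel for what Variation and Variability demand of a searcher, here is a rough sketch of the normalization a person would have to do in their head for every search. The lookup tables and function names are hypothetical and deliberately tiny; real name and address handling is far messier (for instance, "St" can mean Street or Saint).

```python
from datetime import date

# Illustrative lookup tables only; real ones would be far larger.
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "lizzy": "elizabeth"}
ADDRESS_WORDS = {"e": "east", "rd": "road", "st": "street"}


def canonical_first_name(name):
    """Map the first name token to a canonical form when a nickname is known."""
    first = name.split()[0].lower()
    return NICKNAMES.get(first, first)


def canonical_address(address):
    """Lowercase the address and naively expand a few common abbreviations."""
    words = address.lower().replace(".", "").split()
    return " ".join(ADDRESS_WORDS.get(w, w) for w in words)


def date_candidates(d):
    """Return the date plus its month/day transposition when that differs and is valid."""
    candidates = [d]
    if d.day <= 12 and d.day != d.month:
        candidates.append(date(d.year, d.day, d.month))
    return candidates


print(canonical_first_name("Liz Reston") == canonical_first_name("Beth Reston"))
print(canonical_address("123 E Court Rd") == canonical_address("123 East Court Road"))
print(date_candidates(date(1980, 3, 4)))
```

Even this toy version has to be applied to every field, in every system, on every search, which is exactly what a person searching by hand will not do consistently.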

While, in theory, automated search could remedy the above deficiencies, there are other serious issues with federated search not easily solved even with automation.

Deficiencies of Automated Federated Search

  • Constraints: legacy systems often don’t provide an efficient means to search by address, phone, email, etc. For example, a payroll system may support searching only by employee number, name, date of birth, and tax ID, but not by email or phone. As a result, automated search may have to scan entire databases record by record, which is exceptionally slow.
  • Completeness: if only a name, address, and email are available, how will automated search find records about the same person when those records contain none of those values (as with records 4 and 5 above)?
  • Commingling: just because records look alike doesn’t mean they are alike. What if you find a matching record based on an email address that was periodically shared by a husband and wife (as noted earlier with regard to record 7)? Knowing the email has been used by both is essential to understanding who is who in your data.
  • Contamination: many systems write searches to audit logs, which means every search creates more copies of personal data. From a privacy compliance perspective, this is a nightmare. The first time a person asks you for the data you hold on them (e.g., under GDPR), there are only a few records, but the second time they ask there are hundreds of new instances of their data (due to meticulously logged searches)!
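
The Constraints point is easiest to see side by side. The sketch below contrasts a record-by-record scan with a lookup against an index built once up front. The PAYROLL rows and both function names are made up for illustration; a real legacy system would expose its data only through whatever interface it happens to have.

```python
from collections import defaultdict

# Hypothetical payroll rows, invented for this example.
PAYROLL = [
    {"employee_no": 1001, "name": "Elizabeth Reston", "email": "reston@home.com"},
    {"employee_no": 1002, "name": "Bob Reston", "email": "reston@home.com"},
    {"employee_no": 1003, "name": "Lizzy Smith", "email": "beth@work.com"},
]


def scan_by_email(rows, email):
    """No email lookup available: touch every row on every search."""
    return [row["employee_no"] for row in rows if row["email"] == email]


def build_email_index(rows):
    """One pass up front; each later search is a single dictionary lookup."""
    index = defaultdict(list)
    for row in rows:
        index[row["email"]].append(row["employee_no"])
    return index


EMAIL_INDEX = build_email_index(PAYROLL)

print(scan_by_email(PAYROLL, "reston@home.com"))
print(EMAIL_INDEX.get("reston@home.com", []))
```

Both calls return the same employee numbers, but the scan's cost grows with the size of the table on every search while the index lookup does not. Note that the shared email still returns two different people, so an index on a single field does nothing for Commingling by itself.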

The simple remedy for these problems is to use entity resolution to create an index that turns the above seven records into the following entity-resolved graph:

Entity #1 contains …

  1. Liz Reston, 123 E Court Rd, reston@home.com
  2. Elizabeth Reston, reston@home.com, (202) 762–1401
  3. Beth Reston, (202) 762–1401, beth@old-email.com
  4. Beth Smith-Reston, beth@old-email.com, 444 Fourth St
  5. Lizzy Smith, 444 Fourth St, beth@work.com

And points to Entity #3 (below) as a possible match.

Entity #2 contains …

  6. Bob Reston, reston@home.com

And points to Entity #3 (below) as a possible match.

Entity #3 contains …

  7. reston@home.com

And points to Entities #1 and #2 (above) as possible matches.
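
For readers who want to see how seven records could collapse into this graph, here is a toy sketch. It is not how any particular entity resolution product works. It assumes one simplistic rule (records are the same person when they share an identifier and their first names reduce to the same canonical name) and a second rule for possible matches (a shared identifier where one side has no name to confirm or refute). The nickname table and all function names are hypothetical.

```python
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "lizzy": "elizabeth"}

RECORDS = {
    1: {"name": "Liz Reston", "ids": {"123 E Court Rd", "reston@home.com"}},
    2: {"name": "Elizabeth Reston", "ids": {"reston@home.com", "(202) 762-1401"}},
    3: {"name": "Beth Reston", "ids": {"(202) 762-1401", "beth@old-email.com"}},
    4: {"name": "Beth Smith-Reston", "ids": {"beth@old-email.com", "444 Fourth St"}},
    5: {"name": "Lizzy Smith", "ids": {"444 Fourth St", "beth@work.com"}},
    6: {"name": "Bob Reston", "ids": {"reston@home.com"}},
    7: {"name": None, "ids": {"reston@home.com"}},
}


def first_name(name):
    token = name.split()[0].lower()
    return NICKNAMES.get(token, token)


def same_person(a, b):
    """Toy rule: a shared identifier plus the same canonical first name."""
    if not a["name"] or not b["name"]:
        return False
    return bool(a["ids"] & b["ids"]) and first_name(a["name"]) == first_name(b["name"])


def resolve(records):
    """Group records into entities with a small union-find."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    ids = sorted(records)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if same_person(records[a], records[b]):
                parent[find(b)] = find(a)

    entities = {}
    for rid in ids:
        entities.setdefault(find(rid), []).append(rid)
    return sorted(entities.values())


def possible_matches(records, entities):
    """Link entities that share an identifier when one of them has no name at all."""
    def ids_of(entity):
        return set().union(*(records[r]["ids"] for r in entity))

    def nameless(entity):
        return all(records[r]["name"] is None for r in entity)

    links = []
    for i, left in enumerate(entities):
        for j in range(i + 1, len(entities)):
            right = entities[j]
            if ids_of(left) & ids_of(right) and (nameless(left) or nameless(right)):
                links.append((i + 1, j + 1))  # 1-based entity numbers
    return links


ENTITIES = resolve(RECORDS)
LINKS = possible_matches(RECORDS, ENTITIES)
print(ENTITIES)
print(LINKS)
```

The key design point is that records 1 and 5 share no field at all, yet end up in the same entity because each link in the chain (1 to 2 by email, 2 to 3 by phone, 3 to 4 by email, 4 to 5 by address) is resolved once, ahead of time, rather than rediscovered on every search.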

Searching this index for “Liz Reston, 123 E Court Rd, reston@home.com” finds Entity #1 as the same person, revealing all five of Liz’s records. Entity #3 (record 7) is highlighted as a possible match, allowing the person searching to consider that record more carefully.
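
A search against such an index might look like the sketch below. The ENTITIES layout is written out by hand from the graph above and is only one hypothetical way to store it. The matching rule (exact name seen in the entity means same, no names at all means possible) is a deliberate oversimplification.

```python
from collections import defaultdict

# Entity-resolved index, transcribed by hand from the graph above (hypothetical layout).
ENTITIES = {
    1: {
        "names": {"Liz Reston", "Elizabeth Reston", "Beth Reston",
                  "Beth Smith-Reston", "Lizzy Smith"},
        "ids": {"123 E Court Rd", "reston@home.com", "(202) 762-1401",
                "beth@old-email.com", "444 Fourth St", "beth@work.com"},
        "records": [1, 2, 3, 4, 5],
    },
    2: {"names": {"Bob Reston"}, "ids": {"reston@home.com"}, "records": [6]},
    3: {"names": set(), "ids": {"reston@home.com"}, "records": [7]},
}

# Inverted index: identifier value -> entities that have used it.
ID_INDEX = defaultdict(set)
for entity_id, entity in ENTITIES.items():
    for identifier in entity["ids"]:
        ID_INDEX[identifier].add(entity_id)


def search(name, identifiers):
    """Look up candidate entities by identifier, then classify each by name."""
    candidates = set()
    for identifier in identifiers:
        candidates |= ID_INDEX.get(identifier, set())

    result = {"same": [], "possible": []}
    for entity_id in sorted(candidates):
        names = ENTITIES[entity_id]["names"]
        if name in names:
            result["same"].append(entity_id)
        elif not names:  # nothing to confirm or refute the match
            result["possible"].append(entity_id)
    return result


RESULT = search("Liz Reston", {"123 E Court Rd", "reston@home.com"})
print(RESULT)
print(ENTITIES[RESULT["same"][0]]["records"])
```

One lookup returns all five of Liz’s records, including the three that share no field with the query, and surfaces record 7 separately instead of silently mixing it in.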

There are no shortcuts: entity-resolved indexes deliver effective and efficient search, whether an organization is simply trying to improve investigative search (e.g., insider threat, bank fraud, fake identities) or striving to comply with new privacy laws (e.g., GDPR or CCPA).

Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.