Evaluating Entity Resolution? Don’t Overlook Operational Impacts!

Jeff Jonas
Jan 12, 2022


Imagine evaluating a car by just driving it around the block … when your real needs are a low-maintenance, all-terrain vehicle with attractive operating costs.

Similarly, imagine evaluating entity resolution technology by just batch loading a couple of data sets … when you need a system that operates 7x24 in real time, supports dozens of data sources (and growing), exposes low-latency entity resolution services through the enterprise data fabric, and so on. Just like the trip around the block, that's not a good test!

When evaluating entity resolution technology, don’t overlook your long-term operational requirements. Here are some common oversights:

Not enough emphasis on understanding what it takes to onboard new data sources. Knowing what’s involved is important, whether you expect to add a few, dozens or hundreds of data sources over time. The questions that need to be answered include:

  • What kind of skills are needed, and how much time is required, for data preparation, mapping and tuning?
  • Do you need to reload all prior data as you add each new data source? This is important to know, as it can mean reprocessing an ever-larger number of records each time (see the sketch below).
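To see why that reprocessing matters, here is a minimal back-of-the-envelope sketch in Python. Every number in it (existing record count, sources added, records per source) is a made-up assumption for illustration, not a benchmark.

```python
# Cumulative records reprocessed when each new data source forces a full
# reload of all prior data. All figures below are hypothetical assumptions.

EXISTING_RECORDS = 100_000_000     # records already loaded (assumed)
NEW_SOURCES = 12                   # data sources added over the next year (assumed)
RECORDS_PER_SOURCE = 5_000_000     # average size of each new source (assumed)

total_reprocessed = 0
loaded = EXISTING_RECORDS
for n in range(1, NEW_SOURCES + 1):
    loaded += RECORDS_PER_SOURCE   # add the new source's records
    total_reprocessed += loaded    # a full reload touches everything again
    print(f"source {n:2d}: reload processes {loaded:,} records")

print(f"\nTotal processed across {NEW_SOURCES} reloads: {total_reprocessed:,}")
print(f"Incremental-only alternative:              {NEW_SOURCES * RECORDS_PER_SOURCE:,}")
```

With these assumed numbers, the reload approach processes well over a billion records just to add 60 million new ones.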

Too little attention paid to total cost of ownership (TCO). When calculating TCO, it is important to include the following (a rough estimation sketch follows this list):

  • The cost of onboarding each new data source, because adding one might take an expert (or several) a month or more. If you expect to add many new data sources, this cost adds up fast.
  • The cost of building an in-house team of specialists to operate the system. Don’t forget to factor in bench strength and new-hire training programs to backfill attrition.
  • The cost to deploy and maintain an A and B infrastructure, if needed (i.e., one system serving 7x24 operations while the other handles the periodic reload).
  • The cost of any additional hardware needed to support your roadmap (e.g., delivering low-latency API services).
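Here is a minimal sketch for roughing out those line items over three years. Every figure is a placeholder assumption to be replaced with your own vendor quotes, salary data and infrastructure pricing.

```python
# Rough three-year TCO sketch for operating an entity resolution system.
# Every number below is a placeholder assumption, not a real quote.

YEARS = 3
NEW_SOURCES_PER_YEAR = 10               # expected data source growth (assumed)
ONBOARDING_COST_PER_SOURCE = 25_000     # expert time to prep, map and tune one source (assumed)
OPS_TEAM_COST_PER_YEAR = 400_000        # specialists, bench strength, training (assumed)
AB_INFRASTRUCTURE_PER_YEAR = 150_000    # second stack for periodic reloads, if needed (assumed)
LOW_LATENCY_HW_PER_YEAR = 80_000        # extra capacity for real-time API services (assumed)

onboarding = YEARS * NEW_SOURCES_PER_YEAR * ONBOARDING_COST_PER_SOURCE
operations = YEARS * OPS_TEAM_COST_PER_YEAR
ab_stack = YEARS * AB_INFRASTRUCTURE_PER_YEAR
api_hw = YEARS * LOW_LATENCY_HW_PER_YEAR

tco = onboarding + operations + ab_stack + api_hw
print(f"Onboarding new sources: ${onboarding:>12,}")
print(f"Operations team:        ${operations:>12,}")
print(f"A/B infrastructure:     ${ab_stack:>12,}")
print(f"Low-latency hardware:   ${api_hw:>12,}")
print(f"Three-year TCO:         ${tco:>12,}")
```

Even a crude model like this makes the trade-offs visible: if onboarding a source takes a month of expert time, the onboarding line alone rivals the hardware line items.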

When evaluating systems like Senzing that support transactional updates — handling adds, changes and deletes incrementally (including new data sets) — don’t put too much emphasis on the time and cost incurred to initially load historical and reference data. Why? Because periodic reloads (daily, weekly, etc.) are not required. The time and cost of bulk loading are really only important to quantify if the entity resolution technology requires full reloads to process adds, changes and deletes.

To better estimate operational impacts on your organization, consider these best practices:

Evaluate a system for its initial production use and, more importantly, what you’ll need one, two and three years out. Think through different scenarios of how you might expand the system’s data and use over time. For example, your roadmap goals might include:

  • Adding 32 new data sources containing ~200M more records.
  • Growing to more than 750M records in total.
  • Adding a low-latency entity resolution service via your enterprise data fabric.

Scope total cost of ownership (TCO) for ongoing operations. When it comes to operating production entity resolution systems, there can be huge differences in TCO. Be sure your TCO calculations also include:

  • People and hardware required for data preparation, mapping and configuration of each data source. Note: Don’t be surprised when this cost varies significantly from system to system.
  • Production hardware, including high availability and disaster recovery.
  • Licensing for the entire software stack.
  • Number and types of people needed for daily operations.
  • Requirements to deploy software version upgrades, including regression testing, rollback planning, etc.
  • Security audits, including all the moving parts.

If real-time transactional entity resolution technology is also being evaluated, be sure to appropriately assess the pros and cons.

  • Batch-based systems are usually very efficient at quickly loading large files, which is great until day two when you need to reload everything again.
  • Real-time transactional systems are typically slower when loading bulk data, but they handle all future adds, changes and deletes incrementally, without reloading, which can have significant long-term operational benefits (see the sketch below).
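As a rough illustration of that trade-off, the sketch below compares the total records processed in one year by a batch system that reloads weekly against a transactional system that processes only the deltas. The repository size and weekly volume are assumptions for illustration, not benchmarks of any product.

```python
# One-year workload comparison: weekly full reloads vs. incremental updates.
# All volumes are assumed for illustration only.

REPO_RECORDS = 500_000_000      # records in the repository at the start (assumed)
WEEKLY_DELTAS = 2_000_000       # adds, changes and deletes arriving each week (assumed)
WEEKS = 52

batch_work = 0          # records touched by a system that reloads weekly
incremental_work = 0    # records touched by a transactional system
repo = REPO_RECORDS
for _ in range(WEEKS):
    repo += WEEKLY_DELTAS              # approximate growth; treats all deltas as adds for simplicity
    batch_work += repo                 # a full reload touches every record
    incremental_work += WEEKLY_DELTAS  # only the week's deltas are processed

print(f"Batch (weekly reloads): {batch_work:,} records processed")
print(f"Transactional (deltas): {incremental_work:,} records processed")
print(f"Ratio: {batch_work / incremental_work:,.0f}x")
```

Even if the batch loader is several times faster per record, the difference in total work dominates over a year.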

A common mistake when evaluating entity resolution technology is to focus too much on the basics, or the minimum viable product (MVP). To reduce your risk of buyer's remorse, spend more time up front thinking about the overall journey.

Would love to hear any and all comments about this post, as I’d like to evolve it over time. Also, if you are a like-minded, kindred spirit, please join our Entity Resolution LinkedIn Group.

Written by Jeff Jonas

Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.
