How Entity Resolution Checks Work

This page covers everything the Entity Resolution check does, in detail: how it clusters records, how exact and fuzzy target fields combine into a composite score, how the threshold decides whether two records are the same entity, how the distinction field is enforced after clustering, and what the resulting Shape Anomaly looks like.

If you only need a quick reference, see the Introduction page, which covers the formal definition, field scope, and general/anomaly properties.

How the Check Evaluates Entity Resolution

Every Entity Resolution check follows the same five-step evaluation flow, regardless of how many target fields you configure:

Apply the filter clause. If the check has a filter set, only the rows that match the filter expression continue to the next step. Rows that fall outside the filter are ignored and cannot cause a violation.
Pre-filter on exact (blocking) target fields. If any target field has match_type set to exact, only records that share the same value on every exact field can ever be paired. Records with different values on an exact field are blocked from comparison.
Score pairs against the fuzzy target fields. For each remaining candidate pair, the platform computes a similarity score per fuzzy field (fuzzy text similarity for strings, absolute or relative proximity for numerics, offset or granularity bucketing for datetimes), then combines those scores into a single weighted composite score.
Build clusters by connecting pairs above the threshold. Every pair whose composite score is greater than or equal to the composite match threshold is treated as a match. Matches are grouped transitively: if A matches B and B matches C, all three records collapse into one cluster even if A and C are not directly above the threshold. Each cluster receives a unique _qualytics_entity_id.
Enforce the distinction field. Within each cluster, the platform counts the distinct values of the distinction_field. Clusters where that count is greater than 1 are non-compliant: the platform considers their records to describe the same entity, but the data assigns them different identifiers.

The order of operations matters: blocking fields are applied before scoring, so records that disagree on an exact field never enter the same cluster, regardless of how similar their fuzzy fields are.
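The blocking step can be pictured as grouping records by their exact-field values and only pairing records inside the same group. The following is a minimal sketch under that assumption, not the platform's implementation; `candidate_pairs` and the sample records are hypothetical names used for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, exact_fields):
    """Yield index pairs of records that agree on every exact (blocking) field."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec.get(field) for field in exact_fields)
        # A NULL (None) blocking value never pairs with anything,
        # mirroring the NULL behavior described under the filter clause.
        if any(value is None for value in key):
            continue
        blocks[key].append(idx)
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        yield from combinations(members, 2)


records = [
    {"name": "ACME Inc.", "country_code": "US"},
    {"name": "ACME", "country_code": "US"},
    {"name": "ACME", "country_code": "DE"},
    {"name": "Acme GmbH", "country_code": None},
]

pairs = list(candidate_pairs(records, ["country_code"]))
print(pairs)
```

Only the two US records survive as a candidate pair. The DE record is blocked even though its name is identical to the second US record, and the record with a NULL country_code is never paired.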

Target Fields: The Building Block

Every target field config has at least three pieces: the field name, the match_type, and a weight (default 1.0). The match_type determines which similarity formula is used.

String Target Fields

`match_type`	Behavior
`fuzzy` (default)	Fuzzy text similarity between the two strings. Score ranges from `0.0` (no match) to `1.0` (identical). The optional knobs below can either promote a pair to a score of `1.0` (substring containment, homophone match) or adjust how tokens are weighted (term frequency).
`exact`	Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score.

Three optional knobs on fuzzy string fields:

pair_substrings: when true, if one string is contained in the other, the pair's score on this field is treated as 1.0. Useful when a name is sometimes recorded with extra qualifiers ("ACME" vs "ACME Inc.").
pair_homophones: when true, if both strings sound alike (phonetic similarity), the pair's score on this field is treated as 1.0. Useful for names that sound the same but are spelled differently ("Catherine" vs "Katherine").
consider_term_frequency: when true, rare tokens are weighted more heavily than common tokens when comparing the two strings. Useful when common words (e.g. "Inc", "Ltd", "Group") dilute the signal of distinctive words.
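To make the effect of pair_substrings concrete, here is a small sketch of a fuzzy string score. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the platform's actual similarity algorithm is not specified on this page. The function name `string_score` and the case-insensitive comparison are assumptions, and pair_homophones and consider_term_frequency are left out.

```python
from difflib import SequenceMatcher


def string_score(a, b, pair_substrings=False):
    """Return a similarity in [0.0, 1.0] for two strings.

    SequenceMatcher is a stand-in for the platform's fuzzy similarity.
    Lower-casing and trimming are assumptions made for this sketch.
    """
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    # pair_substrings: containment promotes the pair to a perfect score.
    if pair_substrings and a_norm and b_norm:
        if a_norm in b_norm or b_norm in a_norm:
            return 1.0
    return SequenceMatcher(None, a_norm, b_norm).ratio()


plain = string_score("ACME", "ACME Inc.")
promoted = string_score("ACME", "ACME Inc.", pair_substrings=True)
print(round(plain, 3), promoted)
```

Without the knob, the extra qualifier pulls the score well below 1.0; with it, the pair is promoted to 1.0.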

Numeric Target Fields

`match_type`	Behavior
`absolute` (default)	The pair scores `1.0` if `|a − b| <= offset`, otherwise `0.0`. Use a small `offset` to tolerate rounding or scale noise.
`relative`	The pair scores `1.0` if the relative difference between the two values is within `offset` (interpreted as a fraction, e.g. `0.05` for 5%), otherwise `0.0`. Useful when the same field can take values of very different magnitudes and a fixed delta would not scale.
`exact`	Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score.
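The two numeric modes can be sketched as follows. The absolute rule comes straight from the table above. The exact definition of "relative difference" is not given on this page, so this sketch assumes it is the absolute difference divided by the larger magnitude of the two values; `numeric_score` is a hypothetical helper.

```python
def numeric_score(a, b, offset, match_type="absolute"):
    """Return 1.0 when two numbers are close enough, otherwise 0.0."""
    diff = abs(a - b)
    if match_type == "absolute":
        return 1.0 if diff <= offset else 0.0
    if match_type == "relative":
        # Assumption: relative difference = |a - b| / max(|a|, |b|).
        scale = max(abs(a), abs(b))
        if scale == 0:
            return 1.0  # both values are zero, so they are identical
        return 1.0 if diff / scale <= offset else 0.0
    raise ValueError(f"unsupported match_type: {match_type}")


print(numeric_score(100.0, 100.4, offset=0.5))                    # absolute, within 0.5
print(numeric_score(1000, 1040, offset=0.05, match_type="relative"))  # about 3.8% apart
print(numeric_score(10, 11, offset=0.05, match_type="relative"))      # about 9.1% apart
```

The last two calls show why relative matching scales better: a difference of 40 is tolerated at a magnitude of 1000, while a difference of 1 is rejected at a magnitude of 10.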

Datetime Target Fields

`match_type`	Behavior
`offset` (default)	The pair scores `1.0` if the two timestamps are within `offset_seconds` of each other, otherwise `0.0`.
`granularity`	The pair scores `1.0` if both timestamps fall in the same bucket after truncation to the configured `granularity` (`Day`, `Week`, `Month`, or `Year`), otherwise `0.0`.
`exact`	Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score.
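The difference between offset and granularity matching shows up most clearly near a bucket boundary. The sketch below is illustrative only: the helper names are hypothetical, and the Week bucket is assumed to follow ISO weeks (Monday to Sunday), which this page does not confirm.

```python
from datetime import datetime


def offset_score(a, b, offset_seconds):
    """1.0 when the two timestamps are within offset_seconds of each other."""
    return 1.0 if abs((a - b).total_seconds()) <= offset_seconds else 0.0


def bucket(ts, granularity):
    """Truncate a timestamp to a comparable bucket key."""
    if granularity == "Day":
        return (ts.year, ts.month, ts.day)
    if granularity == "Week":
        iso = ts.isocalendar()  # assumption: ISO weeks, Monday to Sunday
        return (iso[0], iso[1])
    if granularity == "Month":
        return (ts.year, ts.month)
    if granularity == "Year":
        return (ts.year,)
    raise ValueError(f"unsupported granularity: {granularity}")


def granularity_score(a, b, granularity):
    """1.0 when both timestamps fall in the same bucket."""
    return 1.0 if bucket(a, granularity) == bucket(b, granularity) else 0.0


a = datetime(2026, 1, 31, 23, 59, 30)
b = datetime(2026, 2, 1, 0, 0, 15)

print(offset_score(a, b, offset_seconds=60))   # 45 seconds apart
print(granularity_score(a, b, "Day"))          # different calendar days
print(granularity_score(a, b, "Year"))         # same year
```

These two timestamps are 45 seconds apart, so a 60-second offset matches them, yet they fall in different Day and Month buckets. Pick granularity when calendar alignment matters and offset when elapsed time matters.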

Weights and the Composite Score

For each candidate pair, the platform computes the per-field score for every fuzzy field (exact fields are excluded from scoring because they were already used as a blocking pre-filter), multiplies each by the field's weight, sums the weighted scores, and divides by the total weight to produce the composite score (a value between 0.0 and 1.0):

composite = sum(score_i * weight_i) / sum(weight_i)

If the composite is greater than or equal to the composite match threshold, the pair is treated as a match. Increasing a field's weight increases its influence on whether the pair clears the threshold; decreasing it shrinks that influence, and a weight of 0 removes the field's contribution from the composite entirely.
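A worked example of the formula above, using made-up per-field scores. The function name `composite_score` and the field names are illustrative; only the arithmetic is taken from this page.

```python
def composite_score(scores, weights):
    """Weighted average of per-field scores: sum(score_i * weight_i) / sum(weight_i)."""
    total_weight = sum(weights[field] for field in scores)
    if total_weight == 0:
        raise ValueError("at least one fuzzy field needs a non-zero weight")
    weighted = sum(scores[field] * weights[field] for field in scores)
    return weighted / total_weight


scores = {"business_name": 0.9, "address": 0.5}
threshold = 0.7

# (0.9 * 1.0 + 0.5 * 0.8) / (1.0 + 0.8) = 1.3 / 1.8, roughly 0.722
balanced = composite_score(scores, {"business_name": 1.0, "address": 0.8})

# (0.9 * 1.0 + 0.5 * 2.0) / (1.0 + 2.0) = 1.9 / 3.0, roughly 0.633
address_heavy = composite_score(scores, {"business_name": 1.0, "address": 2.0})

print(round(balanced, 3), balanced >= threshold)
print(round(address_heavy, 3), address_heavy >= threshold)
```

The per-field scores are identical in both calls. Raising the weight of the weaker field (address) is enough to pull the composite below a 0.7 threshold.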

The Composite Match Threshold

The composite_match_threshold is a value between 0.0 and 1.0 (default 0.7):

Lower threshold (e.g. 0.6): tolerates more variation. More pairs match, clusters grow larger, more rows risk being grouped incorrectly.
Higher threshold (e.g. 0.9): requires near-identical entities. Fewer pairs match, clusters stay small, and real-world variations may be missed.

Tuning the threshold is the single most important decision when configuring this rule. Start at the default, review the Source Records of any anomalies the first scan produces, and adjust up or down depending on whether the cluster groupings reflect your business definition of "same entity."

The Filter Clause

The filter clause is a Spark SQL WHERE expression that the platform applies before entity resolution runs. It serves two purposes:

Scoping the check. Restrict resolution to a subset of the data (for example, status = 'active', tenant_id = 42, or created_at >= '2026-01-01'). Rows outside the scope are never paired and cannot trigger an anomaly.
Working around NULL handling on blocking fields. Records where a blocking (match_type: exact) target field is NULL cannot be paired with anything (NULL never equals NULL for blocking purposes). Use the filter to exclude those records explicitly when that matters.

The filter is part of the check definition, so the resulting anomaly's source records reflect only the filtered slice.

How Clusters Become Entities

Once the candidate pairs above the threshold are known, the platform groups them transitively to form clusters:

Two records that pair directly (A ↔ B) end up in the same cluster.
Two records that pair indirectly (A ↔ B ↔ C where A ↔ C is below threshold) still end up in the same cluster because they are reachable through B.

Each cluster gets a unique identifier exposed in the source records as a column called _qualytics_entity_id. The platform treats this as an internal column, so it appears in Source Records (alongside the original fields) but is rendered as a derived column rather than a user field.
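Transitive grouping of this kind is commonly implemented with a union-find (disjoint-set) structure. The sketch below shows that technique; it is not necessarily how the platform builds clusters, and using the smallest record index as the entity id is an assumption made purely for readability.

```python
def build_clusters(n_records, matched_pairs):
    """Return one entity id per record, merging matched pairs transitively."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root so ids are stable and readable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_records)]


# Records 0..3; A=0, B=1, C=2 with A-B and B-C above threshold, A-C below.
entity_ids = build_clusters(4, [(0, 1), (1, 2)])
print(entity_ids)
```

Records 0, 1, and 2 share one entity id even though 0 and 2 never matched directly, and record 3 stays in a cluster of its own.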

The Resulting Shape Anomaly

When the Entity Resolution check fires, it produces a single Shape Anomaly describing the dataset-level violation. The check does not produce Record Anomalies: an entity-resolution violation is a property of a cluster (which spans multiple rows), not of any single row's value.

Anomaly message format

N records were resolved to D distinct entities (composite threshold T: field_a (w=W), field_b (w=W) ...). K of those entities are assigned more than one value of <distinction_field>

When blocking (exact) target fields exist, the message includes them:

N records were resolved to D distinct entities (blocked on [field_x], composite threshold T: field_a (w=W) ...). K of those entities are assigned more than one value of <distinction_field>

When consider_term_frequency is enabled on a string field, the field summary includes +TF:

... composite threshold 0.7: business_name (w=1.0+TF), address (w=0.8) ...

When K = 1, the verb is "is"; when K > 1, the verb is "are".
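The template rules above can be reassembled into a small formatter. This is a reconstruction from the message examples on this page, not the platform's code: the function name, the argument shapes, and the comma separator between multiple blocking fields are assumptions.

```python
def anomaly_message(n, d, threshold, fuzzy_fields, k, distinction_field, blocked_on=()):
    """Build a message following the template described above.

    fuzzy_fields is a list of (name, weight, uses_term_frequency) tuples.
    """
    field_parts = [
        f"{name} (w={weight}{'+TF' if uses_tf else ''})"
        for name, weight, uses_tf in fuzzy_fields
    ]
    # Assumption: multiple blocking fields are comma-separated inside the brackets.
    blocked = f"blocked on [{', '.join(blocked_on)}], " if blocked_on else ""
    verb = "is" if k == 1 else "are"
    return (
        f"{n} records were resolved to {d} distinct entities "
        f"({blocked}composite threshold {threshold}: {', '.join(field_parts)}). "
        f"{k} of those entities {verb} assigned more than one value of {distinction_field}"
    )


message = anomaly_message(
    n=5,
    d=3,
    threshold=0.7,
    fuzzy_fields=[("business_name", 1.0, True), ("address", 0.8, False)],
    k=1,
    distinction_field="business_id",
    blocked_on=["country_code"],
)
print(message)
```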

What the numbers mean

N: the count of distinct records actually analyzed by entity resolution (after the filter, after the null filter on blocking fields, and after de-duplication on the resolution inputs).
D: the number of distinct entity clusters produced.
T: the composite match threshold the check is configured with.
K: the number of clusters where countDistinct(distinction_field) > 1 (the non-compliant clusters).

Source Records: What You Will See in the Anomaly

The Shape Anomaly's Source Records panel surfaces the rows that explain the violation. For Entity Resolution the rule is:

Take only the non-compliant clusters (clusters where the distinction field has more than one distinct value).
Within each non-compliant cluster, keep one example row per distinct value of the distinction field.

A cluster where business_id takes three different values across its records will contribute three rows to the source records (one per business_id), not the full set of records in that cluster. This makes the conflicting values visible at a glance without flooding the panel with redundant duplicates of the same business_id.
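The selection rule can be sketched in a few lines. Which row is kept as the "example" for each distinct value is not specified here, so this sketch assumes the first row encountered; `source_records` and the sample rows are hypothetical.

```python
def source_records(rows, distinction_field):
    """Keep one example row per distinct value, for non-compliant clusters only."""
    by_cluster = {}
    for row in rows:
        values = by_cluster.setdefault(row["_qualytics_entity_id"], {})
        # Assumption: the first row seen for each distinct value is the example.
        values.setdefault(row[distinction_field], row)

    selected = []
    for values in by_cluster.values():
        if len(values) > 1:  # more than one distinct value: non-compliant cluster
            selected.extend(values.values())
    return selected


rows = [
    {"_qualytics_entity_id": 0, "business_id": "B1", "name": "ACME"},
    {"_qualytics_entity_id": 0, "business_id": "B1", "name": "ACME Inc."},
    {"_qualytics_entity_id": 0, "business_id": "B2", "name": "Acme Inc"},
    {"_qualytics_entity_id": 1, "business_id": "B3", "name": "Globex"},
]

for row in source_records(rows, "business_id"):
    print(row)
```

Cluster 0 contributes two rows (one for B1, one for B2) even though it holds three records, and the compliant cluster 1 contributes nothing.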

Every source record carries the _qualytics_entity_id column so the cluster boundaries are obvious: records sharing the same _qualytics_entity_id are the records the platform thinks describe the same entity.

Performance Considerations

Entity resolution is more expensive than simple field-by-field checks because every candidate pair must be scored. Two practical implications:

Use blocking (exact) target fields when possible. A blocking field (such as country_code or tenant_id) prevents the platform from comparing every record against every other record; only records sharing the blocking value are even considered. This is the single most effective lever for reducing cost on large containers.
Filter to a meaningful scope. If uniqueness across an entire table is not needed (for example, you only want to resolve entities within the current tenant), set the filter to that scope explicitly.
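A quick back-of-the-envelope calculation shows how much a blocking field can reduce the number of pairs to score. The numbers below are invented for illustration, and `pair_counts` is a hypothetical helper; actual cost also depends on factors this page does not describe.

```python
from collections import Counter
from math import comb


def pair_counts(block_values):
    """Return (pairs without blocking, pairs with blocking) for a list of block keys."""
    unblocked = comb(len(block_values), 2)
    blocked = sum(comb(size, 2) for size in Counter(block_values).values())
    return unblocked, blocked


# 1,000 records spread evenly across 10 tenants.
tenant_ids = [record % 10 for record in range(1000)]
unblocked, blocked = pair_counts(tenant_ids)
print(unblocked, blocked)
```

With 1,000 records there are 499,500 possible pairs. Blocking on a field that splits them into 10 equal groups leaves 49,500, a tenfold reduction.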

Platform Safeguards on Large Datasets

The analytics engine applies three safeguards so entity resolution stays predictable even when the container is large:

Distinct-record cap. Entity resolution skips execution when the container has more than 5,000,000 distinct combinations across the target fields. In that case no anomalies are produced, but only because the check never evaluates the data, so the absence of anomalies does not mean the data passed. A message is written to the operation logs explaining that the cap was reached. This prevents runaway execution on unusually wide inputs.
Automatic join-strategy fallback. When the intermediate cluster mapping grows beyond an in-memory threshold, the platform switches to a distributed join so the operation still completes. This protects the run from out-of-memory failures on pathological cases such as one giant cluster spanning millions of records.
Prompt cache release. Intermediate results are released as soon as they are no longer needed, both on the success path and on interrupts or failures. This keeps memory usage stable when the same container is scanned repeatedly.

Low-Match Diagnostic Log

If more than 90% of candidate records remain as singletons (each forming its own cluster instead of joining a shared one), the analytics engine writes a warning to the operation logs suggesting the composite match threshold may be too high or that additional fuzzy matching fields could improve recall. Check the operation logs when you expect an Entity Resolution check to find matches but no anomalies are produced.
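The singleton condition can be checked against any list of entity ids, for example when reviewing resolution output offline. This is a sketch of the 90% rule as stated above; `singleton_ratio` is a hypothetical helper, and the engine's exact computation may differ.

```python
from collections import Counter


def singleton_ratio(entity_ids):
    """Fraction of records whose cluster contains only that record."""
    if not entity_ids:
        return 0.0
    sizes = Counter(entity_ids)
    singletons = sum(1 for size in sizes.values() if size == 1)
    return singletons / len(entity_ids)


# 31 records: one cluster of two, 29 singletons.
mostly_singletons = [0, 0] + list(range(1, 30))
ratio = singleton_ratio(mostly_singletons)
print(round(ratio, 3), ratio > 0.9)
```

A ratio above 0.9 corresponds to the condition that triggers the warning. Lowering the composite match threshold or adding fuzzy fields are the two adjustments the log message suggests.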

Relationship with Other Rule Types

Entity Resolution sits next to a few related rule types in the platform; combining them is common:

Rule Type	Why pair it with Entity Resolution
Unique	Unique guarantees no two rows share a value (or tuple of values) on the selected field(s). Entity Resolution goes further: it tolerates spelling variations and proximity, then asserts that those variations describe the same logical entity. Use Unique on a strict identifier (a primary key) and Entity Resolution on the descriptive fields that should identify the entity if normalized.
Not Null	Records where a blocking (`exact`) target field is NULL cannot be paired, so records with a NULL blocking value are silently left out of resolution. Pair Entity Resolution with a Not Null check on those fields to make the omission visible.
Satisfies Expression	Use Satisfies Expression to normalize a field before Entity Resolution runs (for example, lower-casing emails, stripping punctuation, or pre-computing a phonetic key). Pre-normalization reduces the work that fuzzy matching has to do.

Introduction: formal definition, target field types, field scope, and general/anomaly properties.
Examples: three production scenarios with sample data, source records, and resulting anomalies.
API: payload shape and field notes for creating an Entity Resolution check programmatically.
FAQ: short answers to the most frequent questions.