How Entity Resolution Checks Work
This page covers everything the Entity Resolution check does, in detail: how it clusters records, how exact and fuzzy target fields combine into a composite score, how the threshold decides whether two records are the same entity, how the distinction field is enforced after clustering, and what the resulting Shape Anomaly looks like.
If you only need a quick reference, the Introduction page covers the formal definition, field scope, and general/anomaly properties. This page is the detailed reference.
How the Check Evaluates Entity Resolution
Every Entity Resolution check follows the same five-step evaluation flow, regardless of how many target fields you configure:
- Apply the filter clause. If the check has a filter set, only the rows that match the filter expression continue to the next step. Rows that fall outside the filter are ignored and cannot cause a violation.
- Pre-filter on exact (blocking) target fields. If any target field has match_type set to exact, only records that share the same value on every exact field can ever be paired. Records with different values on an exact field are blocked from comparison.
- Score pairs against the fuzzy target fields. For each remaining candidate pair, the platform computes a similarity score per fuzzy field (fuzzy text similarity for strings, absolute or relative proximity for numerics, offset or granularity bucketing for datetimes), then combines those scores into a single weighted composite score.
- Build clusters by connecting pairs at or above the threshold. Every pair whose composite score is greater than or equal to the composite match threshold is treated as a match. Matches are grouped transitively: if A matches B and B matches C, all three records collapse into one cluster even if A and C do not reach the threshold as a direct pair. Each cluster receives a unique _qualytics_entity_id.
- Enforce the distinction field. Within each cluster, the platform counts the distinct values of the distinction_field. Clusters where that count is greater than 1 are non-compliant: the cluster groups records the platform thinks describe the same entity, but the data assigns them different identifiers.
The order of operations matters: blocking fields are applied before scoring, so records that disagree on an exact field never enter the same cluster, regardless of how similar their fuzzy fields are.
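The blocking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the function name candidate_pairs and the sample records are hypothetical, and records are modeled as plain dictionaries. The handling of NULL blocking values follows the behavior described in the filter clause section of this page.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, exact_fields):
    """Yield index pairs of records that agree on every exact (blocking) field."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        key = tuple(record.get(field) for field in exact_fields)
        # A NULL (None) blocking value never equals anything, so the record
        # cannot be paired and is left out of every block.
        if any(value is None for value in key):
            continue
        blocks[key].append(idx)
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        yield from combinations(members, 2)


# Hypothetical sample data.
records = [
    {"name": "ACME Inc.", "country_code": "US"},
    {"name": "ACME", "country_code": "US"},
    {"name": "ACME", "country_code": "DE"},
    {"name": "Acme GmbH", "country_code": None},
]
pairs = list(candidate_pairs(records, ["country_code"]))
print(pairs)  # [(0, 1)]
```

Only the two US records form a candidate pair. The DE record is alone in its block, and the record with a NULL country_code is never paired, however similar its name is.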
Target Fields: The Building Block
Every target field config has at least three pieces: the field name, the match_type, and a weight (default 1.0). The match_type determines which similarity formula is used.
String Target Fields
| match_type | Behavior |
|---|---|
| fuzzy (default) | Fuzzy text similarity between the two strings. Score ranges from 0.0 (no match) to 1.0 (identical). The optional knobs below can either promote a pair to a score of 1.0 (substring containment, homophone match) or adjust how tokens are weighted (term frequency). |
| exact | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. |
Three optional knobs on fuzzy string fields:
- pair_substrings: when true, if one string is contained in the other, the pair's score on this field is treated as 1.0. Useful when a name is sometimes recorded with extra qualifiers ("ACME" vs "ACME Inc.").
- pair_homophones: when true, if both strings sound alike (phonetic similarity), the pair's score on this field is treated as 1.0. Useful for names that sound the same but are spelled differently ("Catherine" vs "Katherine").
- consider_term_frequency: when true, rare tokens are weighted more heavily than common tokens when comparing the two strings. Useful when common words (e.g. "Inc", "Ltd", "Group") dilute the signal of distinctive words.
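To make the substring promotion concrete, here is a minimal sketch. It rests on several assumptions: the page does not say which fuzzy similarity algorithm the platform uses, so Python's difflib.SequenceMatcher stands in for it; the lower-casing and trimming are also assumptions; and string_score is a hypothetical helper. The pair_homophones and consider_term_frequency knobs are not modeled.

```python
from difflib import SequenceMatcher


def string_score(a, b, pair_substrings=False):
    """Return a similarity in [0.0, 1.0] for two strings."""
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if pair_substrings and a_norm and b_norm:
        # Substring containment promotes the pair to a perfect score.
        if a_norm in b_norm or b_norm in a_norm:
            return 1.0
    # Stand-in similarity measure; the platform's own formula may differ.
    return SequenceMatcher(None, a_norm, b_norm).ratio()


plain = string_score("ACME", "ACME Inc.")
promoted = string_score("ACME", "ACME Inc.", pair_substrings=True)
print(plain < 1.0, promoted)  # True 1.0
```

Without the knob, the extra qualifier lowers the score. With it, the pair scores 1.0 on this field regardless of how much text surrounds the contained string.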
Numeric Target Fields
| match_type | Behavior |
|---|---|
| absolute (default) | The pair scores 1.0 if abs(a − b) <= offset, otherwise 0.0. Use a small offset to tolerate rounding or scale noise. |
| relative | The pair scores 1.0 if the relative difference between the two values is within offset (interpreted as a fraction, e.g. 0.05 for 5%), otherwise 0.0. Useful when the same field can take values of very different magnitudes and a fixed delta would not scale. |
| exact | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. |
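The two numeric modes can be sketched as follows. The absolute rule is taken directly from the table above. The relative rule is an assumption: the page does not define the denominator, so this sketch divides by the larger magnitude of the two values. numeric_score is a hypothetical helper name.

```python
def numeric_score(a, b, offset, match_type="absolute"):
    """Return 1.0 when two numbers are close enough, otherwise 0.0."""
    diff = abs(a - b)
    if match_type == "absolute":
        return 1.0 if diff <= offset else 0.0
    if match_type == "relative":
        # Assumption: relative difference is measured against the larger magnitude.
        scale = max(abs(a), abs(b))
        if scale == 0:
            return 1.0  # both values are zero, so they are identical
        return 1.0 if diff / scale <= offset else 0.0
    raise ValueError(f"unsupported match_type: {match_type}")


print(numeric_score(100, 102, offset=5))                             # 1.0
print(numeric_score(100, 110, offset=5))                             # 0.0
print(numeric_score(1000, 1040, offset=0.05, match_type="relative")) # 1.0
print(numeric_score(10, 12, offset=0.05, match_type="relative"))     # 0.0
```

The last two calls show why relative matching scales: a gap of 40 is acceptable between values near 1000, while a gap of 2 is too large between values near 10.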
Datetime Target Fields
| match_type | Behavior |
|---|---|
| offset (default) | The pair scores 1.0 if the two timestamps are within offset_seconds of each other, otherwise 0.0. |
| granularity | The pair scores 1.0 if both timestamps fall in the same bucket after truncation to the configured granularity (Day, Week, Month, or Year), otherwise 0.0. |
| exact | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. |
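The difference between offset and granularity matching shows up when two timestamps straddle a bucket boundary. The sketch below is illustrative only: datetime_score and bucket are hypothetical helpers, the offset comparison is assumed to be inclusive, and weeks are assumed to follow ISO numbering (Monday start), which the page does not specify.

```python
from datetime import datetime


def bucket(ts, granularity):
    """Truncate a timestamp to a comparable bucket key."""
    if granularity == "Day":
        return ts.date()
    if granularity == "Week":
        iso = ts.isocalendar()  # Assumption: ISO weeks, starting on Monday.
        return (iso[0], iso[1])
    if granularity == "Month":
        return (ts.year, ts.month)
    if granularity == "Year":
        return ts.year
    raise ValueError(f"unsupported granularity: {granularity}")


def datetime_score(a, b, match_type="offset", offset_seconds=0, granularity="Day"):
    """Return 1.0 when two timestamps match under the chosen mode, otherwise 0.0."""
    if match_type == "offset":
        return 1.0 if abs((a - b).total_seconds()) <= offset_seconds else 0.0
    if match_type == "granularity":
        return 1.0 if bucket(a, granularity) == bucket(b, granularity) else 0.0
    raise ValueError(f"unsupported match_type: {match_type}")


# Two timestamps 40 seconds apart, on either side of midnight.
a = datetime(2026, 1, 15, 23, 59, 30)
b = datetime(2026, 1, 16, 0, 0, 10)
print(datetime_score(a, b, "offset", offset_seconds=60))       # 1.0
print(datetime_score(a, b, "granularity", granularity="Day"))  # 0.0
```

The pair matches with a 60-second offset but not with Day granularity, because truncation places the two timestamps in different calendar days.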
Weights and the Composite Score
For each candidate pair, the platform computes the per-field score for every fuzzy field (exact fields are excluded from scoring because they were already used as a blocking pre-filter), multiplies each score by the field's weight, sums the weighted scores, and divides by the total weight. The result is the composite score, a value between 0.0 and 1.0:
composite score = (w_1 × s_1 + w_2 × s_2 + ... + w_n × s_n) / (w_1 + w_2 + ... + w_n)
Here s_i is the score of scored field i and w_i is that field's weight.
If the composite is greater than or equal to the composite match threshold, the pair is treated as a match. Increasing a field's weight increases its influence on whether the pair clears the threshold; decreasing it (down to 0) shrinks its influence accordingly.
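The weighted average described in this section can be sketched as below. composite_score, the field names, and the sample weights are hypothetical. Returning 0.0 when every weight is 0 is an assumption, since the page does not describe that edge case.

```python
def composite_score(field_scores, weights):
    """Weighted average of per-field scores; both arguments are keyed by field name."""
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight == 0:
        return 0.0  # Assumption: no weighted fields means no evidence of a match.
    weighted_sum = sum(field_scores[name] * weights[name] for name in field_scores)
    return weighted_sum / total_weight


# Hypothetical configuration: the name field counts twice as much as the others.
weights = {"name": 2.0, "revenue": 1.0, "founded_at": 1.0}
scores = {"name": 1.0, "revenue": 1.0, "founded_at": 0.0}

score = composite_score(scores, weights)  # (2.0 + 1.0 + 0.0) / 4.0
is_match = score >= 0.7                   # default composite match threshold
print(score, is_match)  # 0.75 True
```

With all three weights equal to 1.0, the same per-field scores would give 2/3, which is below the default threshold of 0.7. Doubling the weight of name is what lets this pair clear it.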
The Composite Match Threshold
The composite_match_threshold is a value between 0.0 and 1.0 (default 0.7):
- Lower threshold (e.g. 0.6): tolerates more variation. More pairs match, clusters grow larger, and more rows risk being grouped incorrectly.
- Higher threshold (e.g. 0.9): requires near-identical entities. Fewer pairs match, clusters stay small, and real-world variations may be missed.
Tuning the threshold is the single most important decision when configuring this rule. Start at the default, look at the Source Records of any anomalies the first scan produces, and adjust up or down depending on whether the cluster groupings reflect your business definition of "same entity."
The Filter Clause
The filter clause is a Spark SQL WHERE expression that the platform applies before entity resolution runs. It serves two purposes:
- Scoping the check. Restrict resolution to a subset of the data (for example, status = 'active', tenant_id = 42, or created_at >= '2026-01-01'). Rows outside the scope are never paired and cannot trigger an anomaly.
- Working around NULL handling on blocking fields. Records where a blocking (match_type: exact) target field is NULL cannot be paired with anything (NULL never equals NULL for blocking purposes). Use the filter to exclude those records explicitly when that matters.
The filter is part of the check definition, so the resulting anomaly's source records reflect only the filtered slice.
How Clusters Become Entities
Once the candidate pairs above the threshold are known, the platform groups them transitively to form clusters:
- Two records that pair directly (A ↔ B) end up in the same cluster.
- Two records that pair indirectly (A ↔ B ↔ C where A ↔ C is below threshold) still end up in the same cluster because they are reachable through B.
Each cluster gets a unique identifier exposed in the source records as a column called _qualytics_entity_id. The platform treats this as an internal column, so it appears in Source Records (alongside the original fields) but is rendered as a derived column rather than a user field.
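Transitive grouping of this kind is commonly implemented with a union-find (disjoint-set) structure. The sketch below uses that technique as an illustration. The page does not say how the platform builds clusters or how _qualytics_entity_id values are generated, so build_clusters is hypothetical and the smallest record index in each cluster stands in for the identifier.

```python
def build_clusters(n_records, matched_pairs):
    """Return one cluster id per record, grouping matched pairs transitively."""
    parent = list(range(n_records))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root so ids are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_records)]


# Records 0-1 and 1-2 match; 0-2 is below threshold; record 3 matches nothing.
entity_ids = build_clusters(4, [(0, 1), (1, 2)])
print(entity_ids)  # [0, 0, 0, 3]
```

Records 0 and 2 share a cluster even though they were never matched directly, because both are reachable through record 1. Record 3 forms a cluster of its own.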
The Resulting Shape Anomaly
When the Entity Resolution check fires, it produces a single Shape Anomaly describing the dataset-level violation. The check does not produce Record Anomalies: an entity-resolution violation is a property of a cluster (which spans multiple rows), not of any single row's value.
Anomaly message format
N records were resolved to D distinct entities (composite threshold T: field_a (w=W), field_b (w=W) ...). K of those entities are assigned more than one value of <distinction_field>
When blocking (exact) target fields exist, the message includes them:
N records were resolved to D distinct entities (blocked on [field_x], composite threshold T: field_a (w=W) ...). K of those entities are assigned more than one value of <distinction_field>
When consider_term_frequency is enabled on a string field, the summary for that field includes a +TF marker.
The verb agrees with K: when K = 1, the message reads "is assigned"; when K > 1, it reads "are assigned".
What the numbers mean
- N: the count of distinct records actually analyzed by entity resolution (after the filter, after the null filter on blocking fields, and after de-duplication on the resolution inputs).
- D: the number of distinct entity clusters produced.
- T: the composite match threshold the check is configured with.
- K: the number of clusters where countDistinct(distinction_field) > 1 (the non-compliant clusters).
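To show how N, D, T, and K slot into the message templates above, here is a small formatting sketch. anomaly_message is a hypothetical helper and the sample numbers are invented for illustration. The exact rendering of weights, the threshold, and multiple blocking fields (comma-separated here) is an assumption, and the +TF marker is left out because its position is not shown on this page.

```python
def anomaly_message(n, d, k, threshold, fuzzy_fields, distinction_field, blocking_fields=()):
    """Assemble a message following the templates in this section."""
    summary = ", ".join(f"{name} (w={weight})" for name, weight in fuzzy_fields)
    # The blocking clause only appears when exact target fields exist.
    blocked = f"blocked on [{', '.join(blocking_fields)}], " if blocking_fields else ""
    verb = "is" if k == 1 else "are"
    return (
        f"{n} records were resolved to {d} distinct entities "
        f"({blocked}composite threshold {threshold}: {summary}). "
        f"{k} of those entities {verb} assigned more than one value of {distinction_field}"
    )


message = anomaly_message(
    n=120, d=95, k=1, threshold=0.7,
    fuzzy_fields=[("name", 1.0), ("city", 0.5)],
    distinction_field="business_id",
    blocking_fields=["country_code"],
)
print(message)
```

Because K is 1 in this example, the message uses "is assigned". Passing k=3 with the same arguments would switch the verb to "are".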
Source Records: What You Will See in the Anomaly
The Shape Anomaly's Source Records panel surfaces the rows that explain the violation. For Entity Resolution the rule is:
- Take only the non-compliant clusters (clusters where the distinction field has more than one distinct value).
- Within each non-compliant cluster, keep one example row per distinct value of the distinction field.
A cluster where business_id takes three different values across its records will contribute three rows to the source records (one per business_id), not the full set of records in that cluster. This makes the conflicting values visible at a glance without flooding the panel with redundant duplicates of the same business_id.
Every source record carries the _qualytics_entity_id column so the cluster boundaries are obvious: records sharing the same _qualytics_entity_id are the records the platform thinks describe the same entity.
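The selection rule can be sketched as follows. source_records and the sample rows are hypothetical. Two details are assumptions: the first row seen is used as the example for each distinct value, and NULL distinction values are ignored when counting, mirroring how SQL distinct counts treat NULL.

```python
from collections import defaultdict


def source_records(records, entity_ids, distinction_field):
    """Keep one example row per distinct value, for non-compliant clusters only."""
    clusters = defaultdict(dict)
    for record, entity_id in zip(records, entity_ids):
        value = record[distinction_field]
        if value is None:
            continue  # Assumption: NULLs do not count as a distinct value.
        # setdefault keeps the first row seen for each distinct value.
        clusters[entity_id].setdefault(value, record)

    rows = []
    for entity_id, by_value in clusters.items():
        if len(by_value) > 1:  # more than one distinct value: non-compliant
            for record in by_value.values():
                rows.append({**record, "_qualytics_entity_id": entity_id})
    return rows


# Hypothetical rows; the first three were resolved into the same cluster.
records = [
    {"name": "ACME Inc.", "business_id": "B1"},
    {"name": "ACME", "business_id": "B2"},
    {"name": "Acme Inc", "business_id": "B1"},
    {"name": "Zenith", "business_id": "B3"},
]
rows = source_records(records, [0, 0, 0, 3], "business_id")
print([(r["business_id"], r["_qualytics_entity_id"]) for r in rows])
# [('B1', 0), ('B2', 0)]
```

The cluster holds three records but contributes only two rows, one per conflicting business_id. The compliant single-record cluster contributes none.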
Performance Considerations
Entity resolution is more expensive than simple field-by-field checks because every candidate pair must be scored. Two practical implications:
- Use blocking (exact) target fields when possible. A blocking field (such as country_code or tenant_id) prevents the platform from comparing every record against every other record; only records sharing the blocking value are even considered. This is the single most effective lever for reducing cost on large containers.
- Filter to a meaningful scope. If uniqueness across an entire table is not needed (for example, you only want to resolve entities within the current tenant), set the filter to that scope explicitly.
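A quick count shows why blocking matters. The figures below are hypothetical (10,000 records split evenly into 100 blocks) and pair_count is an illustrative helper. The sketch only counts candidate pairs and says nothing about how the platform schedules the work.

```python
from math import comb


def pair_count(block_sizes):
    """Number of candidate pairs when records are only compared within their block."""
    return sum(comb(size, 2) for size in block_sizes)


unblocked = pair_count([10_000])    # every record compared with every other
blocked = pair_count([100] * 100)   # 100 blocks of 100 records each
print(unblocked, blocked)  # 49995000 495000
```

In this example the blocked configuration scores about 1% of the pairs that the unblocked one does. Real data is rarely split this evenly, so the saving depends on how the blocking values are distributed.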
Relationship with Other Rule Types
Entity Resolution sits next to a few related rule types in the platform; combining them is common:
| Rule Type | Why pair it with Entity Resolution |
|---|---|
| Unique | Unique guarantees no two rows share a value (or tuple of values) on the selected field(s). Entity Resolution goes further: it tolerates spelling variations and proximity, then asserts that those variations describe the same logical entity. Use Unique on a strict identifier (a primary key) and Entity Resolution on the descriptive fields that should identify the entity if normalized. |
| Not Null | Records where a blocking (exact) target field is NULL cannot be paired, so records with a NULL in a blocking field are silently left out of resolution. Pair Entity Resolution with a Not Null check on those fields to make the omission visible. |
| Satisfies Expression | Use Satisfies Expression to normalize a field before Entity Resolution runs (for example, lower-casing emails, stripping punctuation, or pre-computing a phonetic key). Pre-normalization reduces the work that fuzzy matching has to do. |
Related
- Introduction: formal definition, target field types, field scope, and general/anomaly properties.
- Examples: three production scenarios with sample data, source records, and resulting anomalies.
- API: payload shape and field notes for creating an Entity Resolution check programmatically.
- FAQ: short answers to the most frequent questions.