Entity Resolution FAQ

Common questions about how the Entity Resolution check clusters records, how target fields combine, and how anomalies are reported.

Behavior

How does Entity Resolution decide that two records are the same entity?

The platform scores every candidate pair on each fuzzy target field, combines the scores into a single weighted composite, and treats the pair as a match whenever the composite is greater than or equal to the composite match threshold. Exact (blocking) target fields filter candidate pairs before scoring rather than contributing to the score themselves. Records connected through any chain of matching pairs are placed in the same cluster: if A matches B and B matches C, all three form the cluster {A, B, C} even if the composite score between A and C is below the threshold.
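The chained-match behavior can be illustrated with a small union-find sketch. This is not the platform's implementation; the function name cluster_records and the composite scores below are hypothetical, chosen only to reproduce the A, B, C case described above.

```python
def cluster_records(record_ids, pair_scores, threshold):
    """Group records linked by any chain of matching pairs (union-find)."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster root, compressing the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for (left, right), score in pair_scores.items():
        if score >= threshold:  # a pair matches when composite >= threshold
            parent[find(left)] = find(right)

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical composite scores for three candidate pairs.
pair_scores = {("A", "B"): 0.82, ("B", "C"): 0.75, ("A", "C"): 0.55}
clusters = cluster_records(["A", "B", "C", "D"], pair_scores, threshold=0.7)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

A and C land in the same cluster through B even though their own score (0.55) is below the threshold. Raising the threshold to 0.8 breaks the B-C link and leaves C on its own.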

What is the difference between a fuzzy field and an exact (blocking) field?

A fuzzy field contributes to the composite similarity score. An exact field acts as a hard pre-filter: pairs that disagree on an exact field are never even compared. Use exact fields for hard boundaries such as tenant_id or country_code, where two records on different sides of the boundary should never be treated as the same entity regardless of how similar their remaining fields are. Exact fields also improve performance because they shrink the set of candidate pairs.
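As a rough illustration of why blocking shrinks the work, the sketch below generates candidate pairs only within groups that agree on every exact field. The record layout, the id key, and the candidate_pairs helper are assumptions made for the example, not platform code.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_fields):
    """Return id pairs that agree on every blocking field."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in blocking_fields)
        if any(value is None for value in key):
            # Per the FAQ, a NULL on a blocking field excludes the record.
            continue
        blocks[key].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical records: two tenants plus one record with no tenant.
records = [
    {"id": 1, "tenant_id": "a", "name": "Acme Corp"},
    {"id": 2, "tenant_id": "a", "name": "ACME Corporation"},
    {"id": 3, "tenant_id": "b", "name": "Acme Corp"},
    {"id": 4, "tenant_id": "b", "name": "Acme Co"},
    {"id": 5, "tenant_id": "a", "name": "Apex Ltd"},
    {"id": 6, "tenant_id": None, "name": "Acme Corp"},
]
pairs = candidate_pairs(records, ["tenant_id"])
print(pairs)  # [(1, 2), (1, 5), (2, 5), (3, 4)]
```

Without blocking, six records would produce 15 candidate pairs; blocking on tenant_id leaves 4, and records 1 and 3 are never compared despite having identical names.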

What does the composite match threshold control?

The threshold is the cutoff for treating a pair as a match. A threshold of 0.7 means a pair is a match only if its weighted composite score is at least 0.7 (70%). Lowering the threshold widens clusters (more variation tolerated); raising it tightens them (closer to exact matches). The default is 0.7. The Source Records from the first scan are the best place to start tuning.

How are NULLs treated on target fields?

For blocking (exact) target fields, a record with NULL is excluded from pairing: blocking treats NULL as not equal to itself, which prevents the record from entering any cluster. For fuzzy target fields, NULL on either side of a pair contributes 0.0 to that field's similarity score, but the field's weight is still counted in the composite. A pair with one NULL fuzzy field can still match if the other fuzzy fields score high enough to bring the weighted composite to or above the threshold. If you want to exclude records with a NULL blocking field from resolution entirely, add an IS NOT NULL clause to the filter.
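A short worked example of the fuzzy-field rule, assuming the composite is a weighted average normalized by the sum of the weights (the FAQ says "weighted composite" but does not spell out the normalization, so treat that as an assumption). The field names, weights, and scores are hypothetical.

```python
def composite_score(similarities, weights):
    """Weighted average of per-field scores; None stands for a NULL on either side."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[field] * (0.0 if similarities[field] is None else similarities[field])
        for field in weights
    )
    # The NULL field adds 0.0 to the numerator but its weight stays in the denominator.
    return weighted / total_weight


weights = {"name": 5, "email": 3, "phone": 2}
threshold = 0.7

strong_pair = {"name": 0.95, "email": 0.90, "phone": None}
weak_pair = {"name": 0.90, "email": 0.80, "phone": None}

strong = composite_score(strong_pair, weights)  # (4.75 + 2.7 + 0) / 10, about 0.745
weak = composite_score(weak_pair, weights)      # (4.5 + 2.4 + 0) / 10, about 0.69
print(strong >= threshold, weak >= threshold)   # True False
```

Both pairs are missing phone, yet only the first still clears 0.7: the NULL field caps the achievable composite at 0.8 under these weights, so the remaining fields have to score very high.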

Does the filter clause run before or after entity resolution?

Before. The platform applies the filter first, then runs blocking, scoring, and clustering on the surviving rows. The distinction field is evaluated within the resulting clusters. This lets you scope a check to a meaningful slice (for example, status = 'active') without flagging clusters that exist outside the scope.

Anomaly Reporting

Which rows appear in the Shape Anomaly's Source Records?

The Source Records include only rows from clusters where the distinction field has more than one distinct value. Within each non-compliant cluster, only one example row per distinct value of the distinction field is shown. For example, a cluster where customer_id takes three different values contributes three rows to the Source Records (not the full set of records in the cluster, but enough to make every conflicting value visible).
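The selection rule can be sketched as follows. The cluster_id key is a hypothetical stand-in for the cluster identifier, and picking the first row seen for each value is an assumption; the FAQ does not say which example row the platform chooses.

```python
def select_source_records(rows, distinction_field, cluster_field="cluster_id"):
    """Keep one example row per distinct value, only for non-compliant clusters."""
    rows_by_cluster = {}
    for row in rows:
        rows_by_cluster.setdefault(row[cluster_field], []).append(row)

    selected = []
    for cluster_rows in rows_by_cluster.values():
        first_row_by_value = {}
        for row in cluster_rows:
            # Assumption: the first row seen for a value is the example shown.
            first_row_by_value.setdefault(row[distinction_field], row)
        if len(first_row_by_value) > 1:  # more than one distinct value
            selected.extend(first_row_by_value.values())
    return selected


# Hypothetical clustered rows: cluster 1 conflicts, cluster 2 is compliant.
rows = [
    {"cluster_id": 1, "customer_id": "C1", "name": "Jon Smith"},
    {"cluster_id": 1, "customer_id": "C1", "name": "John Smith"},
    {"cluster_id": 1, "customer_id": "C2", "name": "Jon Smyth"},
    {"cluster_id": 1, "customer_id": "C3", "name": "J. Smith"},
    {"cluster_id": 2, "customer_id": "C9", "name": "Ana Ruiz"},
    {"cluster_id": 2, "customer_id": "C9", "name": "Anna Ruiz"},
]
shown = select_source_records(rows, "customer_id")
print([r["customer_id"] for r in shown])  # ['C1', 'C2', 'C3']
```

Cluster 1 has four rows but contributes three, one per conflicting customer_id; cluster 2 contributes none because it has a single distinct value.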

What is the _qualytics_entity_id column in the Source Records?

It is the cluster identifier the platform assigned to each record. Records sharing the same _qualytics_entity_id are the ones the platform identified as the same entity. The UI displays it as a system-managed column (it is not a real field on your container) and includes it in the Source Records to make cluster boundaries obvious.

What does the Shape Anomaly message look like?

The message follows this template:

N records were resolved to D distinct entities (composite threshold T: field_a (w=W), field_b (w=W) ...). K of those entities are assigned more than one value of <distinction_field>
  • N is the count of distinct records analyzed (after the filter, after removing NULLs on blocking fields, and after de-duplication).
  • D is the number of distinct clusters produced.
  • T is the composite match threshold.
  • K is the number of clusters where the distinction field has more than one distinct value.

When the check uses blocking fields, the message includes blocked on [...]. When consider_term_frequency is enabled for a fuzzy string field, its summary entry includes +TF.
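To make the counts concrete, the sketch below derives N, D, and K from a list of cluster assignments and fills in the base template. It assumes the input has already been filtered, stripped of NULL blocking values, and de-duplicated. The summarize helper is hypothetical, and the exact number formatting and the placement of the blocked on [...] and +TF parts are not reproduced here.

```python
def summarize(assignments, threshold, weights, distinction_field):
    """assignments: (cluster_id, distinction_value) for each analyzed record."""
    values_by_cluster = {}
    for cluster_id, value in assignments:
        values_by_cluster.setdefault(cluster_id, set()).add(value)

    n = len(assignments)          # records analyzed
    d = len(values_by_cluster)    # distinct clusters
    k = sum(1 for values in values_by_cluster.values() if len(values) > 1)

    fields = ", ".join(f"{name} (w={weight})" for name, weight in weights.items())
    message = (
        f"{n} records were resolved to {d} distinct entities "
        f"(composite threshold {threshold}: {fields}). "
        f"{k} of those entities are assigned more than one value of {distinction_field}"
    )
    return n, d, k, message


# Hypothetical assignments: cluster 1 carries two different customer_id values.
assignments = [(1, "C1"), (1, "C2"), (2, "C3"), (2, "C3"), (3, "C4")]
n, d, k, message = summarize(
    assignments, 0.7, {"name": 0.6, "email": 0.4}, "customer_id"
)
print(n, d, k)  # 5 3 1
print(message)
```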

Why doesn't Entity Resolution produce a Record Anomaly?

Entity Resolution is a shape-only rule type. The violation is a property of a cluster (multiple records together), not of any single record's value, so the anomaly is reported at the shape level only.

Configuration

Is Coverage supported?

No. The Entity Resolution form has no Coverage option and the API does not accept a coverage value. A cluster is either compliant (one distinct value of the distinction field) or non-compliant; there is no fractional tolerance.

Can I mix fuzzy and exact target fields in the same check?

Yes, and it is the recommended pattern when the data has a hard boundary such as tenant or country. Mark the boundary fields as exact (they become blocking pre-filters) and leave the descriptive fields as fuzzy (they contribute to the composite score). Exact fields do not affect the composite score; their role is to constrain which pairs are eligible for scoring.

Can the same field appear as both a blocking field and a fuzzy field?

No. Each target field has a single match_type. If you need a field to behave differently across scenarios, create separate Entity Resolution checks scoped to those scenarios with a filter.

Does Custom Anomaly Description (the anomaly_message_field payload field) work for Entity Resolution?

No. The Custom Anomaly Description feature only affects Record Anomaly messages. Because Entity Resolution emits only Shape Anomalies, the field has no effect, and the resulting anomaly uses the fixed Shape Anomaly template described above.

Migration from the Single-Field Check

How were my existing Entity Resolution checks migrated?

Earlier versions of the Entity Resolution check compared a single string field with a fixed set of fuzzy options applied to the check as a whole. The current version takes a list of target fields, each with its own type (String, Numeric, or DateTime), match strategy, and weight.

Existing checks were migrated automatically. After migration, each old check uses the new target-fields structure with one String target field that carries forward your previous fuzzy settings (pair_substrings, pair_homophones, spelling_similarity_threshold, plus consider_term_frequency off by default). The field the old check was attached to becomes that target field's field_name.

The platform also derives a composite match threshold for every migrated check, computed from your previous spelling_similarity_threshold so the migrated check accepts the same record pairs as before. Common values:

Previous spelling_similarity_threshold    New composite_match_threshold
0.95                                      0.973
0.80 (old default)                        0.891
0.70                                      0.836
0.50                                      0.727
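For values not listed, check the migrated check itself rather than guessing. As a sanity check only, the four documented rows happen to agree, to three decimals, with the linear relation (5 + 6s) / 11, where s is the previous spelling_similarity_threshold. This is an observation about the table, not a documented formula, so do not rely on it for other inputs.

```python
# The four rows documented in the table above.
MIGRATION_TABLE = {0.95: 0.973, 0.80: 0.891, 0.70: 0.836, 0.50: 0.727}


def fitted_composite(spelling_threshold):
    """Linear fit to the documented rows (an observation, not the platform's formula)."""
    return round((5 + 6 * spelling_threshold) / 11, 3)


fit_matches = {old: fitted_composite(old) == new for old, new in MIGRATION_TABLE.items()}
print(all(fit_matches.values()))  # True
```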

After migration, your check works exactly as it did before. You can edit it to add more target fields, tune the composite threshold, or mark fields as exact to use them as blocking pre-filters. Nothing changes until you edit the check.

Did my API payloads change?

Yes. The Entity Resolution payload no longer has top-level pair_substrings, pair_homophones, or spelling_similarity_threshold fields. Their semantics are now carried inside the migrated StringTargetField. Scripts or integrations that POST or PATCH Entity Resolution checks need to send the new target_fields array and composite_match_threshold. See the API reference for the current payload format.

  • Introduction: formal definition, target field types, field scope, and general/anomaly properties.
  • How It Works: full semantics, clustering behavior, threshold tuning, and source-records behavior.
  • Examples: three production scenarios with sample data, source records, and resulting anomalies.
  • API: payload shape and field notes for creating an Entity Resolution check programmatically.