Entity Resolution Examples
Three real-world scenarios that show how the Entity Resolution check is typically used in production: deduplicating a customer master by name and address, consolidating businesses by name with phonetic and substring matching, and matching contacts within a tenant boundary using a blocking field.
Each scenario shows the Source Records that would appear in the resulting Shape Anomaly. Source Records surface one example row per distinct value of the distinction field within each non-compliant cluster, alongside the cluster identifier _qualytics_entity_id so the cluster boundaries are visible.
The situation: Your customers table is the master record for downstream billing. Each row has a customer_id that should be the single identifier per customer, but historic ingestions from multiple sources have produced near-duplicate records with slightly different spellings of the same person's full_name and address. You want Entity Resolution to surface customers where two different customer_id values plausibly describe the same person.
Check configuration
| Field | Value |
|---|---|
| Rule | Entity Resolution |
| Distinction Field | customer_id |
| Target Fields | full_name (String, fuzzy, pair_substrings: true, weight: 1.0), address (String, fuzzy, weight: 0.8) |
| Composite Match Threshold | 0.75 |
| Filter | (none) |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (customer-data steward) |
| Tags | pii, master-data |
| Additional Metadata | jira: DATA-4101 |
| Description | Customers with similar names and addresses must share a customer_id |
Payload
{
"description": "Customers with similar names and addresses must share a customer_id",
"rule": "entityResolution",
"fields": [],
"container_id": 145,
"filter": null,
"properties": {
"distinct_field_name": "customer_id",
"composite_match_threshold": 0.75,
"target_fields": [
{
"upickle_type": "StringTargetField",
"field_name": "full_name",
"match_type": "fuzzy",
"pair_substrings": true,
"pair_homophones": false,
"consider_term_frequency": false,
"weight": 1.0
},
{
"upickle_type": "StringTargetField",
"field_name": "address",
"match_type": "fuzzy",
"pair_substrings": false,
"pair_homophones": false,
"consider_term_frequency": false,
"weight": 0.8
}
]
},
"tags": ["pii", "master-data"],
"additional_metadata": {"jira": "DATA-4101"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 12
}
Source Records
| _qualytics_entity_id | customer_id | full_name | address |
|---|---|---|---|
| ent-a01f | 1001 | Alice Cohen | 142 Maple St |
| ent-a01f | 1057 | Alice C. | 142 Maple Street |
| ent-b73c | 1102 | Catherine Wu | 87 Elm Avenue |
| ent-b73c | 1184 | Catherine Wu | 87 Elm Ave. |
What gets flagged
Two non-compliant clusters appear in the source records:
- ent-a01f resolved two records the platform considers the same customer ("Alice Cohen, 142 Maple St" ↔ "Alice C., 142 Maple Street"). The fuzzy match on full_name reaches the threshold because pair_substrings promotes "Alice C." against "Alice Cohen", and the address pair is near-identical. The cluster holds two different customer_id values (1001 and 1057), so it is non-compliant.
- ent-b73c resolved two records with identical names and only an abbreviation difference in address ("87 Elm Avenue" vs. "87 Elm Ave."). The cluster holds two different customer_id values (1102 and 1184), so it is also non-compliant.
Each non-compliant cluster contributes one row per distinct customer_id to the Source Records panel, for a total of four rows in this scan.
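The row-selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: the function name `source_records`, the choice of the first row seen as the "example" row, and the extra compliant cluster `ent-zzzz` are all assumptions made for the sketch.

```python
from collections import defaultdict


def source_records(rows, distinction_field, entity_field="_qualytics_entity_id"):
    """Return one example row per distinct distinction value in each non-compliant cluster.

    Assumption: the first row seen for a given value is used as the example row.
    """
    clusters = defaultdict(dict)  # entity id -> {distinction value: first row seen}
    for row in rows:
        clusters[row[entity_field]].setdefault(row[distinction_field], row)

    flagged = []
    for by_value in clusters.values():
        if len(by_value) > 1:  # more than one distinct value: cluster is non-compliant
            flagged.extend(by_value.values())
    return flagged


rows = [
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1001, "full_name": "Alice Cohen"},
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1057, "full_name": "Alice C."},
    {"_qualytics_entity_id": "ent-b73c", "customer_id": 1102, "full_name": "Catherine Wu"},
    {"_qualytics_entity_id": "ent-b73c", "customer_id": 1184, "full_name": "Catherine Wu"},
    # Hypothetical compliant cluster: two rows, but a single customer_id.
    {"_qualytics_entity_id": "ent-zzzz", "customer_id": 1300, "full_name": "Bob Lee"},
    {"_qualytics_entity_id": "ent-zzzz", "customer_id": 1300, "full_name": "Robert Lee"},
]

result = source_records(rows, "customer_id")
print(len(result))  # 4
```

The hypothetical `ent-zzzz` cluster contributes nothing because both of its rows share one customer_id, which mirrors why compliant clusters never reach the Source Records panel.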
Shape Anomaly
184 records were resolved to 173 distinct entities (composite threshold 0.75: full_name (w=1.0), address (w=0.8)). 2 of those entities are assigned more than one value of customer_id
Flowchart
graph TD
A["Filter: none, evaluate all customers"] --> B["Score every candidate pair on full_name + address"]
B --> C{"Composite score ≥ 0.75?"}
C -->|No| D["Pair is not a match"]
C -->|Yes| E["Connect both records in the same cluster"]
E --> F["Assign cluster _qualytics_entity_id"]
F --> G{"Cluster has more than one customer_id?"}
G -->|No| H["Cluster is compliant"]
G -->|Yes| I["Flag cluster. Source Records gets one row per distinct customer_id."]
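The scoring and threshold steps in the flowchart can be approximated as follows. This is a rough sketch under stated assumptions: the source does not say how the composite score is computed, so the weight-normalized average below is a guess, and `difflib.SequenceMatcher` stands in for the platform's fuzzy matcher. Options such as pair_substrings, pair_homophones, and consider_term_frequency are not modeled, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in fuzzy similarity in [0, 1]; NOT the platform's matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def composite_score(rec_a, rec_b, weights):
    """Assumed composite: weight-normalized average of per-field similarities."""
    total_weight = sum(weights.values())
    weighted = sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total_weight


def is_match(rec_a, rec_b, weights, threshold):
    """A pair is connected into the same cluster when the score clears the threshold."""
    return composite_score(rec_a, rec_b, weights) >= threshold


# Weights and threshold taken from the check configuration above.
weights = {"full_name": 1.0, "address": 0.8}
rec_a = {"full_name": "Catherine Wu", "address": "87 Elm Avenue"}
rec_b = {"full_name": "Catherine Wu", "address": "87 Elm Ave."}

print(is_match(rec_a, rec_b, weights, threshold=0.75))
```

Normalizing by the sum of the weights keeps the score in the same 0 to 1 range as the threshold regardless of how many target fields are configured; whether the platform normalizes this way is not confirmed by the source.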
The situation: Your businesses table aggregates business records from three vendor feeds. The same business often appears under variant spellings of business_name ("Catherine's Books", "Katherine's Books", "Catherines Books LLC") and each feed assigns its own business_id. You want to surface businesses where the platform believes the names describe the same entity but business_id disagrees.
Check configuration
| Field | Value |
|---|---|
| Rule | Entity Resolution |
| Distinction Field | business_id |
| Target Fields | business_name (String, fuzzy, pair_substrings: true, pair_homophones: true, consider_term_frequency: true, weight: 1.0) |
| Composite Match Threshold | 0.7 |
| Filter | (none) |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (business-master steward) |
| Tags | consolidation, vendor-feeds |
| Additional Metadata | jira: DATA-4207 |
| Description | Similar business names should resolve to the same business_id |
Payload
{
"description": "Similar business names should resolve to the same business_id",
"rule": "entityResolution",
"fields": [],
"container_id": 212,
"filter": null,
"properties": {
"distinct_field_name": "business_id",
"composite_match_threshold": 0.7,
"target_fields": [
{
"upickle_type": "StringTargetField",
"field_name": "business_name",
"match_type": "fuzzy",
"pair_substrings": true,
"pair_homophones": true,
"consider_term_frequency": true,
"weight": 1.0
}
]
},
"tags": ["consolidation", "vendor-feeds"],
"additional_metadata": {"jira": "DATA-4207"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 18
}
Source Records
| _qualytics_entity_id | business_id | business_name |
|---|---|---|
| ent-c4d1 | 5001 | Catherine's Books |
| ent-c4d1 | 5042 | Katherine's Books |
| ent-c4d1 | 5108 | Catherines Books LLC |
| ent-e8f2 | 5314 | ACME Boxing |
| ent-e8f2 | 5331 | ACME Boxes |
What gets flagged
Two non-compliant clusters appear in the source records:
- ent-c4d1 connects three records through pairwise matches: "Catherine's" and "Katherine's" resolve via the homophone rule, and "Catherines Books LLC" resolves to "Catherine's Books" via the substring rule. The three records collapse into a single cluster because their matches form a chain. The cluster holds three different business_id values (5001, 5042, 5108), so it is non-compliant and contributes three rows to the Source Records.
- ent-e8f2 connects two records ("ACME Boxing" ↔ "ACME Boxes") where fuzzy text similarity is high enough to clear the threshold. The cluster holds two different business_id values (5314, 5331), so it contributes two rows.
Shape Anomaly
2,341 records were resolved to 2,294 distinct entities (composite threshold 0.7: business_name (w=1.0+TF)). 2 of those entities are assigned more than one value of business_id
Flowchart
graph TD
A["Filter: none, evaluate all businesses"] --> B["Compute pair similarity on business_name<br/>(fuzzy text + substring + phonetic overrides)"]
B --> C{"Composite score ≥ 0.7?"}
C -->|No| D["Pair is not a match"]
C -->|Yes| E["Connect both records in the same cluster"]
E --> F["Connected components collapse transitive chains<br/>(A↔B and B↔C become {A,B,C})"]
F --> G["Assign cluster _qualytics_entity_id"]
G --> H{"Cluster has more than one business_id?"}
H -->|No| I["Cluster is compliant"]
H -->|Yes| J["Flag cluster. Source Records gets one row per distinct business_id."]
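The "connected components collapse transitive chains" step can be illustrated with a small union-find. This is a generic sketch of the technique, not the platform's code: the function name is hypothetical, business_id values are used as record identifiers for brevity, the matched pairs are supplied by hand rather than computed, and record 5400 is a made-up singleton.

```python
def connected_components(record_ids, matched_pairs):
    """Group records into clusters so that A-B and B-C matches yield {A, B, C}."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


ids = [5001, 5042, 5108, 5314, 5331, 5400]  # 5400 is a hypothetical unmatched record
pairs = [
    (5001, 5042),  # "Catherine's Books" <-> "Katherine's Books"
    (5001, 5108),  # "Catherine's Books" <-> "Catherines Books LLC"
    (5314, 5331),  # "ACME Boxing" <-> "ACME Boxes"
]

clusters = connected_components(ids, pairs)
print(sorted(sorted(c) for c in clusters))
# [[5001, 5042, 5108], [5314, 5331], [5400]]
```

Note that 5042 and 5108 end up in the same cluster even though that pair was never matched directly; the shared link through 5001 is enough.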
The situation: Your contacts table is multi-tenant. The same email is allowed to repeat across tenants (different people, different organizations) but never within a single tenant. You want to resolve contacts within each tenant by full_name and email, and tenant_id should act as a hard boundary so cross-tenant collisions never trigger an anomaly.
Check configuration
| Field | Value |
|---|---|
| Rule | Entity Resolution |
| Distinction Field | contact_id |
| Target Fields | tenant_id (Numeric, exact: blocking), full_name (String, fuzzy, pair_substrings: true, weight: 1.0), email (String, fuzzy, weight: 1.0) |
| Composite Match Threshold | 0.8 |
| Filter | status = 'active' |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (ingestion on-call) |
| Tags | multi-tenant, contacts |
| Additional Metadata | jira: DATA-4311 |
| Description | Within a tenant, contacts with similar name and email must share a contact_id |
Payload
{
"description": "Within a tenant, contacts with similar name and email must share a contact_id",
"rule": "entityResolution",
"fields": [],
"container_id": 318,
"filter": "status = 'active'",
"properties": {
"distinct_field_name": "contact_id",
"composite_match_threshold": 0.8,
"target_fields": [
{
"upickle_type": "NumericTargetField",
"field_name": "tenant_id",
"match_type": "exact"
},
{
"upickle_type": "StringTargetField",
"field_name": "full_name",
"match_type": "fuzzy",
"pair_substrings": true,
"pair_homophones": false,
"consider_term_frequency": false,
"weight": 1.0
},
{
"upickle_type": "StringTargetField",
"field_name": "email",
"match_type": "fuzzy",
"pair_substrings": false,
"pair_homophones": false,
"consider_term_frequency": false,
"weight": 1.0
}
]
},
"tags": ["multi-tenant", "contacts"],
"additional_metadata": {"jira": "DATA-4311"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 24
}
Why the blocking field matters
Because tenant_id uses match_type: exact, the platform never compares a contact in tenant 7 against a contact in tenant 12. Two contacts named "Jane Doe" with the same email in different tenants are treated as completely separate entities and never cluster together. Blocking on tenant_id is both a correctness guarantee and a performance optimization: candidate pairs are constrained to rows that share the same tenant.
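A minimal sketch of blocking, assuming candidate pairs are generated per block: rows are grouped by the blocking field and pairs are formed only inside each group. The function name `candidate_pairs` is hypothetical, and the real platform may generate candidates differently.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows, blocking_field):
    """Yield only pairs of rows that share the same blocking value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_field]].append(row)
    for block in blocks.values():
        # Pairs are formed within a block, never across blocks.
        yield from combinations(block, 2)


contacts = [
    {"tenant_id": 7, "contact_id": "c-991", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 7, "contact_id": "c-1042", "full_name": "J. Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 12, "contact_id": "c-2071", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
]

pairs = [(a["contact_id"], b["contact_id"]) for a, b in candidate_pairs(contacts, "tenant_id")]
print(pairs)  # [('c-991', 'c-1042')]
```

Without blocking, three rows would produce three candidate pairs; with blocking on tenant_id, only the single within-tenant pair is ever scored, so c-2071 can never be linked to the tenant 7 rows.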
Source Records (filtered to status = 'active')
| _qualytics_entity_id | tenant_id | contact_id | full_name | email |
|---|---|---|---|---|
| ent-7a2b | 7 | c-991 | Jane Doe | jane.doe@acme.com |
| ent-7a2b | 7 | c-1042 | J. Doe | jane.doe@acme.com |
The contact c-2071 (tenant_id = 12, full_name = "Jane Doe", email = "jane.doe@acme.com") does not appear in the Source Records: it is in a different tenant, so blocking prevents it from being paired with the rows in tenant 7. It is its own cluster, with its own _qualytics_entity_id, and is compliant.
Shape Anomaly
4,820 records were resolved to 4,791 distinct entities (blocked on [tenant_id], composite threshold 0.8: full_name (w=1.0), email (w=1.0)). 1 of those entities is assigned more than one value of contact_id
Flowchart
graph TD
A["Apply filter: status = 'active'"] --> B["Block pairs by tenant_id<br/>(records in different tenants never compared)"]
B --> C["Score remaining pairs on full_name + email"]
C --> D{"Composite score ≥ 0.8?"}
D -->|No| E["Pair is not a match"]
D -->|Yes| F["Connect both records in the same cluster (per tenant)"]
F --> G["Assign cluster _qualytics_entity_id"]
G --> H{"Cluster has more than one contact_id?"}
H -->|No| I["Cluster is compliant"]
H -->|Yes| J["Flag cluster. Source Records gets one row per distinct contact_id."]
Related
- Introduction: formal definition, target field types, field scope, and general/anomaly properties.
- How It Works: full semantics, clustering behavior, threshold tuning, and source-records behavior.
- API: payload shape and field notes for creating an Entity Resolution check programmatically.
- FAQ: short answers to the most frequent questions.