Entity Resolution Examples

Three real-world scenarios that show how the Entity Resolution check is typically used in production: deduplicating a customer master by name and address, consolidating businesses by name with phonetic and substring matching, and matching contacts within a tenant boundary using a blocking field.

Each scenario shows the Source Records that would appear in the resulting Shape Anomaly. Source Records surface one example row per distinct value of the distinction field within each non-compliant cluster, alongside the cluster identifier _qualytics_entity_id so the cluster boundaries are visible.

Customer Master Deduplication

The situation: Your customers table is the master record for downstream billing. Each row has a customer_id that should be the single identifier per customer, but historic ingestions from multiple sources have produced near-duplicate records with slightly different spellings of the same person's full_name and address. You want Entity Resolution to surface customers where two different customer_id values plausibly describe the same person.

Check configuration

Field	Value
Rule	Entity Resolution
Distinction Field	`customer_id`
Target Fields	`full_name` (String, `fuzzy`, `pair_substrings: true`, `weight: 1.0`), `address` (String, `fuzzy`, `weight: 0.8`)
Composite Match Threshold	`0.75`
Filter	(none)
Custom Anomaly Description	Off
Status	Active
Owner	(check creator)
Anomaly Assignee	(customer-data steward)
Tags	`pii`, `master-data`
Additional Metadata	`jira: DATA-4101`
Description	Customers with similar names and addresses must share a customer_id

Payload

{
    "description": "Customers with similar names and addresses must share a customer_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 145,
    "filter": null,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "address",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 0.8
            }
        ]
    },
    "tags": ["pii", "master-data"],
    "additional_metadata": {"jira": "DATA-4101"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 12
}

Source Records

_qualytics_entity_id	customer_id	full_name	address
ent-a01f	1001	Alice Cohen	142 Maple St
ent-a01f	1057	Alice C.	142 Maple Street
ent-b73c	1102	Catherine Wu	87 Elm Avenue
ent-b73c	1184	Catherine Wu	87 Elm Ave.

What gets flagged

Two non-compliant clusters appear in the source records:

ent-a01f resolved two records the platform considers the same customer ("Alice Cohen, 142 Maple St" ↔ "Alice C., 142 Maple Street"). The fuzzy match on full_name reaches the threshold because pair_substrings promotes "Alice C." against "Alice Cohen", and the address pair is near-identical. The cluster holds two different customer_id values (1001 and 1057), so it is non-compliant.
ent-b73c resolved two records with identical names and only a punctuation difference in address. The cluster holds two different customer_id values (1102 and 1184), so it is also non-compliant.

Each non-compliant cluster contributes one row per distinct customer_id to the Source Records panel, for a total of four rows in this scan.
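
The selection rule for Source Records can be sketched in Python. This is a minimal illustration rather than the platform's implementation: the helper name `source_records` and the in-memory list of dicts are assumptions, and two rows (a repeat of customer_id 1001 and a compliant cluster `ent-zzzz`) are hypothetical, added only to show what gets skipped.

```python
from collections import defaultdict

def source_records(rows, distinct_field, entity_field="_qualytics_entity_id"):
    """Return one example row per distinct value of distinct_field,
    restricted to clusters that hold more than one distinct value."""
    clusters = defaultdict(dict)  # entity id -> {distinct value: first row seen}
    for row in rows:
        clusters[row[entity_field]].setdefault(row[distinct_field], row)
    selected = []
    for examples in clusters.values():
        if len(examples) > 1:  # more than one distinct value: non-compliant
            selected.extend(examples.values())
    return selected

rows = [
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1001, "full_name": "Alice Cohen"},
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1001, "full_name": "Alice Cohen"},  # hypothetical repeat
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1057, "full_name": "Alice C."},
    {"_qualytics_entity_id": "ent-b73c", "customer_id": 1102, "full_name": "Catherine Wu"},
    {"_qualytics_entity_id": "ent-b73c", "customer_id": 1184, "full_name": "Catherine Wu"},
    {"_qualytics_entity_id": "ent-zzzz", "customer_id": 1200, "full_name": "Bob Lee"},  # hypothetical compliant cluster
]
flagged = source_records(rows, "customer_id")
print([r["customer_id"] for r in flagged])  # [1001, 1057, 1102, 1184]
```

The repeated 1001 row collapses into a single example, and the single-id cluster contributes nothing, which leaves the four rows shown in the table above.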

Shape Anomaly

184 records were resolved to 173 distinct entities (composite threshold 0.75: full_name (w=1.0), address (w=0.8)). 2 of those entities are assigned more than one value of customer_id

Flowchart

graph TD
    A["Filter: none, evaluate all customers"] --> B["Score every candidate pair on full_name + address"]
    B --> C{"Composite score ≥ 0.75?"}
    C -->|No| D["Pair is not a match"]
    C -->|Yes| E["Connect both records in the same cluster"]
    E --> F["Assign cluster _qualytics_entity_id"]
    F --> G{"Cluster has more than one customer_id?"}
    G -->|No| H["Cluster is compliant"]
    G -->|Yes| I["Flag cluster. Source Records gets one row per distinct customer_id."]
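
The scoring and threshold steps in the flowchart can be illustrated with a short Python sketch. This page does not specify how per-field similarities are computed or combined, so everything here is an assumption: `difflib.SequenceMatcher` stands in for the fuzzy matcher, `pair_substrings` is modeled as treating one normalized string contained in the other as a full match, and the composite score is taken to be the weight-normalized average of the field scores.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase and drop punctuation, so "Alice C." becomes "alice c".
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def field_similarity(a, b, pair_substrings=False):
    a, b = normalize(a), normalize(b)
    if pair_substrings and a and b and (a in b or b in a):
        return 1.0  # assumed behavior of the substring override
    return SequenceMatcher(None, a, b).ratio()

def composite_score(rec_a, rec_b, target_fields):
    # Assumed combination rule: weighted average of per-field similarities.
    total_weight = sum(f["weight"] for f in target_fields)
    weighted = sum(
        f["weight"] * field_similarity(
            rec_a[f["field_name"]],
            rec_b[f["field_name"]],
            f.get("pair_substrings", False),
        )
        for f in target_fields
    )
    return weighted / total_weight

target_fields = [
    {"field_name": "full_name", "weight": 1.0, "pair_substrings": True},
    {"field_name": "address", "weight": 0.8, "pair_substrings": False},
]
a = {"full_name": "Alice Cohen", "address": "142 Maple St"}
b = {"full_name": "Alice C.", "address": "142 Maple Street"}
score = composite_score(a, b, target_fields)
print(score >= 0.75)  # True: under these assumptions the pair clears the threshold
```

Under a weighted average, a lower weight on address means an address mismatch pulls the composite down less than an equally large name mismatch would.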

Business Name Consolidation with Homophones

The situation: Your businesses table aggregates business records from three vendor feeds. The same business often appears under variant spellings of business_name ("Catherine's Books", "Katherine's Books", "Catherines Books LLC") and each feed assigns its own business_id. You want to surface businesses where the platform believes the names describe the same entity but business_id disagrees.

Check configuration

Field	Value
Rule	Entity Resolution
Distinction Field	`business_id`
Target Fields	`business_name` (String, `fuzzy`, `pair_substrings: true`, `pair_homophones: true`, `consider_term_frequency: true`, `weight: 1.0`)
Composite Match Threshold	`0.7`
Filter	(none)
Custom Anomaly Description	Off
Status	Active
Owner	(check creator)
Anomaly Assignee	(business-master steward)
Tags	`consolidation`, `vendor-feeds`
Additional Metadata	`jira: DATA-4207`
Description	Similar business names should resolve to the same business_id

Payload

{
    "description": "Similar business names should resolve to the same business_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 212,
    "filter": null,
    "properties": {
        "distinct_field_name": "business_id",
        "composite_match_threshold": 0.7,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "business_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": true,
                "consider_term_frequency": true,
                "weight": 1.0
            }
        ]
    },
    "tags": ["consolidation", "vendor-feeds"],
    "additional_metadata": {"jira": "DATA-4207"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 18
}

Source Records

_qualytics_entity_id	business_id	business_name
ent-c4d1	5001	Catherine's Books
ent-c4d1	5042	Katherine's Books
ent-c4d1	5108	Catherines Books LLC
ent-e8f2	5314	ACME Boxing
ent-e8f2	5331	ACME Boxes

What gets flagged

Two non-compliant clusters appear in the source records:

ent-c4d1 connects three records through pairwise matches: "Catherine's" and "Katherine's" resolve via the homophone rule, and "Catherines Books LLC" resolves to "Catherine's Books" via the substring rule. The three records collapse into a single cluster because their matches form a chain. The cluster holds three different business_id values (5001, 5042, 5108), so it is non-compliant and contributes three rows to the Source Records.
ent-e8f2 connects two records ("ACME Boxing" ↔ "ACME Boxes") where fuzzy text similarity is high enough to clear the threshold. The cluster holds two different business_id values (5314, 5331), so it contributes two rows.
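
The way pairwise matches chain into one cluster can be sketched as a connected-components computation. The union-find structure below is one common way to do this and is an assumption about technique, not a statement of how the platform implements it. The function name `cluster` is hypothetical, and each record is identified by its business_id purely for brevity.

```python
def cluster(record_ids, matched_pairs):
    """Group records into connected components using union-find."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

business_ids = [5001, 5042, 5108, 5314, 5331]
matched_pairs = [
    (5001, 5042),  # "Catherine's Books" ~ "Katherine's Books" (homophone)
    (5001, 5108),  # "Catherine's Books" ~ "Catherines Books LLC" (substring)
    (5314, 5331),  # "ACME Boxing" ~ "ACME Boxes" (fuzzy)
]
clusters = cluster(business_ids, matched_pairs)
non_compliant = [c for c in clusters if len(c) > 1]
print(non_compliant)
```

Records 5042 and 5108 are never matched to each other directly; they land in the same cluster only because both match 5001.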

Shape Anomaly

2,341 records were resolved to 2,294 distinct entities (composite threshold 0.7: business_name (w=1.0+TF)). 2 of those entities are assigned more than one value of business_id

Flowchart

graph TD
    A["Filter: none, evaluate all businesses"] --> B["Compute pair similarity on business_name<br/>(fuzzy text + substring + phonetic overrides)"]
    B --> C{"Composite score ≥ 0.7?"}
    C -->|No| D["Pair is not a match"]
    C -->|Yes| E["Connect both records in the same cluster"]
    E --> F["Connected components collapse transitive chains<br/>(A↔B and B↔C become {A,B,C})"]
    F --> G["Assign cluster _qualytics_entity_id"]
    G --> H{"Cluster has more than one business_id?"}
    H -->|No| I["Cluster is compliant"]
    H -->|Yes| J["Flag cluster. Source Records gets one row per distinct business_id."]

Tenant-Scoped Resolution with a Blocking Field

The situation: Your contacts table is multi-tenant. The same email is allowed to repeat across tenants (different people, different organizations) but never within a single tenant. You want to resolve contacts within each tenant by full_name and email, and tenant_id should act as a hard boundary so cross-tenant collisions never trigger an anomaly.

Check configuration

Field	Value
Rule	Entity Resolution
Distinction Field	`contact_id`
Target Fields	`tenant_id` (Numeric, `exact`: blocking), `full_name` (String, `fuzzy`, `pair_substrings: true`, `weight: 1.0`), `email` (String, `fuzzy`, `weight: 1.0`)
Composite Match Threshold	`0.8`
Filter	`status = 'active'`
Custom Anomaly Description	Off
Status	Active
Owner	(check creator)
Anomaly Assignee	(ingestion on-call)
Tags	`multi-tenant`, `contacts`
Additional Metadata	`jira: DATA-4311`
Description	Within a tenant, contacts with similar name and email must share a contact_id

Payload

{
    "description": "Within a tenant, contacts with similar name and email must share a contact_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 318,
    "filter": "status = 'active'",
    "properties": {
        "distinct_field_name": "contact_id",
        "composite_match_threshold": 0.8,
        "target_fields": [
            {
                "upickle_type": "NumericTargetField",
                "field_name": "tenant_id",
                "match_type": "exact"
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "email",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            }
        ]
    },
    "tags": ["multi-tenant", "contacts"],
    "additional_metadata": {"jira": "DATA-4311"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 24
}

Why the blocking field matters

Because tenant_id uses match_type: exact, the platform never compares a contact in tenant 7 against a contact in tenant 12. Two contacts named "Jane Doe" with the same email in different tenants are treated as completely separate entities and never cluster together. Blocking on tenant_id is both a correctness guarantee and a performance optimization: candidate pairs are constrained to rows that share the same tenant.
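
A minimal Python sketch of blocking, assuming candidate pairs are generated by grouping rows on the blocking field and pairing only within each group. The helper name `candidate_pairs` is hypothetical, and the rows are abbreviated versions of the contacts discussed in this scenario.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows, blocking_field):
    """Yield pairs of rows that share the same blocking value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_field]].append(row)
    for block in blocks.values():
        # Pairs are formed only inside a block, never across blocks.
        yield from combinations(block, 2)

contacts = [
    {"tenant_id": 7, "contact_id": "c-991", "full_name": "Jane Doe"},
    {"tenant_id": 7, "contact_id": "c-1042", "full_name": "J. Doe"},
    {"tenant_id": 12, "contact_id": "c-2071", "full_name": "Jane Doe"},
]
pairs = [(a["contact_id"], b["contact_id"]) for a, b in candidate_pairs(contacts, "tenant_id")]
print(pairs)  # [('c-991', 'c-1042')]
```

Without blocking, three rows would produce three candidate pairs. With it, only the single same-tenant pair is ever scored, which is where both the correctness guarantee and the performance saving come from.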

Source Records (filtered to status = 'active')

_qualytics_entity_id	tenant_id	contact_id	full_name	email
ent-7a2b	7	c-991	Jane Doe	jane.doe@acme.com
ent-7a2b	7	c-1042	J. Doe	jane.doe@acme.com

The contact c-2071 (tenant_id = 12, full_name = "Jane Doe", email = "jane.doe@acme.com") does not appear in the Source Records: it is in a different tenant, so blocking prevents it from being paired with the rows in tenant 7. It is its own cluster, with its own _qualytics_entity_id, and is compliant.

Shape Anomaly

4,820 records were resolved to 4,791 distinct entities (blocked on [tenant_id], composite threshold 0.8: full_name (w=1.0), email (w=1.0)). 1 of those entities is assigned more than one value of contact_id
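
The three Shape Anomaly messages on this page follow a common pattern, which the sketch below reconstructs from the check's `properties`. This is an inference from the examples shown here, not the platform's message template: the function name `describe_shape_anomaly` is hypothetical, and treating every `match_type: exact` field as a blocking field is an assumption.

```python
def describe_shape_anomaly(properties, records, entities, flagged):
    fields = properties["target_fields"]
    # Assumption: exact-match fields are reported as blocking fields.
    blocking = [f["field_name"] for f in fields if f["match_type"] == "exact"]
    scored = [f for f in fields if f["match_type"] != "exact"]
    parts = []
    for f in scored:
        tf = "+TF" if f.get("consider_term_frequency") else ""
        parts.append(f"{f['field_name']} (w={f['weight']}{tf})")
    prefix = f"blocked on [{', '.join(blocking)}], " if blocking else ""
    verb = "is" if flagged == 1 else "are"
    return (
        f"{records:,} records were resolved to {entities:,} distinct entities "
        f"({prefix}composite threshold {properties['composite_match_threshold']}: "
        f"{', '.join(parts)}). {flagged} of those entities {verb} assigned "
        f"more than one value of {properties['distinct_field_name']}"
    )

properties = {
    "distinct_field_name": "contact_id",
    "composite_match_threshold": 0.8,
    "target_fields": [
        {"field_name": "tenant_id", "match_type": "exact"},
        {"field_name": "full_name", "match_type": "fuzzy", "weight": 1.0},
        {"field_name": "email", "match_type": "fuzzy", "weight": 1.0},
    ],
}
print(describe_shape_anomaly(properties, 4820, 4791, 1))
```

Called with the counts from this scenario, the function reproduces the message above, including the `blocked on [tenant_id]` prefix and the singular "is".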

Flowchart

graph TD
    A["Apply filter: status = 'active'"] --> B["Block pairs by tenant_id<br/>(records in different tenants never compared)"]
    B --> C["Score remaining pairs on full_name + email"]
    C --> D{"Composite score ≥ 0.8?"}
    D -->|No| E["Pair is not a match"]
    D -->|Yes| F["Connect both records in the same cluster (per tenant)"]
    F --> G["Assign cluster _qualytics_entity_id"]
    G --> H{"Cluster has more than one contact_id?"}
    H -->|No| I["Cluster is compliant"]
    H -->|Yes| J["Flag cluster. Source Records gets one row per distinct contact_id."]

Related

Introduction: formal definition, target field types, field scope, and general/anomaly properties.
How It Works: full semantics, clustering behavior, threshold tuning, and source-records behavior.
API: payload shape and field notes for creating an Entity Resolution check programmatically.
FAQ: short answers to the most frequent questions.