Entity Resolution Examples

Three real-world scenarios that show how the Entity Resolution check is typically used in production: deduplicating a customer master by name and address, consolidating businesses by name with phonetic and substring matching, and matching contacts within a tenant boundary using a blocking field.

Each scenario shows the Source Records that would appear in the resulting Shape Anomaly. Source Records surface one example row per distinct value of the distinction field within each non-compliant cluster, alongside the cluster identifier _qualytics_entity_id so the cluster boundaries are visible.
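The selection rule described above can be sketched in a few lines of Python. This is only an illustration, assuming rows that already carry a `_qualytics_entity_id`; the function name `source_records`, the `ent-zzzz` cluster, and the choice of the first row seen as the "example row" are assumptions, not documented platform behavior.

```python
from collections import defaultdict

def source_records(rows, distinction_field):
    """Pick one example row per distinct distinction value in each non-compliant cluster."""
    # entity id -> {distinction value: first row seen with that value}
    clusters = defaultdict(dict)
    for row in rows:
        by_value = clusters[row["_qualytics_entity_id"]]
        by_value.setdefault(row[distinction_field], row)

    picked = []
    for by_value in clusters.values():
        if len(by_value) > 1:  # more than one distinct value: non-compliant
            picked.extend(by_value.values())
    return picked

rows = [
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1001, "full_name": "Alice Cohen"},
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1001, "full_name": "Alice Cohen"},
    {"_qualytics_entity_id": "ent-a01f", "customer_id": 1057, "full_name": "Alice C."},
    # Hypothetical compliant cluster: a single customer_id, so it is skipped.
    {"_qualytics_entity_id": "ent-zzzz", "customer_id": 1200, "full_name": "Bob Lee"},
]
result = source_records(rows, "customer_id")
```

Here `result` holds two rows for `ent-a01f` (one per distinct `customer_id`) and nothing for the compliant cluster.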

The situation: Your customers table is the master record for downstream billing. Each row has a customer_id that should be the single identifier per customer, but historic ingestions from multiple sources have produced near-duplicate records with slightly different spellings of the same person's full_name and address. You want Entity Resolution to surface customers where two different customer_id values plausibly describe the same person.

Check configuration

| Field | Value |
| --- | --- |
| Rule | Entity Resolution |
| Distinction Field | customer_id |
| Target Fields | full_name (String, fuzzy, pair_substrings: true, weight: 1.0), address (String, fuzzy, weight: 0.8) |
| Composite Match Threshold | 0.75 |
| Filter | (none) |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (customer-data steward) |
| Tags | pii, master-data |
| Additional Metadata | jira: DATA-4101 |
| Description | Customers with similar names and addresses must share a customer_id |

Payload

{
    "description": "Customers with similar names and addresses must share a customer_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 145,
    "filter": null,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "address",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 0.8
            }
        ]
    },
    "tags": ["pii", "master-data"],
    "additional_metadata": {"jira": "DATA-4101"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 12
}
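When several checks share the same string-field settings, the `target_fields` entries can be generated rather than hand-written. The sketch below uses only the keys that appear in the payload above; the helper name `string_target_field` is hypothetical, and no endpoint or client call is shown because none is given here.

```python
import json

def string_target_field(field_name, weight, pair_substrings=False,
                        pair_homophones=False, consider_term_frequency=False):
    """Build one fuzzy string target field using the keys shown in the payload."""
    return {
        "upickle_type": "StringTargetField",
        "field_name": field_name,
        "match_type": "fuzzy",
        "pair_substrings": pair_substrings,
        "pair_homophones": pair_homophones,
        "consider_term_frequency": consider_term_frequency,
        "weight": weight,
    }

# Mirrors the "properties" object of the customer-master payload.
properties = {
    "distinct_field_name": "customer_id",
    "composite_match_threshold": 0.75,
    "target_fields": [
        string_target_field("full_name", 1.0, pair_substrings=True),
        string_target_field("address", 0.8),
    ],
}

print(json.dumps(properties, indent=4))
```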

Source Records

| _qualytics_entity_id | customer_id | full_name | address |
| --- | --- | --- | --- |
| ent-a01f | 1001 | Alice Cohen | 142 Maple St |
| ent-a01f | 1057 | Alice C. | 142 Maple Street |
| ent-b73c | 1102 | Catherine Wu | 87 Elm Avenue |
| ent-b73c | 1184 | Catherine Wu | 87 Elm Ave. |

What gets flagged

Two non-compliant clusters appear in the source records:

  • ent-a01f resolved two records the platform considers the same customer ("Alice Cohen, 142 Maple St" and "Alice C., 142 Maple Street"). The fuzzy match on full_name reaches the threshold because pair_substrings promotes "Alice C." against "Alice Cohen", and the address pair is near-identical. The cluster holds two different customer_id values (1001 and 1057), so it is non-compliant.
  • ent-b73c resolved two records with identical names and only a punctuation difference in address. The cluster holds two different customer_id values (1102 and 1184), so it is also non-compliant.

Each non-compliant cluster contributes one row per distinct customer_id to the Source Records panel, for a total of four rows in this scan.
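To make the threshold test concrete, here is a minimal sketch of a weighted composite score. Two things are assumptions rather than documented behavior: the composite is taken to be a weight-normalized average of per-field similarities, and Python's `difflib.SequenceMatcher` stands in for the platform's fuzzy matcher. The `pair_substrings` promotion is not modeled, so the numbers are illustrative only.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in per-field similarity in [0, 1]; not the platform's matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def composite_score(left: dict, right: dict, weights: dict) -> float:
    """Weight-normalized average of per-field similarities (assumed rule)."""
    total = sum(weights.values())
    return sum(w * similarity(left[f], right[f]) for f, w in weights.items()) / total

WEIGHTS = {"full_name": 1.0, "address": 0.8}
THRESHOLD = 0.75

alice_a = {"full_name": "Alice Cohen", "address": "142 Maple St"}
alice_b = {"full_name": "Alice C.", "address": "142 Maple Street"}
catherine = {"full_name": "Catherine Wu", "address": "87 Elm Avenue"}

same_person = composite_score(alice_a, alice_b, WEIGHTS) >= THRESHOLD
different_person = composite_score(alice_a, catherine, WEIGHTS) >= THRESHOLD
```

Under these assumptions the two Alice records clear the 0.75 threshold, while Alice and Catherine do not.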

Shape Anomaly

184 records were resolved to 173 distinct entities (composite threshold 0.75: full_name (w=1.0), address (w=0.8)). 2 of those entities are assigned more than one value of customer_id

Flowchart

graph TD
    A["Filter: none, evaluate all customers"] --> B["Score every candidate pair on full_name + address"]
    B --> C{"Composite score ≥ 0.75?"}
    C -->|No| D["Pair is not a match"]
    C -->|Yes| E["Connect both records in the same cluster"]
    E --> F["Assign cluster _qualytics_entity_id"]
    F --> G{"Cluster has more than one customer_id?"}
    G -->|No| H["Cluster is compliant"]
    G -->|Yes| I["Flag cluster. Source Records gets one row per distinct customer_id."]

The situation: Your businesses table aggregates business records from three vendor feeds. The same business often appears under variant spellings of business_name ("Catherine's Books", "Katherine's Books", "Catherines Books LLC") and each feed assigns its own business_id. You want to surface businesses where the platform believes the names describe the same entity but business_id disagrees.

Check configuration

| Field | Value |
| --- | --- |
| Rule | Entity Resolution |
| Distinction Field | business_id |
| Target Fields | business_name (String, fuzzy, pair_substrings: true, pair_homophones: true, consider_term_frequency: true, weight: 1.0) |
| Composite Match Threshold | 0.7 |
| Filter | (none) |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (business-master steward) |
| Tags | consolidation, vendor-feeds |
| Additional Metadata | jira: DATA-4207 |
| Description | Similar business names should resolve to the same business_id |

Payload

{
    "description": "Similar business names should resolve to the same business_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 212,
    "filter": null,
    "properties": {
        "distinct_field_name": "business_id",
        "composite_match_threshold": 0.7,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "business_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": true,
                "consider_term_frequency": true,
                "weight": 1.0
            }
        ]
    },
    "tags": ["consolidation", "vendor-feeds"],
    "additional_metadata": {"jira": "DATA-4207"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 18
}

Source Records

| _qualytics_entity_id | business_id | business_name |
| --- | --- | --- |
| ent-c4d1 | 5001 | Catherine's Books |
| ent-c4d1 | 5042 | Katherine's Books |
| ent-c4d1 | 5108 | Catherines Books LLC |
| ent-e8f2 | 5314 | ACME Boxing |
| ent-e8f2 | 5331 | ACME Boxes |

What gets flagged

Two non-compliant clusters appear in the source records:

  • ent-c4d1 connects three records through pairwise matches: "Catherine's" and "Katherine's" resolve via the homophone rule, and "Catherines Books LLC" resolves to "Catherine's Books" via the substring rule. The three records collapse into a single cluster because their matches form a chain. The cluster holds three different business_id values (5001, 5042, 5108), so it is non-compliant and contributes three rows to the Source Records.
  • ent-e8f2 connects two records ("ACME Boxing" and "ACME Boxes") where fuzzy text similarity is high enough to clear the threshold. The cluster holds two different business_id values (5314, 5331), so it contributes two rows.
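The chaining behavior in the first cluster is what a connected-components pass produces. The sketch below uses a small union-find over matched pairs. It assumes the pairwise matches are already known; the row indices, the singleton business_id 5400, and the function name `connected_components` are hypothetical.

```python
def connected_components(row_ids, matched_pairs):
    """Group row ids into clusters by merging every matched pair (union-find)."""
    parent = {r: r for r in row_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in row_ids:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# row index -> business_id (5400 is a hypothetical unmatched record)
business_id = {0: 5001, 1: 5042, 2: 5108, 3: 5314, 4: 5331, 5: 5400}
# 0-1 matched by the homophone rule, 2-0 by the substring rule, 3-4 by fuzzy text.
matched = [(0, 1), (2, 0), (3, 4)]

clusters = connected_components(business_id, matched)
flagged = [
    sorted({business_id[r] for r in c})
    for c in clusters
    if len({business_id[r] for r in c}) > 1  # more than one distinct business_id
]
```

Rows 1 and 2 are never matched directly, yet they land in the same cluster because both match row 0.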

Shape Anomaly

2,341 records were resolved to 2,294 distinct entities (composite threshold 0.7: business_name (w=1.0+TF)). 2 of those entities are assigned more than one value of business_id

Flowchart

graph TD
    A["Filter: none, evaluate all businesses"] --> B["Compute pair similarity on business_name<br/>(fuzzy text + substring + phonetic overrides)"]
    B --> C{"Composite score ≥ 0.7?"}
    C -->|No| D["Pair is not a match"]
    C -->|Yes| E["Connect both records in the same cluster"]
    E --> F["Connected components collapse transitive chains<br/>(A↔B and B↔C become {A,B,C})"]
    F --> G["Assign cluster _qualytics_entity_id"]
    G --> H{"Cluster has more than one business_id?"}
    H -->|No| I["Cluster is compliant"]
    H -->|Yes| J["Flag cluster. Source Records gets one row per distinct business_id."]

The situation: Your contacts table is multi-tenant. The same email is allowed to repeat across tenants (different people, different organizations) but never within a single tenant. You want to resolve contacts within each tenant by full_name and email, and tenant_id should act as a hard boundary so cross-tenant collisions never trigger an anomaly.

Check configuration

| Field | Value |
| --- | --- |
| Rule | Entity Resolution |
| Distinction Field | contact_id |
| Target Fields | tenant_id (Numeric, exact: blocking), full_name (String, fuzzy, pair_substrings: true, weight: 1.0), email (String, fuzzy, weight: 1.0) |
| Composite Match Threshold | 0.8 |
| Filter | status = 'active' |
| Custom Anomaly Description | Off |
| Status | Active |
| Owner | (check creator) |
| Anomaly Assignee | (ingestion on-call) |
| Tags | multi-tenant, contacts |
| Additional Metadata | jira: DATA-4311 |
| Description | Within a tenant, contacts with similar name and email must share a contact_id |

Payload

{
    "description": "Within a tenant, contacts with similar name and email must share a contact_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 318,
    "filter": "status = 'active'",
    "properties": {
        "distinct_field_name": "contact_id",
        "composite_match_threshold": 0.8,
        "target_fields": [
            {
                "upickle_type": "NumericTargetField",
                "field_name": "tenant_id",
                "match_type": "exact"
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "email",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            }
        ]
    },
    "tags": ["multi-tenant", "contacts"],
    "additional_metadata": {"jira": "DATA-4311"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 24
}

Why the blocking field matters

Because tenant_id uses match_type: exact, it acts as a blocking field: the platform never compares a contact in tenant 7 against a contact in tenant 12. Two contacts named "Jane Doe" with the same email in different tenants are treated as completely separate entities and never cluster together. Blocking on tenant_id is both a correctness guarantee and a performance optimization, because candidate pairs are constrained to rows that share the same tenant.
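Blocking can be pictured as grouping rows by the blocking field and generating candidate pairs only inside each group. The following is a minimal sketch of that idea, not the platform's implementation; the function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows, blocking_field):
    """Yield pairs of rows that share the same value of the blocking field."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_field]].append(row)
    for block in blocks.values():
        # Pairs are formed only within a block, never across blocks.
        yield from combinations(block, 2)

rows = [
    {"tenant_id": 7, "contact_id": "c-991", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 7, "contact_id": "c-1042", "full_name": "J. Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 12, "contact_id": "c-2071", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
]

pairs = [(a["contact_id"], b["contact_id"]) for a, b in candidate_pairs(rows, "tenant_id")]
```

Only the two tenant 7 contacts are paired for scoring. The tenant 12 contact is never compared with them, even though its name and email are identical to the first row.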

Source Records (filtered to status = 'active')

| _qualytics_entity_id | tenant_id | contact_id | full_name | email |
| --- | --- | --- | --- | --- |
| ent-7a2b | 7 | c-991 | Jane Doe | jane.doe@acme.com |
| ent-7a2b | 7 | c-1042 | J. Doe | jane.doe@acme.com |

The contact c-2071 (tenant_id = 12, full_name = "Jane Doe", email = "jane.doe@acme.com") does not appear in the Source Records: it is in a different tenant, so blocking prevents it from being paired with the rows in tenant 7. It is its own cluster, with its own _qualytics_entity_id, and is compliant.

Shape Anomaly

4,820 records were resolved to 4,791 distinct entities (blocked on [tenant_id], composite threshold 0.8: full_name (w=1.0), email (w=1.0)). 1 of those entities is assigned more than one value of contact_id

Flowchart

graph TD
    A["Apply filter: status = 'active'"] --> B["Block pairs by tenant_id<br/>(records in different tenants never compared)"]
    B --> C["Score remaining pairs on full_name + email"]
    C --> D{"Composite score ≥ 0.8?"}
    D -->|No| E["Pair is not a match"]
    D -->|Yes| F["Connect both records in the same cluster (per tenant)"]
    F --> G["Assign cluster _qualytics_entity_id"]
    G --> H{"Cluster has more than one contact_id?"}
    H -->|No| I["Cluster is compliant"]
    H -->|Yes| J["Flag cluster. Source Records gets one row per distinct contact_id."]

Related pages:

  • Introduction: formal definition, target field types, field scope, and general/anomaly properties.
  • How It Works: full semantics, clustering behavior, threshold tuning, and source-records behavior.
  • API: payload shape and field notes for creating an Entity Resolution check programmatically.
  • FAQ: short answers to the most frequent questions.