
Entity Resolution API

The Entity Resolution check is created and managed through the standard Quality Checks API by setting rule to "entityResolution". The check is multi-field: instead of listing fields under fields, you add one entry per evaluated field to properties.target_fields and name the distinguishing field in properties.distinct_field_name. The check's own fields array is populated automatically from target_fields, so it can be sent as an empty list.

Tip

For complete API documentation, including request and response schemas, visit the API docs.

Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/quality-checks | Create a new Entity Resolution check. |
| GET | /api/quality-checks/{id} | Retrieve an Entity Resolution check by ID. |
| PUT | /api/quality-checks/{id} | Update an existing Entity Resolution check. |
| DELETE | /api/quality-checks/{id} | Delete (or archive) an Entity Resolution check. |

Permission: Author (or above) on the target container's team for POST, PUT, and DELETE; Reporter (or above) for GET.

Payload Example

Create a multi-field Entity Resolution check on full_name (fuzzy) and address (fuzzy), distinguished by customer_id, with POST /api/quality-checks:

{
    "description": "Customers with similar names and addresses must share a customer_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 145,
    "filter": null,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "address",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 0.8
            }
        ]
    },
    "tags": ["pii", "master-data"],
    "additional_metadata": {"jira": "DATA-4101"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 12
}
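
As a rough illustration of how the payload above might be submitted, the sketch below builds the POST request with Python's standard library. The base URL, the bearer-token Authorization header, and the helper name build_create_request are assumptions made for this example; check the API docs for the authentication scheme your deployment actually uses.

```python
import json
import urllib.request


def build_create_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/quality-checks request.

    Assumption: the API accepts a JSON body and a bearer token; neither
    detail is confirmed by this page.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/quality-checks",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
        },
    )


# Trimmed-down version of the payload shown above.
payload = {
    "description": "Customers with similar names and addresses must share a customer_id",
    "rule": "entityResolution",
    "fields": [],  # computed from properties.target_fields
    "container_id": 145,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {"upickle_type": "StringTargetField", "field_name": "full_name",
             "match_type": "fuzzy", "pair_substrings": True, "weight": 1.0},
            {"upickle_type": "StringTargetField", "field_name": "address",
             "match_type": "fuzzy", "weight": 0.8},
        ],
    },
}

request = build_create_request("https://example.invalid", "YOUR_TOKEN", payload)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the server.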

Top-Level Field Notes

| Field | Required | Notes |
| --- | --- | --- |
| description | Yes | Free-text description shown in the UI. |
| rule | Yes | Must be "entityResolution". |
| fields | Yes | Send []. The list of evaluated fields is computed from properties.target_fields. |
| container_id | Yes | ID of the container (table or file) the check runs against. |
| filter | No | Spark SQL WHERE expression. Applied before entity resolution runs, so only filtered rows are clustered. Send null for no filter. |
| properties.distinct_field_name | Yes | Name of the field that must hold a single value within each resolved entity cluster. Accepted types: Integral, Fractional, Boolean, String, Date, Timestamp. |
| properties.composite_match_threshold | Yes | Fractional value between 0.0 and 1.0. Pairs whose weighted composite score is greater than or equal to this value are treated as matches. Default 0.7. |
| properties.target_fields | Yes | Non-empty array. Each entry configures one field with its match_type, weight, and (for strings) optional substring/homophone/term-frequency knobs. See Target Field Notes below. |
| tags | No | List of tag names applied to the check for filtering and organization. |
| additional_metadata | No | Free-form key-value pairs (typically links to catalog, tickets, governance records). |
| anomaly_message_field | No | Name of a source-record field whose value should be used as the anomaly message instead of the system-generated one. Not applicable to Entity Resolution: because the rule emits only Shape Anomalies (which use a fixed message template), this field is silently ignored. Send null. |
| template_id | No | ID of a Check Template to associate the check with. null if not using a template. |
| status | No | "Active" (default) or "Draft". Draft checks are not evaluated by Scans. |
| owner_id | No | ID of the user who owns the check. Defaults to the user creating the check when omitted. |
| default_anomaly_assignee_id | No | ID of the user automatically assigned to anomalies produced by the check. |
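
To make the role of composite_match_threshold concrete, here is a small sketch of how a weighted composite score could be compared against the threshold. This page does not spell out how per-field scores are combined, so the weighted-average formula and the function names composite_score and is_match are assumptions for illustration only.

```python
def composite_score(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field scores (assumed formula, not confirmed)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one field must have a positive weight")
    weighted_sum = sum(weights[name] * field_scores[name] for name in weights)
    return weighted_sum / total_weight


def is_match(field_scores: dict, weights: dict, threshold: float = 0.7) -> bool:
    """A pair matches when its composite score is >= the threshold."""
    return composite_score(field_scores, weights) >= threshold


# Weights taken from the payload example: full_name 1.0, address 0.8.
weights = {"full_name": 1.0, "address": 0.8}

# Hypothetical per-field similarity scores for one candidate pair.
scores = {"full_name": 0.9, "address": 0.6}

print(is_match(scores, weights, threshold=0.75))
```

Under this assumed formula the pair scores (1.0 × 0.9 + 0.8 × 0.6) / 1.8, roughly 0.767, which clears a 0.75 threshold; lowering the address score to 0.5 gives roughly 0.722, which does not.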

Coverage is not supported

Entity Resolution does not accept a coverage value. The rule evaluates clusters as compliant or non-compliant; there is no fractional tolerance to set.

Target Field Notes

Each entry in target_fields is one of three shapes, identified by its upickle_type discriminator: "StringTargetField", "NumericTargetField", or "DateTimeTargetField". The platform validates that the declared upickle_type matches the actual data type of the field on the container.
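
The type check described above can be mirrored client-side before a payload is sent. The sketch below is a hypothetical pre-flight helper, not the platform's validator: the mapping follows the field types listed in the tables on this page, while container_schema (a plain dict of field name to data type) and check_target_fields are names invented for the example.

```python
# Data type on the container -> target field shape, per the tables on this page.
SHAPE_BY_DATA_TYPE = {
    "String": "StringTargetField",
    "Integral": "NumericTargetField",
    "Fractional": "NumericTargetField",
    "Date": "DateTimeTargetField",
    "Timestamp": "DateTimeTargetField",
}


def check_target_fields(target_fields: list, container_schema: dict) -> list:
    """Return a list of problems; an empty list means every entry is consistent."""
    problems = []
    for entry in target_fields:
        name = entry["field_name"]
        data_type = container_schema.get(name)
        if data_type is None:
            problems.append(f"{name}: field not found on container")
            continue
        expected = SHAPE_BY_DATA_TYPE.get(data_type)
        if expected is None:
            problems.append(f"{name}: data type {data_type} has no target field shape")
        elif entry["upickle_type"] != expected:
            problems.append(f"{name}: declared {entry['upickle_type']}, expected {expected}")
    return problems


# Hypothetical schema for the container used in the payload example.
schema = {"full_name": "String", "registered_at": "Timestamp"}
entries = [
    {"upickle_type": "StringTargetField", "field_name": "full_name"},
    {"upickle_type": "NumericTargetField", "field_name": "registered_at"},
]
print(check_target_fields(entries, schema))
```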

String Target Field

{
    "upickle_type": "StringTargetField",
    "field_name": "full_name",
    "match_type": "fuzzy",
    "pair_substrings": true,
    "pair_homophones": false,
    "consider_term_frequency": false,
    "weight": 1.0
}

| Field | Required | Notes |
| --- | --- | --- |
| upickle_type | Yes | Must be "StringTargetField". Identifies the shape so the platform can deserialize this entry. |
| field_name | Yes | Name of the string field on the container. |
| match_type | No | "fuzzy" (default) or "exact". exact turns the field into a blocking pre-filter: pairs disagreeing on this field are never scored. |
| pair_substrings | No | When true, a pair where one string contains the other scores 1.0 on this field. Default false. Applies only to fuzzy. |
| pair_homophones | No | When true, a pair whose values sound alike (phonetic similarity) scores 1.0 on this field. Default false. Applies only to fuzzy. |
| consider_term_frequency | No | When true, rare tokens carry more weight than common tokens. Default false. Applies only to fuzzy. |
| weight | No | Non-negative number. Controls this field's contribution to the composite score. Default 1.0. Ignored when match_type is exact. |
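
The effect of pair_substrings can be sketched as a short-circuit in front of the fuzzy comparison. The platform's actual fuzzy metric is not documented on this page, so difflib.SequenceMatcher is used here purely as a stand-in, and the case-insensitive normalization and the function name string_field_score are likewise assumptions.

```python
from difflib import SequenceMatcher


def string_field_score(a: str, b: str, pair_substrings: bool = False) -> float:
    """Hypothetical per-field score for a fuzzy StringTargetField."""
    # Assumption: comparison ignores case and surrounding whitespace.
    left, right = a.strip().lower(), b.strip().lower()

    # pair_substrings: containment in either direction scores a full 1.0.
    if pair_substrings and left and right and (left in right or right in left):
        return 1.0

    # Stand-in similarity measure in the range 0.0 to 1.0.
    return SequenceMatcher(None, left, right).ratio()


with_flag = string_field_score("Jon Smith", "Jon Smith Jr", pair_substrings=True)
without_flag = string_field_score("Jon Smith", "Jon Smith Jr", pair_substrings=False)
print(with_flag, without_flag)
```

With the flag on, the contained name scores 1.0; with it off, the same pair falls back to the similarity measure and scores below 1.0.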

Numeric Target Field

{
    "upickle_type": "NumericTargetField",
    "field_name": "phone_number",
    "match_type": "absolute",
    "offset": 0.0,
    "weight": 1.0
}

| Field | Required | Notes |
| --- | --- | --- |
| upickle_type | Yes | Must be "NumericTargetField". Identifies the shape so the platform can deserialize this entry. |
| field_name | Yes | Name of the numeric field (Integral or Fractional) on the container. |
| match_type | No | "absolute" (default), "relative", or "exact". "absolute" compares with a fixed offset; "relative" compares with a percentage tolerance (e.g. 0.05 for 5%); "exact" turns the field into a blocking pre-filter. |
| offset | No | Non-negative numeric tolerance. With match_type: "absolute", the pair scores 1.0 if abs(a − b) ≤ offset, otherwise 0.0. With match_type: "relative", the value is interpreted as a fraction (e.g. 0.05 for 5%). Default 0.0. |
| weight | No | Non-negative number controlling contribution to the composite. Default 1.0. Ignored when match_type is exact. |
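
The two tolerance modes can be illustrated with a short scoring function. The absolute rule follows the table above directly; for the relative mode this page does not say which value the fraction is measured against, so the sketch assumes the larger magnitude of the pair. The function name numeric_field_score is hypothetical.

```python
def numeric_field_score(a: float, b: float,
                        match_type: str = "absolute",
                        offset: float = 0.0) -> float:
    """Hypothetical per-field score for a NumericTargetField (1.0 or 0.0)."""
    diff = abs(a - b)
    if match_type == "absolute":
        # Documented rule: match when |a - b| <= offset.
        return 1.0 if diff <= offset else 0.0
    if match_type == "relative":
        # Assumption: the fraction is taken relative to the larger magnitude.
        scale = max(abs(a), abs(b))
        return 1.0 if diff <= offset * scale else 0.0
    # "exact" fields act as a blocking pre-filter and are never scored.
    raise ValueError(f"match_type {match_type!r} is not scored")


print(numeric_field_score(100, 103, "absolute", offset=5))
print(numeric_field_score(100, 110, "relative", offset=0.05))
```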

Datetime Target Field

{
    "upickle_type": "DateTimeTargetField",
    "field_name": "registered_at",
    "match_type": "offset",
    "offset_seconds": 3600,
    "weight": 1.0
}

| Field | Required | Notes |
| --- | --- | --- |
| upickle_type | Yes | Must be "DateTimeTargetField". Identifies the shape so the platform can deserialize this entry. |
| field_name | Yes | Name of the Date or Timestamp field on the container. |
| match_type | No | "offset" (default), "granularity", or "exact". "offset" compares within a number of seconds; "granularity" compares whether both timestamps fall in the same bucket; "exact" turns the field into a blocking pre-filter. |
| offset_seconds | No | Non-negative integer tolerance in seconds. Applies when match_type is "offset": the pair scores 1.0 if the two timestamps are within offset_seconds of each other. Default 0. |
| granularity | No | Bucket applied before comparison. Applies when match_type is "granularity". Accepted values: "Day", "Week", "Month", "Year". Omit (or send null) when match_type is not "granularity". |
| weight | No | Non-negative number controlling contribution to the composite. Default 1.0. Ignored when match_type is exact. |
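
The difference between the offset and granularity modes is easiest to see side by side. The sketch below is an illustration under assumptions, not the platform's implementation: in particular, "Week" is bucketed by ISO week here, which this page does not confirm, and the names bucket_of and datetime_field_score are invented for the example.

```python
from datetime import datetime


def bucket_of(ts: datetime, granularity: str) -> tuple:
    """Map a timestamp to a comparable bucket key."""
    if granularity == "Day":
        return (ts.year, ts.month, ts.day)
    if granularity == "Week":
        iso = ts.isocalendar()  # assumption: ISO weeks (year, week number)
        return (iso[0], iso[1])
    if granularity == "Month":
        return (ts.year, ts.month)
    if granularity == "Year":
        return (ts.year,)
    raise ValueError(f"unknown granularity {granularity!r}")


def datetime_field_score(a: datetime, b: datetime,
                         match_type: str = "offset",
                         offset_seconds: int = 0,
                         granularity: str = None) -> float:
    """Hypothetical per-field score for a DateTimeTargetField (1.0 or 0.0)."""
    if match_type == "offset":
        return 1.0 if abs((a - b).total_seconds()) <= offset_seconds else 0.0
    if match_type == "granularity":
        return 1.0 if bucket_of(a, granularity) == bucket_of(b, granularity) else 0.0
    # "exact" fields act as a blocking pre-filter and are never scored.
    raise ValueError(f"match_type {match_type!r} is not scored")


late = datetime(2024, 1, 1, 23, 59)
early = datetime(2024, 1, 2, 0, 1)
print(datetime_field_score(late, early, "offset", offset_seconds=3600))
print(datetime_field_score(late, early, "granularity", granularity="Day"))
```

The two timestamps are two minutes apart, so they match under a one-hour offset, yet they fall on different calendar days and therefore do not match under "Day" granularity.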
  • Introduction: formal definition, target field types, field scope, and general/anomaly properties.
  • How It Works: full semantics, clustering behavior, threshold tuning, and source-records behavior.
  • Examples: three production scenarios with sample data, source records, and resulting anomalies.
  • FAQ: short answers to the most frequent questions.