Skip to content

Data Diff Check Examples

Three real-world scenarios that show how the Data Diff check is typically used in production: validating a nightly backup, comparing a system-to-system data transfer, and verifying a post-migration mirror with a scoped filter that suppresses unmatched reference rows.

In every example, the Sample Data table is laid out exactly as the Comparison Source Records view in the Qualytics app renders the anomaly: each row has a status, an identifier, and a Left (target) / Right (reference) pair for every compared field. Only the differing right-side cell carries the anomalous-cell highlight, and missing values render as the literal text missing.

The situation: Your ORDERS table is the system of record for an e-commerce site. Every night at midnight a backup job replicates it to ORDERS_BACKUP in the warehouse datastore. The Sales team builds the morning report from ORDERS_BACKUP, so any row missing from the backup silently understates revenue. A Data Diff check validates that the backup is complete every morning.

Check configuration

Field Value
Rule Data Diff
Fields order_id, customer_id, amount, order_date
Row Identifiers order_id
Reference Datastore Warehouse
Reference Container ORDERS_BACKUP
Filter (none)
Comparators (none)
Status Active
Tags critical, backup
Additional Metadata jira: DATA-3201
Description ORDERS must match ORDERS_BACKUP every morning

Payload

{
    "description": "ORDERS must match ORDERS_BACKUP every morning",
    "rule": "dataDiff",
    "fields": ["order_id", "customer_id", "amount", "order_date"],
    "container_id": 145,
    "filter": null,
    "properties": {
        "ref_datastore_id": 22,
        "ref_container_id": 803,
        "id_field_names": ["order_id"]
    },
    "tags": ["critical", "backup"],
    "additional_metadata": {"jira": "DATA-3201"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 12
}

Sample Data (as rendered in Comparison Source Records)

Row Status order_id customer_id (Left → Right) amount (Left → Right) order_date (Left → Right)
removed 10248 4451 → missing 19.00 → missing 2025-01-15 → missing
removed 10249 4452 → missing 12.00 → missing 2025-01-15 → missing
removed 10250 4453 → missing 29.00 → missing 2025-01-15 → missing

What gets flagged

Three orders that exist on the target side (ORDERS) have no matching order_id on the reference side (ORDERS_BACKUP). Each row is reported with status removed and every right-side field shows the literal missing. The backup job failed to copy these rows into ORDERS_BACKUP.

Shape Anomaly

There are 3 records that differ between ORDERS_BACKUP (1247 records) and ORDERS (1250 records) in Warehouse

Flowchart

graph TD
    A["No filter, evaluate all target rows"] --> B["Read reference container ORDERS_BACKUP"]
    B --> C["Match rows by order_id"]
    C --> D{"Any target row<br/>without a reference match?"}
    D -->|No| E["All target rows pass"]
    D -->|Yes| F["Flag each unmatched row as 'removed'.<br/>3 orders missing from backup."]

Equivalent SQL

-- Rows the Data Diff check would flag as 'removed' for the ORDERS backup.
SELECT t.order_id, t.customer_id, t.amount, t.order_date
FROM orders t
LEFT JOIN orders_backup r ON t.order_id = r.order_id
WHERE r.order_id IS NULL
ORDER BY t.order_id;

The situation: A clinic's scheduling system writes appointments to a SQL Server APPOINTMENTS table, and an hourly ETL copies them into the billing system's APPOINTMENTS_BILLING table. Patient name and doctor are reliably copied, but the insurance plan code is rewritten in transit by a translation layer. A Data Diff check confirms that for each appointment_id, patient_name, doctor, and insurance_plan agree on both sides, and surfaces any field that the transfer mangled.

Check configuration

Field Value
Rule Data Diff
Fields patient_name, doctor, insurance_plan
Row Identifiers appointment_id
Reference Datastore Billing
Reference Container APPOINTMENTS_BILLING
Filter (none)
Comparators (none)
Status Active
Tags clinical, billing
Additional Metadata jira: DATA-4101
Description Appointments must match between scheduling and billing

Payload

{
    "description": "Appointments must match between scheduling and billing",
    "rule": "dataDiff",
    "fields": ["patient_name", "doctor", "insurance_plan"],
    "container_id": 212,
    "filter": null,
    "properties": {
        "ref_datastore_id": 31,
        "ref_container_id": 912,
        "id_field_names": ["appointment_id"]
    },
    "tags": ["clinical", "billing"],
    "additional_metadata": {"jira": "DATA-4101"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 18
}

Sample Data (as rendered in Comparison Source Records)

Row Status appointment_id patient_name (Left → Right) doctor (Left → Right) insurance_plan (Left → Right)
changed A-7781 Robert Martinez → Robert Martinez Dr. Smith → Dr. Smith BlueCross Plan A → BlueCross Plan B

What gets flagged

The pair (A-7781) exists on both sides, so the row's status is changed rather than added or removed. Only insurance_plan differs, so only that right-side cell is highlighted; patient_name and doctor agree and render plainly on both sides.

Shape Anomaly

There is 1 record that differs between APPOINTMENTS_BILLING (2841 records) and APPOINTMENTS (2841 records) in Billing

Flowchart

graph TD
    A["No filter, evaluate all target rows"] --> B["Read reference container APPOINTMENTS_BILLING"]
    B --> C["Match rows by appointment_id"]
    C --> D{"Do all compared fields<br/>match on each pair?"}
    D -->|Yes| E["Row passes"]
    D -->|No| F["Flag the row as 'changed'.<br/>insurance_plan differs for A-7781."]

Equivalent SQL

-- Rows the Data Diff check would flag as 'changed' for appointments.
SELECT
    t.appointment_id,
    t.patient_name AS left_patient_name,  r.patient_name AS right_patient_name,
    t.doctor         AS left_doctor,         r.doctor         AS right_doctor,
    t.insurance_plan AS left_insurance_plan, r.insurance_plan AS right_insurance_plan
FROM appointments t
INNER JOIN appointments_billing r ON t.appointment_id = r.appointment_id
WHERE
       t.patient_name   <> r.patient_name
    OR t.doctor         <> r.doctor
    OR t.insurance_plan <> r.insurance_plan
ORDER BY t.appointment_id;

The situation: The Customers domain is being migrated from a legacy CRM to a new system. The legacy CUSTOMERS_LEGACY table is being phased out, and the new CUSTOMERS_NEW table is the system of record. For the cutover window, both systems are written to, and the migration team needs daily confirmation that today's writes are identical on both sides. Yesterday's rows are intentionally allowed to differ (the legacy system kept being patched), so the check is scoped to created_at = current_date() on the target.

Check configuration

Field Value
Rule Data Diff
Fields customer_name, email, tier
Row Identifiers customer_id
Reference Datastore Legacy CRM
Reference Container CUSTOMERS_LEGACY
Filter created_at = current_date()
Diff Change Types ["removed", "changed"] (ignore added-row noise from the unfiltered legacy reference)
Comparators (none)
Status Active
Tags migration, customers
Additional Metadata jira: DATA-5240
Description Today's new customers must mirror the legacy CRM

Payload

{
    "description": "Today's new customers must mirror the legacy CRM",
    "rule": "dataDiff",
    "fields": ["customer_name", "email", "tier"],
    "container_id": 318,
    "filter": "created_at = current_date()",
    "properties": {
        "ref_datastore_id": 17,
        "ref_container_id": 661,
        "id_field_names": ["customer_id"],
        "diff_change_types": ["removed", "changed"]
    },
    "tags": ["migration", "customers"],
    "additional_metadata": {"jira": "DATA-5240"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 24
}

Why the filter and diff_change_types matter

The filter runs before the reference is read into the comparison. The target is scoped to today's CUSTOMERS_NEW rows, and only those rows are matched against CUSTOMERS_LEGACY. A customer created yesterday with a mismatch in either system stays out of scope and does not contribute to the anomaly.

The filter does not narrow the reference, however. Read in full, CUSTOMERS_LEGACY carries every legacy customer ever created, so any legacy row not matched by a CUSTOMERS_NEW row created today would normally surface as added. That noise has nothing to do with the cutover verification, so the check sets diff_change_types to ["removed", "changed"], and only today's NEW customers missing from LEGACY, or today's NEW customers whose values differ in LEGACY, are reported.

Sample Data (filtered to created_at = current_date(), as rendered in Comparison Source Records)

Row Status customer_id customer_name (Left → Right) email (Left → Right) tier (Left → Right)
changed C-9012 Lin Wei → Lin Wei lin@example.com → lin@example.com gold → silver

What gets flagged

C-9012 exists on both sides today; only tier differs, so the row is changed and only the right-side tier cell is highlighted. No removed rows are reported in this batch because every target row from today found a match in the legacy CRM. Legacy customers not present in today's CUSTOMERS_NEW writes would normally be reported as added, but diff_change_types filters that status out for this check.

Shape Anomaly

There is 1 record that differs between CUSTOMERS_LEGACY (38241 records) and CUSTOMERS_NEW (412 records) in Legacy CRM [filter: created_at = current_date()]

Flowchart

graph TD
    A["Apply filter: created_at = current_date()"] --> B["Read reference container CUSTOMERS_LEGACY"]
    B --> C["Match filtered target rows<br/>against reference by customer_id"]
    C --> D{"Any row removed or changed?<br/>(added is suppressed by diff_change_types)"}
    D -->|No| E["All filtered rows pass"]
    D -->|Yes| F["Flag each differing row with its status:<br/>C-9012 (changed)."]

Equivalent SQL

-- Rows the Data Diff check would flag for today's customers,
-- with diff_change_types = ["removed", "changed"]
-- (legacy-only rows are intentionally not flagged).
WITH target AS (
    SELECT * FROM customers_new WHERE created_at = current_date()
)
SELECT
    CASE
        WHEN r.customer_id IS NULL THEN 'removed'
        ELSE 'changed'
    END AS row_status,
    t.customer_id,
    t.customer_name AS left_customer_name, r.customer_name AS right_customer_name,
    t.email         AS left_email,         r.email         AS right_email,
    t.tier          AS left_tier,          r.tier          AS right_tier
FROM target t
LEFT JOIN customers_legacy r ON t.customer_id = r.customer_id
WHERE
       r.customer_id IS NULL
    OR t.customer_name <> r.customer_name
    OR t.email         <> r.email
    OR t.tier          <> r.tier
ORDER BY t.customer_id;
  • Introduction: formal definition, field scope, and general/anomaly properties.
  • How It Works: full semantics, Row Identifiers, Comparators, and edge cases.
  • API: payload shape and field notes for creating a Data Diff check programmatically.
  • FAQ: short answers to the most frequent questions.