Data Diff Check Examples
Three real-world scenarios that show how the Data Diff check is typically used in production: validating a nightly backup, comparing a system-to-system data transfer, and verifying a post-migration mirror with a scoped filter that suppresses unmatched reference rows.
In every example, the Sample Data table is laid out exactly as the Comparison Source Records view in the Qualytics app renders the anomaly: each row has a status, an identifier, and a Left (target) / Right (reference) pair for every compared field. Only the differing right-side cell carries the anomalous-cell highlight, and missing values render as the literal text missing.
The situation: Your ORDERS table is the system of record for an e-commerce site. Every night at midnight a backup job replicates it to ORDERS_BACKUP in the warehouse datastore. The Sales team builds the morning report from ORDERS_BACKUP, so any row missing from the backup silently understates revenue. A Data Diff check validates that the backup is complete every morning.
Check configuration
| Field | Value |
|---|---|
| Rule | Data Diff |
| Fields | order_id, customer_id, amount, order_date |
| Row Identifiers | order_id |
| Reference Datastore | Warehouse |
| Reference Container | ORDERS_BACKUP |
| Filter | (none) |
| Comparators | (none) |
| Status | Active |
| Tags | critical, backup |
| Additional Metadata | jira: DATA-3201 |
| Description | ORDERS must match ORDERS_BACKUP every morning |
Payload
{
"description": "ORDERS must match ORDERS_BACKUP every morning",
"rule": "dataDiff",
"fields": ["order_id", "customer_id", "amount", "order_date"],
"container_id": 145,
"filter": null,
"properties": {
"ref_datastore_id": 22,
"ref_container_id": 803,
"id_field_names": ["order_id"]
},
"tags": ["critical", "backup"],
"additional_metadata": {"jira": "DATA-3201"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 12
}
Sample Data (as rendered in Comparison Source Records)
| Row Status | order_id | customer_id (Left → Right) | amount (Left → Right) | order_date (Left → Right) |
|---|---|---|---|---|
| removed | 10248 | 4451 → missing | 19.00 → missing | 2025-01-15 → missing |
| removed | 10249 | 4452 → missing | 12.00 → missing | 2025-01-15 → missing |
| removed | 10250 | 4453 → missing | 29.00 → missing | 2025-01-15 → missing |
What gets flagged
Three orders that exist on the target side (ORDERS) have no matching order_id on the reference side (ORDERS_BACKUP). Each row is reported with status removed and every right-side field shows the literal missing. The backup job failed to copy these rows into ORDERS_BACKUP.
Shape Anomaly
There are 3 records that differ between ORDERS_BACKUP (1247 records) and ORDERS (1250 records) in Warehouse
Flowchart
graph TD
A["No filter, evaluate all target rows"] --> B["Read reference container ORDERS_BACKUP"]
B --> C["Match rows by order_id"]
C --> D{"Any target row<br/>without a reference match?"}
D -->|No| E["All target rows pass"]
D -->|Yes| F["Flag each unmatched row as 'removed'.<br/>3 orders missing from backup."]
Equivalent SQL
The situation: A clinic's scheduling system writes appointments to a SQL Server APPOINTMENTS table, and an hourly ETL copies them into the billing system's APPOINTMENTS_BILLING table. Patient name and doctor are reliably copied, but the insurance plan code is rewritten in transit by a translation layer. A Data Diff check confirms that for each appointment_id, patient_name, doctor, and insurance_plan agree on both sides, and surfaces any field that the transfer mangled.
Check configuration
| Field | Value |
|---|---|
| Rule | Data Diff |
| Fields | patient_name, doctor, insurance_plan |
| Row Identifiers | appointment_id |
| Reference Datastore | Billing |
| Reference Container | APPOINTMENTS_BILLING |
| Filter | (none) |
| Comparators | (none) |
| Status | Active |
| Tags | clinical, billing |
| Additional Metadata | jira: DATA-4101 |
| Description | Appointments must match between scheduling and billing |
Payload
{
"description": "Appointments must match between scheduling and billing",
"rule": "dataDiff",
"fields": ["patient_name", "doctor", "insurance_plan"],
"container_id": 212,
"filter": null,
"properties": {
"ref_datastore_id": 31,
"ref_container_id": 912,
"id_field_names": ["appointment_id"]
},
"tags": ["clinical", "billing"],
"additional_metadata": {"jira": "DATA-4101"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 18
}
Sample Data (as rendered in Comparison Source Records)
| Row Status | appointment_id | patient_name (Left → Right) | doctor (Left → Right) | insurance_plan (Left → Right) |
|---|---|---|---|---|
| changed | A-7781 | Robert Martinez → Robert Martinez | Dr. Smith → Dr. Smith | BlueCross Plan A → BlueCross Plan B |
What gets flagged
The pair (A-7781) exists on both sides, so the row's status is changed rather than added or removed. Only insurance_plan differs, so only that right-side cell is highlighted; patient_name and doctor agree and render plainly on both sides.
Shape Anomaly
There is 1 record that differs between APPOINTMENTS_BILLING (2841 records) and APPOINTMENTS (2841 records) in Billing
Flowchart
graph TD
A["No filter, evaluate all target rows"] --> B["Read reference container APPOINTMENTS_BILLING"]
B --> C["Match rows by appointment_id"]
C --> D{"Do all compared fields<br/>match on each pair?"}
D -->|Yes| E["Row passes"]
D -->|No| F["Flag the row as 'changed'.<br/>insurance_plan differs for A-7781."]
Equivalent SQL
-- Rows the Data Diff check would flag as 'changed' for appointments.
SELECT
t.appointment_id,
t.patient_name AS left_patient_name, r.patient_name AS right_patient_name,
t.doctor AS left_doctor, r.doctor AS right_doctor,
t.insurance_plan AS left_insurance_plan, r.insurance_plan AS right_insurance_plan
FROM appointments t
INNER JOIN appointments_billing r ON t.appointment_id = r.appointment_id
WHERE
t.patient_name <> r.patient_name
OR t.doctor <> r.doctor
OR t.insurance_plan <> r.insurance_plan
ORDER BY t.appointment_id;
The situation: The Customers domain is being migrated from a legacy CRM to a new system. The legacy CUSTOMERS_LEGACY table is being phased out, and the new CUSTOMERS_NEW table is the system of record. For the cutover window, both systems are written to, and the migration team needs daily confirmation that today's writes are identical on both sides. Yesterday's rows are intentionally allowed to differ (the legacy system kept being patched), so the check is scoped to created_at = current_date() on the target.
Check configuration
| Field | Value |
|---|---|
| Rule | Data Diff |
| Fields | customer_name, email, tier |
| Row Identifiers | customer_id |
| Reference Datastore | Legacy CRM |
| Reference Container | CUSTOMERS_LEGACY |
| Filter | created_at = current_date() |
| Diff Change Types | ["removed", "changed"] (ignore added-row noise from the unfiltered legacy reference) |
| Comparators | (none) |
| Status | Active |
| Tags | migration, customers |
| Additional Metadata | jira: DATA-5240 |
| Description | Today's new customers must mirror the legacy CRM |
Payload
{
"description": "Today's new customers must mirror the legacy CRM",
"rule": "dataDiff",
"fields": ["customer_name", "email", "tier"],
"container_id": 318,
"filter": "created_at = current_date()",
"properties": {
"ref_datastore_id": 17,
"ref_container_id": 661,
"id_field_names": ["customer_id"],
"diff_change_types": ["removed", "changed"]
},
"tags": ["migration", "customers"],
"additional_metadata": {"jira": "DATA-5240"},
"anomaly_message_field": null,
"template_id": null,
"status": "Active",
"owner_id": 7,
"default_anomaly_assignee_id": 24
}
Why the filter and diff_change_types matter
The filter runs before the reference is read into the comparison. The target is scoped to today's CUSTOMERS_NEW rows, and only those rows are matched against CUSTOMERS_LEGACY. A customer created yesterday with a mismatch in either system stays out of scope and does not contribute to the anomaly.
The filter does not narrow the reference, however. Read in full, CUSTOMERS_LEGACY carries every legacy customer ever created, so any legacy row not matched by a CUSTOMERS_NEW row created today would normally surface as added. That noise has nothing to do with the cutover verification, so the check sets diff_change_types to ["removed", "changed"], and only today's NEW customers missing from LEGACY, or today's NEW customers whose values differ in LEGACY, are reported.
Sample Data (filtered to created_at = current_date(), as rendered in Comparison Source Records)
| Row Status | customer_id | customer_name (Left → Right) | email (Left → Right) | tier (Left → Right) |
|---|---|---|---|---|
| changed | C-9012 | Lin Wei → Lin Wei | lin@example.com → lin@example.com | gold → silver |
What gets flagged
C-9012 exists on both sides today; only tier differs, so the row is changed and only the right-side tier cell is highlighted. No removed rows are reported in this batch because every target row from today found a match in the legacy CRM. Legacy customers not present in today's CUSTOMERS_NEW writes would normally be reported as added, but diff_change_types filters that status out for this check.
Shape Anomaly
There is 1 record that differs between CUSTOMERS_LEGACY (38241 records) and CUSTOMERS_NEW (412 records) in Legacy CRM [filter: created_at = current_date()]
Flowchart
graph TD
A["Apply filter: created_at = current_date()"] --> B["Read reference container CUSTOMERS_LEGACY"]
B --> C["Match filtered target rows<br/>against reference by customer_id"]
C --> D{"Any row removed or changed?<br/>(added is suppressed by diff_change_types)"}
D -->|No| E["All filtered rows pass"]
D -->|Yes| F["Flag each differing row with its status:<br/>C-9012 (changed)."]
Equivalent SQL
-- Rows the Data Diff check would flag for today's customers,
-- with diff_change_types = ["removed", "changed"]
-- (legacy-only rows are intentionally not flagged).
WITH target AS (
SELECT * FROM customers_new WHERE created_at = current_date()
)
SELECT
CASE
WHEN r.customer_id IS NULL THEN 'removed'
ELSE 'changed'
END AS row_status,
t.customer_id,
t.customer_name AS left_customer_name, r.customer_name AS right_customer_name,
t.email AS left_email, r.email AS right_email,
t.tier AS left_tier, r.tier AS right_tier
FROM target t
LEFT JOIN customers_legacy r ON t.customer_id = r.customer_id
WHERE
r.customer_id IS NULL
OR t.customer_name <> r.customer_name
OR t.email <> r.email
OR t.tier <> r.tier
ORDER BY t.customer_id;
Related
- Introduction: formal definition, field scope, and general/anomaly properties.
- How It Works: full semantics, Row Identifiers, Comparators, and edge cases.
- API: payload shape and field notes for creating a Data Diff check programmatically.
- FAQ: short answers to the most frequent questions.