Skip to content

How Data Diff Checks Work

This page covers everything the Data Diff check does, in detail: how it matches rows between target and reference datasets, the three diff statuses it produces, how Row Identifiers, Comparators, and diff_change_types change the evaluation, and how the resulting Comparison Source Records are rendered in the UI.

If you only need a quick reference, the Introduction page covers the formal definition, field scope, and general/anomaly properties. This page is the detailed reference.

How the Check Evaluates the Two Datasets

Every Data Diff check follows the same five-step evaluation flow:

  1. Apply the filter clause. If the check has a filter set, only the rows in the target container that match the filter expression continue to the next step. The reference container is read in full (the filter scopes the target, not the reference).
  2. Read the reference container. The platform reads the comparison fields from the reference table or file in the configured reference datastore.
  3. Match rows. With Row Identifiers configured, rows are matched by the combination of identifier values: target rows whose identifier values match a reference row are paired up. Without Row Identifiers, the platform performs a symmetrical set difference on the full set of compared fields.
  4. Compare fields on each matched pair. For every paired row, each listed field is compared between left (target) and right (reference). Comparators (numeric, duration, string) define the tolerance for the comparison; without a Comparator the comparison is strict equality.
  5. Emit a single Shape Anomaly summarizing the diffs. Every row that differs (added, removed, or changed) becomes part of the anomaly's source records. The dataset-level violation is summarized in a single Shape Anomaly message and the per-row detail is surfaced through the Comparison Source Records view.

The order matters: the filter is applied before matching, so rows the filter excludes on the target side cannot pair with any reference row and cannot contribute to the anomaly.

The Three Diff Statuses

Each differing row carries one of three status values. The status is what drives the Comparison Source Records UI and is exposed in the anomaly payload.

Status Meaning
removed The row's identifier exists only on the left (target). The reference dataset is missing this row.
added The row's identifier exists only on the right (reference). The reference dataset has a row the target does not.
changed The same identifier exists on both sides, but at least one listed field has a different value. The differing fields are reported per row.

Without Row Identifiers, only added and removed are produced, because the platform has no key to match rows on. A row that differs in even one field becomes one removed (the row from target) and one added (the row from reference), in a pure symmetrical set difference.

Row Identifiers matter for clarity

Setting Row Identifiers turns most discrepancies into a single changed row rather than a pair of removed/added rows. This is the only way to get a per-field diff (left vs right on the same row) in the Comparison Source Records view. Pick Row Identifiers that uniquely identify each row in both target and reference.

Restricting Anomalies by Status

The diff_change_types property restricts which of the three diff statuses are allowed to produce an anomaly. By default, every differing row (added, removed, or changed) contributes to the anomaly count. Setting diff_change_types to a subset of statuses keeps the comparison running over all rows but only flags the diffs whose status appears in the list.

Value omitted or null Behavior
Property absent from properties All three statuses (added, removed, changed) fire anomalies.
["added"] Only rows present on the reference side but missing on the target are flagged.
["removed"] Only rows present on the target side but missing on the reference are flagged.
["changed"] Only rows present on both sides whose listed fields differ are flagged.
Any combination of the three values Only the listed statuses fire.
[] (empty list) Rejected at the API with HTTP 422; at least one status must be selected.

Typical use cases:

  • Reference is intentionally a superset of the target. The target is a production table whose reference (a staging or canonical source) carries extra QA or onboarding rows that production is not expected to receive. Set diff_change_types to ["removed", "changed"] to suppress added-row noise.
  • Reference is intentionally a subset of the target. A reporting view may exclude soft-deleted or out-of-scope rows. Set diff_change_types to ["added", "changed"] to suppress removed-row noise.
  • Only the per-field diff matters. When row presence on both sides is guaranteed by an upstream contract and only value drift is interesting, set diff_change_types to ["changed"].

The property is dataDiff-only. Sending it on an isReplicaOf check is rejected at the API with HTTP 422. The property is editable through PUT /api/quality-checks/{id}, so the subset can be tuned without recreating the check.

Row Identifiers and Passthrough Fields

Two optional properties shape what the Comparison Source Records view looks like:

Row Identifiers

A list of fields that form the compound key the platform uses to pair target and reference rows. The identifier tuple must exist on both sides. Typical choices are primary keys (customer_id, order_id) or a composite of business-key columns (order_id, line_number).

When Row Identifiers are set:

  • Matched rows produce a changed diff if any listed field differs.
  • Unmatched target rows produce removed.
  • Unmatched reference rows produce added.

Passthrough Fields

Extra fields the platform should carry into the source-records output for context, even though they are not part of the comparison. Passthrough Fields appear in the Comparison Source Records view alongside the diffed fields but are never themselves a reason for the anomaly to fire. Typical use: showing customer_name or created_at next to the differing column so anomaly triagers can identify the row without leaving the page.

The Filter Clause

The filter clause is a Spark SQL WHERE expression applied to the target container before matching. The expression is always evaluated as Spark SQL on the dataplane, even when the underlying datastore (Snowflake, Postgres, etc.) uses a different SQL dialect. It serves two purposes:

  1. Scoping the comparison. Restrict the comparison to a subset of target rows (status = 'active', event_date = current_date(), tenant_id = 42). Rows outside the scope cannot be reported as removed and cannot pair with reference rows.
  2. Avoiding noise from known divergences. Filter out rows that are intentionally allowed to differ between target and reference (for example, a staging_only = true flag), so the check focuses on the rows that must match.

The filter is part of the check definition, so the anomaly message includes the filter expression ([filter: <expression>]) when one is set, making it explicit which slice of the target was evaluated when the anomaly fired.

The filter does not scope the reference

The filter only narrows the target side. The reference container is read in full. If you also need to narrow the reference (for example, comparing only this month's records on both sides), point the check at a view in the reference datastore that already encodes the same scope.

Comparators

Comparators apply a per-field tolerance to the equality check between left and right values. Without a Comparator, the platform compares values strictly: 1.00 and 1.000001 differ, "Australia" and "australia" differ.

### Comparators

The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:

Numeric

Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.

Comparison Type
  • Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
  • Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold

The threshold is the value you set to define the margin of error:

  • When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
  • For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value

In this example, it compares Value A and Value B according to the defined Threshold of 50.

Value A Value B Difference Are equal?
100 150 50 True
100 90 10 True
100 155 55
False
100 49 51
False
Illustration using Percentage Value

In this example, it compares Value A and Value B according to the defined Threshold of 10%.

Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100

Value A Value B Percentage Change Are equal?
120 132 10% True
150 135 10% True
200 180 10% True
160 150 6.25% True
180 200 11.11%
False

Duration

Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.

Unit

The unit of time you select determines how granular the comparison is:

  • Millis: Measures time in milliseconds, ideal for high-precision needs.
  • Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
  • Days: Best for longer durations.
Value

Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.

Illustration using Duration Comparator
Unit Value A Value B Difference Threshold Are equal?
Millis 500 ms 520 ms 20 ms 25 ms True
Seconds 30 sec 31 sec 1 sec 2 sec True
Days 5 days 7 days 2 days 1 day
False
Millis 1000 ms 1040 ms 40 ms 25 ms
False
Seconds 45 sec 48 sec 3 sec 2 sec
False

String

String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.

Ignore Whitespace

When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.

Illustration

In this example, it compares Value A and Value B according to the defined string comparison to ignore whitespace as True.

Value A Value B Are equal? Has whitespace?
Leonidas Leonidas True No
Beth Beth True Yes
Ana Anna
False
Yes
Joe Joel
False
No

The Resulting Shape Anomaly

When the Data Diff check fires, it produces a single Shape Anomaly describing the dataset-level violation, with the per-row detail attached as Comparison Source Records. The check does not produce Record Anomalies; the diff is a property of the target as a whole.

Anomaly message format

There are N records that differ between <reference_container> (R records) and <target_container> (T records) in <reference_datastore_name>

When a filter is set, the message is followed by [filter: <expression>].

What the numbers mean

  • N: the number of differing rows (added + removed + changed).
  • R: total row count in the reference container after read.
  • T: total row count in the target container after the filter is applied.

Comparison Source Records view

When Row Identifiers are configured, the per-row detail is rendered in the Comparison Source Records view rather than the standard Source Records list. The columns, in order:

  1. Row Status: one of added, removed, changed.
  2. Row Identifier: the identifier value (or tuple) that pairs the target and reference rows.
  3. For each listed field: a parent column spanning two sub-columns, Left (target) and Right (reference).

Only the right-side cell of the differing field is highlighted; the left side and the row identifier are not. When a row is removed, the right-side cell shows the literal text missing; when a row is added, the left-side cell shows missing.

See Comparison Source Records for the full UI reference.

Relationship with Other Rule Types

Data Diff is the only rule type that performs a two-table row-by-row comparison. A few rule types overlap with parts of its job and are worth pairing or substituting depending on the situation:

Rule Type When to use it instead of (or alongside) Data Diff
Is Replica Of Deprecated. Use Data Diff for any new check; Is Replica Of is preserved only for existing checks.
Volumetric Checks Use when you only need to confirm that the row count in the target matches the reference, without comparing each field. Cheaper to evaluate than Data Diff on wide tables.
Aggregation Comparison Use when you want to compare a summary statistic between the two datasets (sum, average) rather than per-row values. Useful for sanity-checking large fact tables where row-level diff is too expensive.
Exists In / Not Exists In Use when you only need referential-integrity semantics (every target row's identifier exists in the reference) rather than full value comparison.
Equal To Field Use when the two sides being compared are two fields on the same row (target and reference are the same container), not two separate containers.

Performance Considerations

Data Diff is the most expensive comparison rule type, because it reads both containers in full and joins them on the Row Identifier tuple (or performs a symmetrical set difference when no identifiers are set). Two practical implications:

  • Choose Row Identifiers that are present on both sides and have low cardinality per row. Pairing on a single integer identifier is the cheapest case; pairing on a wide composite (4+ string fields) costs noticeably more.
  • Filter the target side to the smallest meaningful slice. A filter that narrows the target to the current day or current tenant shrinks the comparison workload proportionally and is a common pattern when uniqueness only matters within a scope.

When per-field diff isn't needed, a Volumetric Check or an Aggregation Comparison can answer "did anything diverge?" at a fraction of the cost.

  • Introduction: formal definition, field scope, and general/anomaly properties.
  • Examples: three production scenarios with sample data and resulting anomalies.
  • API: payload shape and field notes for creating a Data Diff check programmatically.
  • FAQ: short answers to the most frequent questions.