How Data Diff Checks Work

This page covers everything the Data Diff check does, in detail: how it matches rows between target and reference datasets, the three diff statuses it produces, how Row Identifiers, Comparators, and diff_change_types change the evaluation, and how the resulting Comparison Source Records are rendered in the UI.

If you only need a quick reference, the Introduction page covers the formal definition, field scope, and general/anomaly properties. This page is the detailed reference.

How the Check Evaluates the Two Datasets

Every Data Diff check follows the same five-step evaluation flow:

Apply the filter clause. If the check has a filter set, only the rows in the target container that match the filter expression continue to the next step. The reference container is read in full (the filter scopes the target, not the reference).
Read the reference container. The platform reads the comparison fields from the reference table or file in the configured reference datastore.
Match rows. With Row Identifiers configured, rows are matched by the combination of identifier values: target rows whose identifier values match a reference row are paired up. Without Row Identifiers, the platform performs a symmetrical set difference on the full set of compared fields.
Compare fields on each matched pair. For every paired row, each listed field is compared between left (target) and right (reference). Comparators (numeric, duration, string) define the tolerance for the comparison; without a Comparator the comparison is strict equality.
Emit a single Shape Anomaly summarizing the diffs. Every row that differs (added, removed, or changed) becomes part of the anomaly's source records. The dataset-level violation is summarized in a single Shape Anomaly message and the per-row detail is surfaced through the Comparison Source Records view.

The order matters: the filter is applied before matching, so rows the filter excludes on the target side cannot pair with any reference row and cannot contribute to the anomaly.

The Three Diff Statuses

Each differing row carries one of three status values. The status is what drives the Comparison Source Records UI and is exposed in the anomaly payload.

Status	Meaning
`removed`	The row's identifier exists only on the left (target). The reference dataset is missing this row.
`added`	The row's identifier exists only on the right (reference). The reference dataset has a row the target does not.
`changed`	The same identifier exists on both sides, but at least one listed field has a different value. The differing fields are reported per row.

Without Row Identifiers, only added and removed are produced, because the platform has no key to match rows on. A row that differs in even one field becomes one removed (the row from target) and one added (the row from reference), in a pure symmetrical set difference.

Row Identifiers matter for clarity

Setting Row Identifiers turns most discrepancies into a single changed row rather than a pair of removed/added rows. This is the only way to get a per-field diff (left vs right on the same row) in the Comparison Source Records view. Pick Row Identifiers that uniquely identify each row in both target and reference.

Restricting Anomalies by Status

The diff_change_types property restricts which of the three diff statuses are allowed to produce an anomaly. By default, every differing row (added, removed, or changed) contributes to the anomaly count. Setting diff_change_types to a subset of statuses keeps the comparison running over all rows but only flags the diffs whose status appears in the list.

Value omitted or `null`	Behavior
Property absent from `properties`	All three statuses (`added`, `removed`, `changed`) fire anomalies.
`["added"]`	Only rows present on the reference side but missing on the target are flagged.
`["removed"]`	Only rows present on the target side but missing on the reference are flagged.
`["changed"]`	Only rows present on both sides whose listed fields differ are flagged.
Any combination of the three values	Only the listed statuses fire.
`[]` (empty list)	Rejected at the API with HTTP 422; at least one status must be selected.

Typical use cases:

Reference is intentionally a superset of the target. The target is a production table whose reference (a staging or canonical source) carries extra QA or onboarding rows that production is not expected to receive. Set diff_change_types to ["removed", "changed"] to suppress added-row noise.
Reference is intentionally a subset of the target. A reporting view may exclude soft-deleted or out-of-scope rows. Set diff_change_types to ["added", "changed"] to suppress removed-row noise.
Only the per-field diff matters. When row presence on both sides is guaranteed by an upstream contract and only value drift is interesting, set diff_change_types to ["changed"].

The property is dataDiff-only. Sending it on an isReplicaOf check is rejected at the API with HTTP 422. The property is editable through PUT /api/quality-checks/{id}, so the subset can be tuned without recreating the check.

Row Identifiers and Passthrough Fields

Two optional properties shape what the Comparison Source Records view looks like:

Row Identifiers

A list of fields that form the compound key the platform uses to pair target and reference rows. The identifier tuple must exist on both sides. Typical choices are primary keys (customer_id, order_id) or a composite of business-key columns (order_id, line_number).

When Row Identifiers are set:

Matched rows produce a changed diff if any listed field differs.
Unmatched target rows produce removed.
Unmatched reference rows produce added.

Passthrough Fields

Extra fields the platform should carry into the source-records output for context, even though they are not part of the comparison. Passthrough Fields appear in the Comparison Source Records view alongside the diffed fields but are never themselves a reason for the anomaly to fire. Typical use: showing customer_name or created_at next to the differing column so anomaly triagers can identify the row without leaving the page.

The Filter Clause

The filter clause is a Spark SQL WHERE expression applied to the target container before matching. The expression is always evaluated as Spark SQL on the dataplane, even when the underlying datastore (Snowflake, Postgres, etc.) uses a different SQL dialect. It serves two purposes:

Scoping the comparison. Restrict the comparison to a subset of target rows (status = 'active', event_date = current_date(), tenant_id = 42). Rows outside the scope cannot be reported as removed and cannot pair with reference rows.
Avoiding noise from known divergences. Filter out rows that are intentionally allowed to differ between target and reference (for example, a staging_only = true flag), so the check focuses on the rows that must match.

The filter is part of the check definition, so the anomaly message includes the filter expression ([filter: <expression>]) when one is set, making it explicit which slice of the target was evaluated when the anomaly fired.

The filter does not scope the reference

The filter only narrows the target side. The reference container is read in full. If you also need to narrow the reference (for example, comparing only this month's records on both sides), point the check at a view in the reference datastore that already encodes the same scope.

Comparators

Comparators apply a per-field tolerance to the equality check between left and right values. Without a Comparator, the platform compares values strictly: 1.00 and 1.000001 differ, "Australia" and "australia" differ.

### Comparators

The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:

Numeric

Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.

Comparison Type

Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.

Threshold

The threshold is the value you set to define the margin of error:

When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.

Illustration using Absolute Value

In this example, it compares Value A and Value B according to the defined Threshold of 50.

Value A	Value B	Difference	Are equal?
100	150	50	True
100	90	10	True
100	155	55	False
100	49	51	False

Illustration using Percentage Value

In this example, it compares Value A and Value B according to the defined Threshold of 10%.

Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100

Value A	Value B	Percentage Change	Are equal?
120	132	10%	True
150	135	10%	True
200	180	10%	True
160	150	6.25%	True
180	200	11.11%	False

Duration

Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.

Unit

The unit of time you select determines how granular the comparison is:

Millis: Measures time in milliseconds, ideal for high-precision needs.
Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
Days: Best for longer durations.

Value

Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.

Illustration using Duration Comparator

Unit	Value A	Value B	Difference	Threshold	Are equal?
Millis	500 ms	520 ms	20 ms	25 ms	True
Seconds	30 sec	31 sec	1 sec	2 sec	True
Days	5 days	7 days	2 days	1 day	False
Millis	1000 ms	1040 ms	40 ms	25 ms	False
Seconds	45 sec	48 sec	3 sec	2 sec	False

String

String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.

Ignore Whitespace

When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.

Illustration

In this example, it compares Value A and Value B according to the defined string comparison to ignore whitespace as True.

Value A	Value B	Are equal?	Has whitespace?
`Leonidas`	`Leonidas`	True	No
`Beth`	`Beth`	True	Yes
`Ana`	`Anna`	False	Yes
`Joe`	`Joel`	False	No

The Resulting Shape Anomaly

When the Data Diff check fires, it produces a single Shape Anomaly describing the dataset-level violation, with the per-row detail attached as Comparison Source Records. The check does not produce Record Anomalies; the diff is a property of the target as a whole.

Anomaly message format

There are N records that differ between <reference_container> (R records) and <target_container> (T records) in <reference_datastore_name>

When a filter is set, the message is followed by [filter: <expression>].

What the numbers mean

N: the number of differing rows (added + removed + changed).
R: total row count in the reference container after read.
T: total row count in the target container after the filter is applied.

Comparison Source Records view

When Row Identifiers are configured, the per-row detail is rendered in the Comparison Source Records view rather than the standard Source Records list. The columns, in order:

Row Status: one of added, removed, changed.
Row Identifier: the identifier value (or tuple) that pairs the target and reference rows.
For each listed field: a parent column spanning two sub-columns, Left (target) and Right (reference).

Only the right-side cell of the differing field is highlighted; the left side and the row identifier are not. When a row is removed, the right-side cell shows the literal text missing; when a row is added, the left-side cell shows missing.

See Comparison Source Records for the full UI reference.

Relationship with Other Rule Types

Data Diff is the only rule type that performs a two-table row-by-row comparison. A few rule types overlap with parts of its job and are worth pairing or substituting depending on the situation:

Rule Type	When to use it instead of (or alongside) Data Diff
Is Replica Of	Deprecated. Use Data Diff for any new check; Is Replica Of is preserved only for existing checks.
Volumetric Checks	Use when you only need to confirm that the row count in the target matches the reference, without comparing each field. Cheaper to evaluate than Data Diff on wide tables.
Aggregation Comparison	Use when you want to compare a summary statistic between the two datasets (sum, average) rather than per-row values. Useful for sanity-checking large fact tables where row-level diff is too expensive.
Exists In / Not Exists In	Use when you only need referential-integrity semantics (every target row's identifier exists in the reference) rather than full value comparison.
Equal To Field	Use when the two sides being compared are two fields on the same row (target and reference are the same container), not two separate containers.

Performance Considerations

Data Diff is the most expensive comparison rule type, because it reads both containers in full and joins them on the Row Identifier tuple (or performs a symmetrical set difference when no identifiers are set). Two practical implications:

Choose Row Identifiers that are present on both sides and have low cardinality per row. Pairing on a single integer identifier is the cheapest case; pairing on a wide composite (4+ string fields) costs noticeably more.
Filter the target side to the smallest meaningful slice. A filter that narrows the target to the current day or current tenant shrinks the comparison workload proportionally and is a common pattern when uniqueness only matters within a scope.

When per-field diff isn't needed, a Volumetric Check or an Aggregation Comparison can answer "did anything diverge?" at a fraction of the cost.

Introduction: formal definition, field scope, and general/anomaly properties.
Examples: three production scenarios with sample data and resulting anomalies.
API: payload shape and field notes for creating a Data Diff check programmatically.
FAQ: short answers to the most frequent questions.