Data Diff Check FAQ
Common questions about how the Data Diff check matches rows between target and reference datasets, how it handles missing or differing values, and how anomalies are reported.
Behavior
What is the difference between Data Diff and Is Replica Of?
Data Diff (dataDiff) is the current rule type for two-table comparison. Is Replica Of (isReplicaOf) is deprecated, no longer maintained, and preserved only for existing checks. Both share the same row-by-row comparison engine and the same configuration properties (Row Identifiers, Passthrough Fields, Comparators), but only Data Diff supports the diff_change_types property, which restricts which diff statuses fire anomalies. Use Data Diff for any new check.
What are Row Identifiers, and do I need them?
Row Identifiers are the compound key the platform uses to pair each target row with a reference row. Without them, the platform falls back to a symmetric set difference and can only produce added and removed diffs: a row that differs in one field becomes one removed row plus one added row, not one changed row. Setting Row Identifiers is the only way to get per-field, side-by-side diffs (Left vs Right on the same row) in the Comparison Source Records view. See How It Works → Row Identifiers and Passthrough Fields.
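The difference between the two modes can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the platform's implementation; the function names (set_diff, keyed_diff, row_key) and the sample rows are hypothetical.

```python
def row_key(row, row_identifiers):
    """Build the compound key for a row from its identifier fields."""
    return tuple(row[field] for field in row_identifiers)


def set_diff(target, reference):
    """No identifiers: a whole row either matches exactly or it does not."""
    as_tuple = lambda row: tuple(sorted(row.items()))
    target_rows = {as_tuple(row) for row in target}
    reference_rows = {as_tuple(row) for row in reference}
    return {
        "removed": [r for r in target if as_tuple(r) not in reference_rows],
        "added": [r for r in reference if as_tuple(r) not in target_rows],
        "changed": [],  # cannot be produced without a key to pair rows
    }


def keyed_diff(target, reference, row_identifiers):
    """With identifiers: pair rows by key, then compare field by field."""
    t_by_key = {row_key(r, row_identifiers): r for r in target}
    r_by_key = {row_key(r, row_identifiers): r for r in reference}
    result = {"removed": [], "added": [], "changed": []}
    for key, t_row in t_by_key.items():
        if key not in r_by_key:
            result["removed"].append(t_row)  # only on the target (left)
            continue
        r_row = r_by_key[key]
        fields = [f for f in t_row if t_row[f] != r_row.get(f)]
        if fields:
            result["changed"].append({"key": key, "fields": fields})
    for key, r_row in r_by_key.items():
        if key not in t_by_key:
            result["added"].append(r_row)  # only on the reference (right)
    return result


# Hypothetical sample data: row 2 differs only in "amount".
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
reference = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

without_ids = set_diff(target, reference)
with_ids = keyed_diff(target, reference, ["id"])
```

In this sketch, without_ids reports row 2 once as removed and once as added, while with_ids reports a single changed entry that names the amount field.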
What are Passthrough Fields?
Passthrough Fields are extra columns carried into the source-records output for context. They appear alongside the diffed fields in the Comparison Source Records view but are not themselves compared, so they never cause the anomaly to fire. Typical use: showing a customer_name or created_at next to the differing column so anomaly triagers can identify the row without leaving the page.
Does the filter clause scope both target and reference?
No. The filter only narrows the target container. The reference container is always read in full. If you also need to narrow the reference (for example, comparing only today's records on both sides), point the check at a view in the reference datastore that encodes the same scope. See How It Works → The Filter Clause.
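A small Python sketch shows why an unscoped reference matters. It assumes the matching semantics described in this FAQ (identifiers present only on the reference count as added); the function name scoped_presence_diff, the id and day fields, and the sample rows are hypothetical.

```python
def scoped_presence_diff(target, reference, target_filter, key="id"):
    """Apply the filter to the target only, then compare identifier sets."""
    filtered_target = [row for row in target if target_filter(row)]
    target_keys = {row[key] for row in filtered_target}
    reference_keys = {row[key] for row in reference}  # read in full, no filter
    return {
        "removed": sorted(target_keys - reference_keys),
        "added": sorted(reference_keys - target_keys),
    }


# Hypothetical data: both sides hold identical rows.
rows = [
    {"id": 1, "day": "yesterday"},
    {"id": 2, "day": "yesterday"},
    {"id": 3, "day": "today"},
]
is_today = lambda row: row["day"] == "today"

# Filter on the target only: the reference rows from yesterday have no match.
unscoped = scoped_presence_diff(rows, rows, is_today)

# Reference pre-narrowed, as a view encoding the same scope would do.
reference_view = [row for row in rows if is_today(row)]
scoped = scoped_presence_diff(rows, reference_view, is_today)
```

Under these assumptions, unscoped lists ids 1 and 2 as added even though the two datasets are identical, while scoped reports no differences.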
How are Comparators applied?
Comparators apply a per-field tolerance to the equality check. The platform supports Numeric, Duration, and String Comparators. Without a Comparator, values are compared strictly: 1.00 and 1.000001 differ, and "Australia" and "australia" differ. Use Comparators when small, expected divergences (rounding, casing, whitespace) would otherwise produce noise. See How It Works → Comparators.
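The idea of a tolerant equality check can be sketched as follows. The exact Comparator options are documented in How It Works → Comparators; the semantics used here (an absolute numeric margin, a maximum time gap, and case- and whitespace-insensitive string matching) and all function and parameter names are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def strict_equal(left, right):
    """No Comparator: values must match exactly."""
    return left == right


def numeric_equal(left, right, margin):
    """Assumed numeric tolerance: values within an absolute margin match."""
    return abs(left - right) <= margin


def duration_equal(left, right, tolerance):
    """Assumed duration tolerance: timestamps within a time gap match."""
    return abs(left - right) <= tolerance


def string_equal(left, right, ignore_case=False, ignore_whitespace=False):
    """Assumed string tolerance: optionally ignore casing and outer spaces."""
    if ignore_whitespace:
        left, right = left.strip(), right.strip()
    if ignore_case:
        left, right = left.casefold(), right.casefold()
    return left == right


strict_numbers = strict_equal(1.00, 1.000001)
tolerant_numbers = numeric_equal(1.00, 1.000001, margin=0.001)
strict_strings = strict_equal("Australia", "australia")
tolerant_strings = string_equal(
    "Australia", " australia ", ignore_case=True, ignore_whitespace=True
)
tolerant_times = duration_equal(
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 0, 2),
    tolerance=timedelta(seconds=5),
)
```

The strict comparisons in this sketch report a difference for both pairs from the answer above, while the tolerant versions treat them as equal.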
Anomaly Reporting
What do the row statuses mean?
| Status | Meaning |
|---|---|
| removed | The identifier exists only on the target (left). The reference is missing this row. |
| added | The identifier exists only on the reference (right). The reference has a row the target does not. |
| changed | The identifier exists on both sides, but at least one compared field has a different value. |
Can I exclude added (or removed, or changed) rows from anomalies?
Yes. The diff_change_types property (inside properties) takes a list of statuses to allow: any subset of ["added", "removed", "changed"]. Rows with a status outside the list still get computed but do not contribute to the anomaly. Set it to ["removed", "changed"] when the reference is intentionally a superset of the target (a staging table with extra QA rows), or to ["changed"] when row presence is guaranteed by an upstream contract and only value drift matters. The property defaults to all three statuses; an empty list is rejected at the API. See How It Works → Restricting Anomalies by Status.
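As a rough sketch of this behavior, the Python below filters already-computed diff rows by status. The diff_change_types name and its placement inside properties come from the description above; the helper rows_contributing, the shape of the diff rows, and the use of ValueError to stand in for the API rejection are hypothetical.

```python
ALL_STATUSES = ("added", "removed", "changed")


def rows_contributing(diff_rows, diff_change_types=None):
    """Return the diff rows whose status is allowed to fire the anomaly."""
    if diff_change_types is None:
        allowed = ALL_STATUSES  # default: all three statuses
    else:
        allowed = tuple(diff_change_types)
    if not allowed:
        # Stands in for the API rejecting an empty list.
        raise ValueError("diff_change_types must not be empty")
    unknown = set(allowed) - set(ALL_STATUSES)
    if unknown:
        raise ValueError(f"unknown statuses: {sorted(unknown)}")
    return [row for row in diff_rows if row["status"] in allowed]


# Fragment of a check payload: the reference is a superset of the target,
# so rows that exist only on the reference should not fire the anomaly.
properties = {"diff_change_types": ["removed", "changed"]}

# Hypothetical computed diff rows.
diff_rows = [
    {"id": 1, "status": "added"},
    {"id": 2, "status": "removed"},
    {"id": 3, "status": "changed"},
]

firing = rows_contributing(diff_rows, properties["diff_change_types"])
firing_default = rows_contributing(diff_rows)
```

Here all three rows are still computed, but only the removed and changed rows end up in firing.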
Why do some rows show "missing" instead of a value?
When a row is added, the target side has no row to read, so every left-side cell renders as the literal text missing. When a row is removed, the reference side has no row to read, so every right-side cell renders as missing. This is the same rendering the Comparison Source Records view uses in the Qualytics app.
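The rendering rule can be sketched in Python as below. This is only an illustration of the behavior described above; the function name render_cells, the row shape, and the sample values are hypothetical.

```python
MISSING = "missing"  # literal text shown when one side has no row


def render_cells(fields, target_row=None, reference_row=None):
    """Return (field, left, right) cells for one diff row.

    Pass target_row=None for an added row and reference_row=None for a
    removed row; the absent side renders as the literal text "missing".
    """
    cells = []
    for field in fields:
        left = target_row[field] if target_row is not None else MISSING
        right = reference_row[field] if reference_row is not None else MISSING
        cells.append((field, left, right))
    return cells


fields = ["id", "amount"]

# Added: the row exists only on the reference (right).
added_cells = render_cells(fields, reference_row={"id": 3, "amount": 30})

# Removed: the row exists only on the target (left).
removed_cells = render_cells(fields, target_row={"id": 4, "amount": 40})
```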
Why doesn't Data Diff produce a Record Anomaly?
Data Diff is a shape-only rule type: the violation is a property of the target as a whole (the set of rows that diverge from the reference), not of an individual record's value. The per-row detail you see is attached to the Shape Anomaly as Comparison Source Records, not as separate Record Anomalies.
What does the Shape Anomaly message look like?
There are N records that differ between <reference_container> (R records) and <target_container> (T records) in <reference_datastore_name>
In this template, N is the total number of differing rows (added + removed + changed), R is the reference row count, and T is the target row count after the filter is applied. A [filter: <expression>] suffix is appended when a filter is set.
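To make the template concrete, the Python sketch below assembles a message from its parts. The function name, parameter names, and sample values are hypothetical, and the single space before the filter suffix is an assumption; only the template wording is taken from above.

```python
def shape_anomaly_message(
    n_differing,
    reference_container,
    reference_count,
    target_container,
    target_count,
    reference_datastore_name,
    filter_expression=None,
):
    """Fill in the Shape Anomaly template described in this FAQ."""
    message = (
        f"There are {n_differing} records that differ between "
        f"{reference_container} ({reference_count} records) and "
        f"{target_container} ({target_count} records) "
        f"in {reference_datastore_name}"
    )
    if filter_expression:
        # Suffix appended only when a filter is set (spacing is assumed).
        message += f" [filter: {filter_expression}]"
    return message


# Hypothetical counts: 2 added + 1 removed + 4 changed = 7 differing rows.
message = shape_anomaly_message(
    n_differing=2 + 1 + 4,
    reference_container="orders_ref",
    reference_count=1001,
    target_container="orders",
    target_count=1000,
    reference_datastore_name="warehouse",
    filter_expression="order_date = current_date",
)
```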
Configuration
Can I compare two containers in the same datastore?
Yes. Set ref_datastore_id to the same datastore ID as the target's datastore and ref_container_id to the second container. The reference does not have to be in a separate datastore.
Does Custom Anomaly Description (the anomaly_message_field payload field) work for Data Diff?
No. The Custom Anomaly Description toggle (and the corresponding anomaly_message_field payload field) only affects Record Anomaly messages. Because Data Diff emits only Shape Anomalies, the field is silently ignored at evaluation time, and the resulting message uses the fixed Shape Anomaly template (see What does the Shape Anomaly message look like?).
Which fields can I edit on an existing Data Diff check?
PUT /api/quality-checks/{id} can update the description, filter, tags, status, ownership, comparison fields list, all Data Diff-specific properties (Row Identifiers, Passthrough Fields, diff_change_types, Comparators), and the reference container (ref_datastore_id, ref_container_id). The rule type and the target container itself are fixed at creation. See the API page for the full editable/immutable matrix.
Related
- Introduction: formal definition, field scope, and general/anomaly properties.
- How It Works: full semantics, Row Identifiers, Comparators, and edge cases.
- API: payload example and field notes for creating a Data Diff check programmatically.
- Examples: three production scenarios with sample data and resulting anomalies.