Skip to content

How Unique Checks Work

This page covers everything the Unique check does, in detail: how it decides which rows are violations, how single-field and composite-key checks differ, and how NULL values affect grouping. It also covers the filter clause, coverage, the resulting anomaly, and how Unique relates to other rule types.

If you only need a quick reference, the Introduction page covers the formal definition, the two modes, field scope, and general/anomaly properties. This page is the detailed reference.

How the Check Evaluates Uniqueness

Every Unique check follows the same four-step evaluation flow, regardless of how many fields you select:

  1. Apply the filter clause. If the check has a filter set, only the rows that match the filter expression continue to the next step. Rows that fall outside the filter are ignored and cannot cause a violation.
  2. Group the remaining rows by the selected fields. For a single-field check, rows are grouped by that one column's value. For a multi-field check, rows are grouped by the tuple of values across all selected columns.
  3. Find groups with more than one row. Any group of size 2 or larger represents a duplicate: every row in that group has the same key (or tuple of keys) as at least one other row in the filtered set.
  4. Report every row in every duplicate group as a Shape Anomaly. This means the violation count reflects all rows that participate in a duplicate, not just the "extra" copies. If customer_id = 1001 appears on rows 1 and 3, both rows are reported.

The order of operations matters: the filter is applied before the grouping, so rows that the filter excludes cannot contribute to a duplicate group or to the violation count.

Single-Field vs. Composite-Key Semantics

The Unique check supports two modes, switched by the number of fields you list in fields:

Single Field

Every row's value in the selected column must be distinct from every other row's value in that column. This is the textbook uniqueness constraint and is equivalent to a SQL UNIQUE constraint on a single column (with the NULL caveat described below).

Typical use: enforcing a primary key (customer_id, order_id, sku) on a dimension or fact table where the column is the identifier.

Multiple Fields (Composite Key)

The combination of values across all selected fields must be unique on each row. Individual columns may freely repeat, but the tuple as a whole must not. This is equivalent to a composite UNIQUE constraint in SQL.

Typical use: enforcing uniqueness on a junction or line-item table where neither column alone is a key ((order_id, line_number) on order_items, (student_id, course_id) on enrollments, (user_id, event_type, event_date) on daily event aggregates).

Field order doesn't affect duplicate detection

["a", "b"] and ["b", "a"] flag the same rows as duplicates. The anomaly message reflects the order you specified (the field names are joined in the order they appear in fields).

How NULLs Are Handled

This is a common source of confusion with the Unique check, so it deserves its own section.

The platform treats NULL as a real value when grouping rows. Two rows where the selected field is NULL are placed in the same group and are therefore counted as duplicates of each other. The same applies to composite keys: two rows where every selected field is NULL share a key (a tuple of all NULLs) and will be flagged.

This is different from a SQL UNIQUE constraint, where NULL is treated as distinct from every other value, including other NULLs, so a SQL UNIQUE column can hold many NULL rows without violating the constraint. The Unique check is stricter on NULL.

When NULL handling matters

Situation What the check does
A required field that should never be NULL Combine Unique with a Not Null check. Together they enforce true primary-key semantics (UNIQUE + NOT NULL).
An optional field where many NULLs are expected Add a filter clause like external_ref IS NOT NULL so only non-NULL rows are evaluated.
A composite key where one column is optional The tuple (value, NULL) repeats only if another row has the same value in the first column AND NULL in the second column. If you also want to exclude all-NULL tuples or repeated (value, NULL) pairs, filter the optional column with IS NOT NULL. See "Composite keys and partial NULLs" below.

Composite keys and partial NULLs

For a composite key like (a, b), a row is grouped with another row only when the full tuple matches. So:

  • (1, NULL) and (1, NULL) → same group → duplicate.
  • (1, NULL) and (1, 2) → different groups → not a duplicate.
  • (NULL, NULL) repeated → same group → duplicate.

If your data model allows partial NULLs in a composite key and you do not want repeated partial-NULL tuples or the all-NULL tuple to register as duplicates, scope the check with a filter that excludes those rows.

The Filter Clause

The filter clause is a Spark SQL WHERE expression that the platform applies before the uniqueness evaluation. It serves two purposes:

  1. Scoping the check. Restrict uniqueness to a subset of the data (for example, status = 'active', event_date = current_date(), or tenant_id = 42). Rows outside the scope cannot trigger a violation and are not counted in the totals reported in the anomaly message.
  2. Working around NULL semantics. A filter such as email IS NOT NULL makes the Unique check behave like a SQL UNIQUE constraint on the email column by removing NULL rows from consideration entirely.

The filter is part of the check definition, so the anomaly message includes the filter expression ([filter: <expression>]) when one is set, making it explicit which slice of data was evaluated when the anomaly fired.

Coverage and Tolerance

Coverage is a fractional value between 0 and 1 that sets the threshold for what counts as a violating shape:

  • 1.0 (100%): every row must be part of a singleton group. Any duplicate triggers a Shape Anomaly. This is the default and the most common setting.
  • < 1.0: the check tolerates a fraction of records appearing in duplicate groups before flagging an anomaly. For example, 0.995 allows up to 0.5% of rows to participate in duplicates without firing.

Lower coverage values are useful when a small, known fraction of duplicates is expected (legacy data that has not been deduplicated yet, slow-running migrations, or a tolerated overlap between data sources). Use coverage with care: lowering it by, say, 0.5% means a real regression introducing duplicates in up to 0.5% of rows will look identical to the tolerated baseline and won't fire.

The Resulting Shape Anomaly

When the Unique check fires, it produces a single Shape Anomaly describing the dataset-level violation. The Unique check does not produce Record Anomalies: uniqueness is a property of the dataset as a whole (a duplicate group), not of any individual row's value, so the platform reports the violation at the shape level only.

Anomaly message format

Single-field check:

For the field 'order_id', X.XXX% of N records (K) contain duplicate values

Composite-key check:

For the field 'order_id, line_number', X.XXX% of N records (K) contain duplicate values

The template wording uses "field" (singular) even when multiple fields are checked; the joined list inside the quotes is what distinguishes single-field from composite-key checks at a glance.

When a filter is set, the message is followed by [filter: <expression>].

What the numbers mean

  • X.XXX%: the fraction of filtered rows that participate in any duplicate group. If 2 rows out of 4 are part of duplicates, this reads 50.000%.
  • N: the total number of rows the check evaluated (after the filter, if any).
  • K: the number of rows that participate in any duplicate group, counted in full (not "extras only"). For 2 rows sharing the same tuple, K = 2, not 1.

Because K is the full count (and not just the extras), the anomaly's row count matches 1:1 against the rows you would find by running the equivalent SQL WHERE (a, b) IN (duplicate combinations) query.

Relationship with Other Rule Types

The Unique check is most powerful when used together with a small set of complementary rule types:

Rule Type Why pair it with Unique
Not Null Together they enforce true primary-key semantics. Unique alone treats NULL like any other value, so a second NULL row becomes a duplicate of the first. Pairing it with Not Null on the same column rejects every NULL up front and gives you UNIQUE + NOT NULL.
Distinct Count Unique guarantees that the column has no duplicates; Distinct Count asserts the number of distinct values is what you expect. Use Distinct Count when the cardinality is itself meaningful (for example, "this lookup table should always have exactly 7 states").
Exists In / Not Exists In Unique handles intra-table integrity (one table's rows are unique); Exists In and Not Exists In handle referential integrity across tables (foreign keys). The three rule types together model the relational core: primary keys, no duplicates, and valid references.
Satisfies Expression Use a custom Spark SQL expression alongside Unique when uniqueness depends on a derived value (for example, uniqueness of LOWER(email) or of a hashed identifier). The Satisfies Expression check can complement Unique by enforcing the canonical form, and Unique enforces no duplicates on the raw column.

The Unique check can also be created automatically as an AI Managed check when a Profile operation detects that a column is fully unique in the observed data. See the AI Managed Checks Introduction for how AI Managed checks are generated and how editing one converts it to Authored.

Performance Considerations

Uniqueness evaluation requires the platform to group every filtered row by the selected fields, which is a distributed operation. Two practical implications:

  • Composite keys with many fields cost more to evaluate than single-field unique checks, because the grouping work scales with the size of each row's key. Most production composite keys use 2-4 fields; going much wider is rare and worth a second look at the data model.
  • Filtering before evaluation reduces the cost. A filter that narrows the dataset to a meaningful subset (today's records, a single tenant, active rows only) shrinks the grouping workload proportionally. When uniqueness only matters within a scope, set the filter to that scope explicitly.

There is no separate optimized path for single-field Unique versus composite Unique; the same evaluation flow applies. Performance differences come from how much data is grouped and how wide each key is.

  • Introduction: formal definition, modes overview, field scope, and general/anomaly properties.
  • Examples: three production scenarios with sample data and resulting anomalies.
  • API: payload shape and field notes for creating a Unique check programmatically.
  • FAQ: short answers to the most frequent questions.