Skip to content

How Expected Values Checks Work

This page covers everything the Expected Values check does, in detail: how it decides which rows are violations, how it handles NULLs and types, what the filter clause does, how coverage controls whether you get Record or Shape anomalies, and how the check behaves on Array fields.

If you only need a quick reference, the Introduction covers the formal definition, field scope, and general/anomaly properties.

How the Check Evaluates Values

Every Expected Values check follows the same evaluation flow:

  1. Apply the filter clause. If the check has a filter set, only the rows that match the filter expression continue to the next step. Rows outside the filter are ignored and cannot cause a violation.
  2. Skip NULL values. Rows where the selected field is NULL pass the check automatically and are never reported as violations.
  3. Compare each non-NULL value to the list. A row passes when the value is exactly equal to one of the list entries (same type, same case, same whitespace). A row fails when the value is non-NULL and not in the list.
  4. Decide Record or Shape anomaly based on coverage. At 100% coverage every failing row becomes a Record Anomaly; below 100% coverage, the violation is rolled up into a single Shape Anomaly when the failing fraction exceeds the configured threshold (see Coverage and Tolerance).

The order of operations matters: the filter is applied before the value comparison, so rows that the filter excludes cannot contribute to the violation count.

NULL Handling

NULL values pass the Expected Values check. The engine evaluates column IS NULL OR column IN (list), so a NULL is treated as compliant regardless of what is in the list. This is intentional: Expected Values asserts membership in a vocabulary, not presence.

If missing values are also a violation, pair Expected Values with a Not Null check on the same field. Together they enforce "this field must be present and must match the vocabulary."

Type Coercion and Matching Rules

The list is interpreted according to its contents and the column type:

  • All entries numeric → the list is treated as a numeric list. Integers and decimals are compared numerically. 100 matches 100.0.
  • All entries boolean (true, false) → the list is treated as a boolean list. The column must be of Boolean type.
  • Anything else (including mixed or empty) → the list is treated as strings and the column value is compared as a string.

A few practical consequences worth knowing:

  • Case-sensitive. "O" does not match "o". If you need case-insensitive matching, normalize the source data upstream or use a Satisfies Expression check with lower(field) IN (...).
  • Whitespace-sensitive. " O " does not match "O". The Authored Check modal warns when an entry has leading or trailing whitespace, but it does not trim it for you.
  • Type mismatches fail. If the column is integer and the list is ["O", "F", "P"] (strings), Spark cannot coerce "O" to an integer; every non-NULL row will fail the check. Keep the list's type aligned with the field's type.
  • Date and Timestamp. There is no native date list type; values are compared as strings against the column's rendered form. Prefer the ISO-8601 representation that matches how your engine serializes the column (YYYY-MM-DD for Date, YYYY-MM-DD HH:MM:SS for Timestamp).

Array Fields

When the check targets an Array[String] field, the platform automatically switches to element-wise evaluation:

  • The check evaluates array_forall(field, element -> element IN (list)).
  • Every element of the array must be in the list. A row fails as soon as one element is outside the vocabulary, even if the other elements pass.
  • An empty array passes (no elements to evaluate). A NULL array also passes (NULL handling rules above).

Element-wise evaluation is supported only for string element lists today. Array fields with numeric or boolean lists fall through to scalar comparison and are not currently recommended. See Complex Data Types for the broader picture of how rule types interact with arrays.

The Filter Clause

The filter clause is a Spark SQL WHERE expression applied before the value comparison. Use it to:

  1. Scope the check. Restrict the vocabulary requirement to a subset of the data (for example, region = 'NA' to enforce a North-American country code vocabulary only on orders from that region).
  2. Exclude legacy or pending rows. A filter such as created_at >= '2024-01-01' keeps the check from flagging values that were valid under an earlier vocabulary.

The filter is part of the check definition, so the anomaly message includes the filter expression ([filter: <expression>]) when one is set, making it explicit which slice of data was evaluated when the anomaly fired.

Coverage and Tolerance

Coverage is a fractional value between 0 and 1 that controls whether failing rows produce Record anomalies (one per row) or a single Shape anomaly (the dataset-level rollup):

  • 1.0 (100%, default): every non-NULL row must match the list. Each failing row produces a Record Anomaly.
  • < 1.0: the check tolerates a fraction of records failing the comparison. When the actual failing fraction exceeds the threshold, a single Shape Anomaly fires for the dataset. No Record Anomalies are produced in this mode.

Lower coverage values are useful when a small known fraction of out-of-vocabulary values is expected (a slow migration, a deprecation window, a legacy import). Use it with care: lowering coverage by 0.5% means a regression introducing failures in up to 0.5% of rows will look identical to the tolerated baseline and won't fire.

The Resulting Anomalies

Record Anomaly

Emitted at 100% coverage, one per failing row.

The field '<field>' has value '<value>', which is not in the list of expected values

Shape Anomaly

Emitted below 100% coverage, one per dataset, when the failing fraction exceeds the tolerance threshold.

For the field '<field>', X.XXX% of N records (K) have values not in the list of expected values
  • X.XXX%: the fraction of evaluated rows that failed the comparison.
  • N: the number of rows evaluated (after the filter, if any).
  • K: the number of rows whose value was non-NULL and not in the list.

When a filter is set, both messages are followed by [filter: <expression>].

Source Records and How Anomalies Render

When you open the source records for an Expected Values anomaly, the platform highlights only the offending cell for the field the check applies to. Other columns in the same row are rendered normally. This matches how the check semantics work: the violation is about a single field's value, not the entire row.

The Sample Data tables on the Examples page follow the same convention: one row per source record, with only the failing cell visually marked.

Relationship with Other Rule Types

The Expected Values check is most powerful when used together with a small set of complementary rule types:

Rule Type Why pair it with Expected Values
Not Null Expected Values lets NULLs pass. Pair with Not Null on the same field when missing values are also a violation.
Required Values Expected Values asserts that values come from a list (subset). Required Values asserts that the field contains at least the values in a list (superset). They are mirror images: use both when the column must contain every required value and nothing outside the vocabulary.
Matches Pattern Use Matches Pattern when the accepted values follow a regular structure rather than a finite list (for example, SKU-\d{4}). Expected Values is the right tool when the vocabulary is small and enumerable.
Satisfies Expression Use a custom Spark SQL expression when you need case-insensitive matching (lower(field) IN ('o', 'f', 'p')) or any other normalization Expected Values does not perform natively.

The Expected Values check can also be created automatically as an AI Managed check when a Profile operation detects that a column has a small, stable set of distinct values. See the AI Managed Checks Introduction for how AI Managed checks are generated and how editing one converts it to Authored.