Incremental scans for large tables

Goal

Scan only the rows that have changed since the last run instead of re-scanning billions of rows every night. On large warehouses, this is often the only practical way to keep daily quality checks running.

Permissions

| Step | Endpoint | Role | Team permission |
| --- | --- | --- | --- |
| Run scan | POST /api/operations/run | Member | Editor on the datastore's team |
| Check the container's incremental field config | GET /api/containers/{id} | Member | Reporter |

Prerequisites

  • The container has an incremental field configured. This is a column with monotonically increasing values (a timestamp like updated_at, or a batch number / sequence ID). Configure it from the web app under the container's settings, or via API.
  • The data is actually monotonic on that column. If rows can be updated in place without the incremental field advancing (for example, when the field is a creation timestamp), incremental scans will miss those updates.
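
To see why the choice of column matters, here is a minimal sketch in plain Python. It does not call Qualytics. The rows, the field names, and the strict greater-than comparison are illustrative assumptions, not the scanner's actual logic.

```python
from datetime import datetime

# Hypothetical rows. Row 1 was created on May 1 and edited in place on May 8.
rows = [
    {"id": 1, "created_at": datetime(2026, 5, 1), "updated_at": datetime(2026, 5, 8)},
    {"id": 2, "created_at": datetime(2026, 5, 6), "updated_at": datetime(2026, 5, 6)},
    {"id": 3, "created_at": datetime(2026, 5, 8), "updated_at": datetime(2026, 5, 8)},
]

def rows_after(rows, field, watermark):
    """Return the ids of rows whose `field` is strictly greater than the watermark."""
    return [r["id"] for r in rows if r[field] > watermark]

watermark = datetime(2026, 5, 7)
by_created = rows_after(rows, "created_at", watermark)
by_updated = rows_after(rows, "updated_at", watermark)

print(by_created)  # [3]     the in-place edit to row 1 is missed
print(by_updated)  # [1, 3]  the edit is picked up
```

With created_at as the incremental field, the edit to row 1 never crosses the watermark. With updated_at it does.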

No incremental field, no incremental scan

--incremental requires a configured incremental field. Without one, the scan refuses to start and reports a clear error.

CLI workflow

graph LR
    Last[Last successful scan] --> Mark[Mark high-water value]
    Mark --> Scan[scan --incremental --greater-than-time]
    Scan --> Anom[New anomalies for new rows only]

By timestamp column

qualytics operations scan \
    --datastore-id 42 \
    --incremental \
    --greater-than-time "2026-05-07T00:00:00"

By batch / sequence number

qualytics operations scan \
    --datastore-id 42 \
    --incremental \
    --greater-than-batch 1000000

Wired into a daily script

Track the high-water mark in a file and advance it after each successful scan:

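#!/usr/bin/env bash
set -euo pipefail

WATERMARK_FILE=~/.qualytics/orders-watermark
# Make sure the directory exists so the final write cannot fail on a first run.
mkdir -p "$(dirname "$WATERMARK_FILE")"

# Fall back to the epoch when no watermark has been recorded yet.
LAST=$(cat "$WATERMARK_FILE" 2>/dev/null || echo "1970-01-01T00:00:00")
# Capture the next watermark before the scan starts, so rows that land
# while the scan is running are covered by the next run.
NOW=$(date -u +%Y-%m-%dT%H:%M:%S)

qualytics operations scan \
    --datastore-id 42 \
    --container-names orders \
    --incremental \
    --greater-than-time "$LAST"

# Reached only if the CLI exited with status 0 (set -e aborts otherwise).
echo "$NOW" > "$WATERMARK_FILE"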
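
If the daily job is written in Python rather than Bash, the same watermark bookkeeping could look roughly like the sketch below. The helper names (read_watermark, write_watermark, utc_now) are hypothetical, and the atomic write is a design choice of this sketch, not something the Bash script does.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

EPOCH = "1970-01-01T00:00:00"  # same first-run fallback as the Bash script

def read_watermark(path: Path) -> str:
    """Return the stored watermark, or the epoch if none has been recorded."""
    try:
        return path.read_text().strip() or EPOCH
    except FileNotFoundError:
        return EPOCH

def write_watermark(path: Path, value: str) -> None:
    """Write the watermark via a temp file so a crash never leaves a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        f.write(value + "\n")
    os.replace(tmp, path)  # atomic rename within the same directory

def utc_now() -> str:
    """Current UTC time in the same format the Bash script passes to the CLI."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
```

A typical run calls read_watermark, captures utc_now() before triggering the scan, and calls write_watermark only after the scan has succeeded.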

Behind the scenes

| CLI step | Method | Path | Notes |
| --- | --- | --- | --- |
| Trigger incremental scan | POST | /api/operations/run | Body includes type: scan, incremental: true, and either greater_than_time or greater_than_batch. |
| Poll status | GET | /api/operations/{id} | Same as a full scan. |

The Qualytics scanner pushes the threshold down to the source database as a WHERE clause, so the warehouse only returns the new rows.
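
The effect of the pushdown can be illustrated with an in-memory SQLite table. This is only a sketch of the idea: the table, the column name, and the exact SQL are assumptions, and the query the scanner actually generates may differ.

```python
import sqlite3

# Stand-in for the source table, with updated_at as the incremental field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2026-05-05T09:00:00"),
        (2, "2026-05-07T00:00:00"),
        (3, "2026-05-07T08:30:00"),
    ],
)

def fetch_new_rows(conn, threshold):
    """Filter in the database, so only rows past the threshold are returned."""
    cur = conn.execute(
        "SELECT id FROM orders WHERE updated_at > ? ORDER BY id",
        (threshold,),
    )
    return [row[0] for row in cur]

new_ids = fetch_new_rows(conn, "2026-05-07T00:00:00")
print(new_ids)  # [3]
```

Row 2 sits exactly on the threshold and is excluded here because the sketch uses a strict greater-than comparison, as the flag name suggests.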

Python equivalent

import os
import time
import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN    = os.environ["QUALYTICS_TOKEN"]
HEADERS  = {"Authorization": f"Bearer {TOKEN}"}

with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    # Request an incremental scan of rows newer than the threshold.
    body = {
        "type": "scan",
        "datastore_ids": [42],
        "container_names": ["orders"],
        "incremental": True,
        "greater_than_time": "2026-05-07T00:00:00",
    }
    r = client.post(f"{BASE_URL}/api/operations/run", json=body)
    r.raise_for_status()
    op_id = r.json()["id"]

    # Poll every 10 seconds until the operation reports a terminal result.
    while True:
        resp = client.get(f"{BASE_URL}/api/operations/{op_id}")
        resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
        op = resp.json()
        if op.get("result") in ("success", "failure", "aborted"):
            print(f"incremental scan: {op['result']}")
            break
        time.sleep(10)
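
To switch between the timestamp and batch variants without hand-editing the request, the body can be built by a small helper. The sketch below uses only the fields named in this guide. The function name build_scan_body is hypothetical, and requiring exactly one threshold is an assumption of the sketch, not a documented API rule.

```python
def build_scan_body(datastore_id, container_names=None,
                    greater_than_time=None, greater_than_batch=None):
    """Build the JSON body for POST /api/operations/run (incremental scan)."""
    # Assumption: exactly one threshold is supplied per request.
    if (greater_than_time is None) == (greater_than_batch is None):
        raise ValueError("pass exactly one of greater_than_time or greater_than_batch")

    body = {
        "type": "scan",
        "datastore_ids": [datastore_id],
        "incremental": True,
    }
    if container_names:
        body["container_names"] = list(container_names)
    if greater_than_time is not None:
        body["greater_than_time"] = greater_than_time
    else:
        body["greater_than_batch"] = greater_than_batch
    return body

time_body = build_scan_body(42, ["orders"], greater_than_time="2026-05-07T00:00:00")
batch_body = build_scan_body(42, greater_than_batch=1000000)
```

Either result can be passed as the json argument of the POST request.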

Variations and advanced usage

Pair with --auto-resolve-passed-anomalies

When you scan only new rows, old anomalies don't get re-evaluated and won't auto-close. Pair the flags for the closest thing to "always-current" state:

qualytics operations scan \
    --datastore-id 42 \
    --incremental \
    --greater-than-time "$LAST_RUN" \
    --auto-resolve-passed-anomalies

Multiple containers, different watermarks

Each container has its own incremental field configuration; the same --greater-than-time applies to all of them in one CLI call. If your tables drift apart, run separate calls per container.

Profile incrementally too

Profile supports the same flags. Useful for warehouses where new partitions arrive daily:

qualytics operations profile \
    --datastore-id 42 \
    --container-names orders \
    --greater-than-time "$LAST_RUN"

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Error: "incremental field not configured" | The container doesn't have an incremental field set | Configure one in the web app under the container's settings, or via API. |
| Scan is fast but anomaly count is suspicious | Watermark drifted; you skipped some rows | Run a full (non-incremental) scan to catch up, then resume incremental. |
| Updates to existing rows are not flagged | The incremental field doesn't change on update | Use a column that's updated on every row mutation (e.g., updated_at, not created_at). |
| Different containers need different watermarks | The CLI sends one threshold per call | Run a separate scan per container with its own watermark. |