Incremental scans for large tables
Goal
Scan only the rows that have changed since the last run, rather than re-scanning billions of rows every night. This is the only practical way to keep daily quality checks running on large warehouses.
Permissions
| Step | Endpoint | Role | Team permission |
|---|---|---|---|
| Run scan | POST /api/operations/run | Member | Editor on the datastore's team |
| Check the container's incremental field config | GET /api/containers/{id} | Member | Reporter |
Prerequisites
- The container has an incremental field configured. This is a column with monotonically increasing values (a timestamp like updated_at, or a batch number / sequence ID). Configure it from the web app under the container's settings, or via API.
- The data is actually monotonic on that column. If rows can be updated in place after their initial timestamp, incremental scans will miss those updates.
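The second prerequisite is easy to overlook, so here is a small simulation of it in plain Python. The row layout and the `rows_newer_than` helper are hypothetical; they only illustrate the filtering logic and say nothing about how Qualytics reads data. The strict "greater than" comparison is an assumption based on the flag name.

```python
from datetime import datetime


def rows_newer_than(rows, field, threshold):
    """Return the rows whose `field` value is strictly greater than `threshold`."""
    return [row for row in rows if row[field] > threshold]


# Hypothetical table snapshot: row 1 was created before the last scan and
# edited afterwards, so only its `updated_at` value reflects the change.
rows = [
    {"id": 1, "created_at": datetime(2026, 5, 1), "updated_at": datetime(2026, 5, 8)},
    {"id": 2, "created_at": datetime(2026, 5, 8), "updated_at": datetime(2026, 5, 8)},
]
last_scan = datetime(2026, 5, 7)

# Filtering on created_at misses the edited row; filtering on updated_at does not.
seen_by_created = [r["id"] for r in rows_newer_than(rows, "created_at", last_scan)]
seen_by_updated = [r["id"] for r in rows_newer_than(rows, "updated_at", last_scan)]
print(seen_by_created)
print(seen_by_updated)
```

With `created_at` as the incremental field, the edit to row 1 never enters the scan window; with `updated_at` it does.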
No incremental field, no incremental scan
--incremental requires a configured incremental field. Without one the scan refuses to start with a clear error.
CLI workflow
graph LR
Last[Last successful scan] --> Mark[Mark high-water value]
Mark --> Scan[scan --incremental --greater-than-time]
Scan --> Anom[New anomalies for new rows only]
By timestamp column
qualytics operations scan \
--datastore-id 42 \
--incremental \
--greater-than-time "2026-05-07T00:00:00"
By batch / sequence number
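For a batch or sequence column, the request body described under "Behind the scenes" carries greater_than_batch instead of greater_than_time. Below is a minimal sketch of building either body. The helper name, the numeric type of the batch value, and the rule that exactly one threshold is set are assumptions; the field names come from this page.

```python
from typing import Optional, Sequence, Union


def build_incremental_scan_body(
    datastore_id: int,
    container_names: Sequence[str],
    greater_than_time: Optional[str] = None,
    greater_than_batch: Optional[Union[int, float]] = None,
) -> dict:
    """Build a POST /api/operations/run body for an incremental scan.

    Assumption: exactly one of the two thresholds is supplied per request.
    """
    if (greater_than_time is None) == (greater_than_batch is None):
        raise ValueError("pass exactly one of greater_than_time or greater_than_batch")

    body = {
        "type": "scan",
        "datastore_ids": [datastore_id],
        "container_names": list(container_names),
        "incremental": True,
    }
    if greater_than_time is not None:
        body["greater_than_time"] = greater_than_time
    else:
        body["greater_than_batch"] = greater_than_batch
    return body


# Example: scan rows whose batch number is above 1500 (hypothetical value).
print(build_incremental_scan_body(42, ["orders"], greater_than_batch=1500))
```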
Wired into a daily script
Track the high-water mark in a file, advance it after each successful scan:
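#!/usr/bin/env bash
set -euo pipefail
WATERMARK_FILE=~/.qualytics/orders-watermark
# Make sure the watermark directory exists before the first write.
mkdir -p "$(dirname "$WATERMARK_FILE")"
# First run: no watermark file yet, so fall back to the epoch.
LAST=$(cat "$WATERMARK_FILE" 2>/dev/null || echo "1970-01-01T00:00:00")
# Capture the next watermark before the scan starts, so rows that arrive
# while the scan is running are picked up by the next run.
NOW=$(date -u +%Y-%m-%dT%H:%M:%S)
qualytics operations scan \
--datastore-id 42 \
--container-names orders \
--incremental \
--greater-than-time "$LAST"
# Reached only if the scan command exited with status 0 (set -e aborts otherwise).
echo "$NOW" > "$WATERMARK_FILE"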
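The same watermark bookkeeping can be sketched in Python for pipelines that are not shell-based. The function names are hypothetical, and the write-to-temp-then-rename step is an added precaution against a half-written file rather than something the script above does.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

EPOCH = "1970-01-01T00:00:00"


def read_watermark(path: Path, default: str = EPOCH) -> str:
    """Return the stored watermark, or `default` if the file is missing or empty."""
    try:
        text = path.read_text().strip()
    except FileNotFoundError:
        return default
    return text or default


def write_watermark(path: Path, value: str) -> None:
    """Write the watermark via a temp file and rename, so readers never see a partial value."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(value + "\n")
    os.replace(tmp_name, path)


def utc_now_iso() -> str:
    """Current UTC time in the same format the shell script produces."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
```

As in the shell version, call `utc_now_iso()` before triggering the scan and `write_watermark()` only after the scan has succeeded.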
Behind the scenes
| CLI step | Method | Path | Notes |
|---|---|---|---|
| Trigger incremental scan | POST | /api/operations/run | Body includes type: scan, incremental: true, and either greater_than_time or greater_than_batch. |
| Poll status | GET | /api/operations/{id} | Same as a full scan. |
The Qualytics scanner pushes the threshold down to the source database as a WHERE clause, so the warehouse only returns the new rows.
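To make the pushdown concrete, the sketch below runs a thresholded query against an in-memory SQLite table. It is an illustration only: the table, the column, and the exact SQL are assumptions, and the query Qualytics generates for a given warehouse may differ (including whether the comparison is strict).

```python
import sqlite3

# Hypothetical source table with an ISO-8601 text timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders (id, updated_at) VALUES (?, ?)",
    [
        (1, "2026-05-06T23:59:59"),
        (2, "2026-05-07T00:00:00"),
        (3, "2026-05-07T08:30:00"),
    ],
)


def fetch_incremental(connection, threshold):
    """Return ids of rows whose updated_at is strictly greater than `threshold`.

    The filter runs inside the database, so only matching rows are returned.
    """
    cursor = connection.execute(
        "SELECT id FROM orders WHERE updated_at > ? ORDER BY id",
        (threshold,),
    )
    return [row[0] for row in cursor.fetchall()]


new_ids = fetch_incremental(conn, "2026-05-07T00:00:00")
print(new_ids)
```

Because the predicate is evaluated by the database, rows at or below the threshold never leave it.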
Python equivalent
import os
import time

import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN = os.environ["QUALYTICS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    # Trigger an incremental scan of rows newer than the threshold.
    body = {
        "type": "scan",
        "datastore_ids": [42],
        "container_names": ["orders"],
        "incremental": True,
        "greater_than_time": "2026-05-07T00:00:00",
    }
    r = client.post(f"{BASE_URL}/api/operations/run", json=body)
    r.raise_for_status()
    op_id = r.json()["id"]

    # Poll every 10 seconds until the operation reports a terminal result.
    while True:
        op = client.get(f"{BASE_URL}/api/operations/{op_id}").json()
        if op["result"] in ("success", "failure", "aborted"):
            print(f"incremental scan: {op['result']}")
            break
        time.sleep(10)
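The polling loop above runs until the operation finishes, however long that takes. For unattended jobs, a bounded wait may be preferable. The helper below is a sketch under assumptions: `wait_for_result` and its parameters are hypothetical names, and the terminal values are the three used in the example above.

```python
import time
from typing import Callable

TERMINAL_RESULTS = ("success", "failure", "aborted")


def wait_for_result(
    fetch_operation: Callable[[], dict],
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `fetch_operation` until its "result" is terminal or the timeout passes.

    `fetch_operation` is any callable returning the operation as a dict, for
    example a lambda wrapping the GET /api/operations/{id} call.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch_operation().get("result")
        if result in TERMINAL_RESULTS:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"operation still not finished after {timeout_s} seconds")
        sleep(interval_s)
```

Inside the `with httpx.Client(...)` block it could be called as `wait_for_result(lambda: client.get(f"{BASE_URL}/api/operations/{op_id}").json())`.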
Variations and advanced usage
Pair with --auto-resolve-passed-anomalies
When you scan only new rows, old anomalies don't get re-evaluated and won't auto-close. Pair the flags for the closest thing to "always-current" state:
qualytics operations scan \
--datastore-id 42 \
--incremental \
--greater-than-time "$LAST_RUN" \
--auto-resolve-passed-anomalies
Multiple containers, different watermarks
Each container has its own incremental field configuration; the same --greater-than-time applies to all of them in one CLI call. If your tables drift apart, run separate calls per container.
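When going through the API instead of the CLI, the same idea amounts to one request body per container. The following is a sketch of that, assuming the body shape shown in the Python equivalent above; `bodies_per_container` and the sample watermarks are hypothetical.

```python
from typing import Dict, List


def bodies_per_container(datastore_id: int, watermarks: Dict[str, str]) -> List[dict]:
    """Build one incremental-scan body per container, each with its own threshold."""
    bodies = []
    # Sort by container name so the order of requests is deterministic.
    for name, threshold in sorted(watermarks.items()):
        bodies.append(
            {
                "type": "scan",
                "datastore_ids": [datastore_id],
                "container_names": [name],
                "incremental": True,
                "greater_than_time": threshold,
            }
        )
    return bodies


# Hypothetical per-container watermarks that have drifted apart.
watermarks = {
    "orders": "2026-05-07T00:00:00",
    "customers": "2026-05-01T00:00:00",
}
bodies = bodies_per_container(42, watermarks)
for body in bodies:
    print(body["container_names"], body["greater_than_time"])
```

Each body would then be sent in its own POST /api/operations/run request.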
Profile incrementally too
Profile supports the same flags. Useful for warehouses where new partitions arrive daily:
qualytics operations profile \
--datastore-id 42 \
--container-names orders \
--greater-than-time "$LAST_RUN"
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Error: "incremental field not configured" | The container doesn't have an incremental field set | Configure one in the web app under the container's settings, or via API. |
| Scan is fast but anomaly count is suspicious | Watermark drifted; you skipped some rows | Run a full (non-incremental) scan to catch up, then resume incremental. |
| Updates to existing rows are not flagged | The incremental field doesn't change on update | Use a column that's updated on every row mutation (e.g., updated_at, not created_at). |
| Different containers need different watermarks | The CLI sends one threshold per call | Run a separate scan per container with its own watermark. |
Related
- Operations command reference: every scan flag in detail.
- Daily sync, profile, and scan: the full pipeline.
- Targeted scans by container or tag: combine name/tag filtering with incremental.