# Daily sync, profile, and scan

## Goal
Run the standard data quality pipeline on a schedule: refresh the catalog (sync), refresh statistics and regenerate AI Managed checks (profile), then run all active checks against current data (scan). This is the recurring workflow most production users hit every day.
## Permissions

| Step | Endpoint | Role | Team permission |
|---|---|---|---|
| Trigger any operation | `POST /api/operations/run` | Member | Editor on the datastore's team |
| Poll operation status | `GET /api/operations/{id}` | Member | Reporter |
| List anomalies after the scan | `GET /api/anomalies` | Member | Reporter |
## Prerequisites

- The datastore exists and has been synced at least once.
- Your token has `Editor` permission on the datastore's owning team.
- The datastore has at least one active check (otherwise `scan` runs without producing anomalies). See Bulk-create quality checks.
## CLI workflow

```mermaid
graph LR
    Sync[1. sync] --> Profile[2. profile]
    Profile --> Scan[3. scan]
    Scan --> Review[4. List anomalies]
```
### Foreground

Good for ad-hoc runs and CI jobs you want to gate on success:

```shell
qualytics operations sync --datastore-id 42
qualytics operations profile --datastore-id 42
qualytics operations scan --datastore-id 42
qualytics anomalies list --datastore-id 42 --status Active
```
Each operation prints progress while it runs and exits non-zero on failure.
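That non-zero exit is all a CI gate needs. The same chaining can be sketched in Python; the argv lists mirror the commands above, and `run_pipeline` is an illustrative helper, not part of the CLI:

```python
import subprocess


def run_pipeline(commands):
    """Run each argv list in order; stop at the first non-zero exit.

    Mirrors `cmd1 && cmd2 && cmd3`: returns True only if every step succeeded.
    """
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True


# The three pipeline steps from above, as argv lists:
PIPELINE = [
    ["qualytics", "operations", "sync", "--datastore-id", "42"],
    ["qualytics", "operations", "profile", "--datastore-id", "42"],
    ["qualytics", "operations", "scan", "--datastore-id", "42"],
]
```

Call `run_pipeline(PIPELINE)` from a CI step and exit non-zero on `False` to fail the job.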
### Background

Good when you don't need to wait or you want parallelism:

```shell
qualytics operations sync --datastore-id 42 --background
qualytics operations profile --datastore-id 42 --background
qualytics operations scan --datastore-id 42 --background
```
Then check status whenever you want:
### Across many datastores in one command

`operations sync`/`profile`/`scan` accept a comma-separated list of datastore IDs:

```shell
qualytics operations sync --datastore-id 42,43,44
qualytics operations profile --datastore-id 42,43,44 --ai-effort medium
qualytics operations scan --datastore-id 42,43,44
```
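On the wire, the comma-separated flag becomes a single `datastore_ids` array in the `POST /api/operations/run` body. A sketch of that translation, with `parse_ids` and `build_run_payload` as hypothetical helpers (not part of the CLI or API client):

```python
def parse_ids(raw: str) -> list[int]:
    """Split a comma-separated ID string, as the --datastore-id flag accepts."""
    return [int(part) for part in raw.split(",") if part.strip()]


def build_run_payload(op_type: str, raw_ids: str) -> dict:
    """Request body for POST /api/operations/run: type plus datastore_ids."""
    if op_type not in ("sync", "profile", "scan"):
        raise ValueError(f"unsupported operation type: {op_type}")
    return {"type": op_type, "datastore_ids": parse_ids(raw_ids)}


print(build_run_payload("sync", "42,43,44"))
```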
## Behind the scenes

| CLI step | Method | Path | Notes |
|---|---|---|---|
| `operations sync`/`profile`/`scan` (start) | POST | `/api/operations/run` | Body includes `type` (`sync`, `profile`, `scan`) and `datastore_ids`. Returns the operation ID. |
| Poll status | GET | `/api/operations/{operation_id}` | The CLI polls every `--poll-interval` seconds (default 5) up to `--timeout` (default 1800). |
| List anomalies | GET | `/api/anomalies` | Filterable by datastore, status, container, tag, dates. |
Each operation runs asynchronously on the Qualytics side; the CLI is just polling its status.
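That polling loop can be factored out so the HTTP fetch is injected as a callable. This is a sketch of the logic only, not the CLI's actual implementation; the terminal `result` values and the 5 s / 1800 s defaults come from the table above:

```python
import time


def wait_for_operation(fetch_status, poll_interval=5.0, timeout=1800.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until the operation reaches a terminal result.

    fetch_status() returns the operation dict from GET /api/operations/{id};
    "success", "failure", and "aborted" are treated as terminal.
    """
    deadline = clock() + timeout
    while True:
        op = fetch_status()
        if op.get("result") in ("success", "failure", "aborted"):
            return op
        if clock() >= deadline:
            raise TimeoutError(f"operation still running after {timeout}s")
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the loop unit-testable without real waiting.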
## Python equivalent

```python
import os
import time

import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN = os.environ["QUALYTICS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
DATASTORE_ID = 42


def run_op(client, op_type: str, timeout: int = 1800) -> dict:
    r = client.post(
        f"{BASE_URL}/api/operations/run",
        json={"type": op_type, "datastore_ids": [DATASTORE_ID]},
    )
    r.raise_for_status()
    op_id = r.json()["id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        op = client.get(f"{BASE_URL}/api/operations/{op_id}").json()
        if op["result"] in ("success", "failure", "aborted"):
            return op
        time.sleep(5)
    raise TimeoutError(f"{op_type} did not finish within {timeout}s")


with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    for op_type in ("sync", "profile", "scan"):
        result = run_op(client, op_type)
        print(f"{op_type:>8}: {result['result']}")

    anomalies = client.get(
        f"{BASE_URL}/api/anomalies",
        params={"datastore_id": DATASTORE_ID, "status": "Active"},
    ).json()
    print(f"{len(anomalies)} active anomalies")
```
## Variations and advanced usage

### AI-assisted profiling

```shell
qualytics operations profile --datastore-id 42 --ai-effort high
qualytics operations profile --datastore-id 42 --ai-effort high --infer-as-draft
```

`--infer-as-draft` means new AI Managed checks land in Draft status, awaiting human review. Use it whenever a person should sign off on the generated rules before they start producing anomalies. See AI Managed Checks for the full conceptual model.
### Auto-resolve fixed anomalies

When a previously failing check passes again, you can optionally close out its anomaly automatically; see the Operations command reference for the relevant `scan` flag.
### Long timeouts for huge datastores

The default operation timeout is 30 minutes (`--timeout 1800`). For warehouses with billions of rows, pass a larger ceiling, e.g. `--timeout 7200` for two hours. Or run with `--background` and don't tie up the shell at all.
### Targeted runs

Don't profile or scan the whole datastore if only a few tables changed:

```shell
qualytics operations scan --datastore-id 42 --container-names "orders,customers"
qualytics operations scan --datastore-id 42 --container-tags "critical"
```

See Targeted scans by container or tag.
### Crontab on Linux/macOS

```shell
# Every day at 3 AM, run the full pipeline on datastore 42.
# Load credentials from a restricted file instead of inline:
#   echo 'export QUALYTICS_TOKEN=...' > /etc/qualytics-secrets && chmod 600 /etc/qualytics-secrets
# Braces group the three steps so the log redirect captures all of them,
# not just the last command.
0 3 * * * . /etc/qualytics-secrets && export QUALYTICS_NO_BANNER=1 && { /usr/local/bin/qualytics operations sync --datastore-id 42 && /usr/local/bin/qualytics operations profile --datastore-id 42 && /usr/local/bin/qualytics operations scan --datastore-id 42; } >> /var/log/qualytics.log 2>&1
```
For more sophisticated scheduling and exports, see Scheduled metadata exports.
## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| Operation times out at 30 minutes | Default timeout reached | Add `--timeout 7200` or run with `--background`. |
| `403 Forbidden` on `operations run` | Missing Editor team permission | Confirm team membership; ask a team admin to grant Editor. |
| `scan` reports "no checks to run" | The datastore has no active checks | Confirm with `qualytics checks list --datastore-id 42 --status Active`; if empty, profile with `--ai-effort high` or create checks manually. |
| `profile` runs forever | No record limit is set and the table is huge | Set `--max-records-analyzed-per-partition`; see Operations. |
| Anomaly count is huge after the first scan | The first scan after profile inference can produce many anomalies | Triage with Bulk anomaly triage. |
## Related

- Operations command reference: every flag for `sync`, `profile`, `scan`, `materialize`, `export`.
- Targeted scans by container or tag: run on subsets, not the whole datastore.
- Incremental scans for large tables: scan only what changed since the last run.
- Bulk anomaly triage: handling the output of scans.