Daily sync, profile, and scan

Goal

Run the standard data quality pipeline on a schedule: refresh the catalog (sync), refresh statistics and regenerate AI Managed checks (profile), then run all active checks against current data (scan). This is the recurring workflow most production users hit every day.

Permissions

| Step | Endpoint | Role | Team permission |
| --- | --- | --- | --- |
| Trigger any operation | POST /api/operations/run | Member | Editor on the datastore's team |
| Poll operation status | GET /api/operations/{id} | Member | Reporter |
| List anomalies after the scan | GET /api/anomalies | Member | Reporter |

Prerequisites

  • The datastore exists and has been synced at least once.
  • Your token has Editor permission on the datastore's owning team.
  • The datastore has at least one active check (otherwise scan runs without producing anomalies). See Bulk-create quality checks.

CLI workflow

graph LR
    Sync[1. sync] --> Profile[2. profile]
    Profile --> Scan[3. scan]
    Scan --> Review[4. List anomalies]

Foreground (good for ad-hoc runs and CI jobs you want to gate on success)

qualytics operations sync     --datastore-id 42
qualytics operations profile  --datastore-id 42
qualytics operations scan     --datastore-id 42
qualytics anomalies   list    --datastore-id 42 --status Active

Each operation prints progress while it runs and exits non-zero on failure.
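
That exit-code behavior makes the pipeline easy to gate in CI. A minimal sketch, assuming qualytics is on PATH (the run parameter is our addition so the command runner can be stubbed in tests, not a CLI feature):

```python
import subprocess

def run_pipeline(datastore_id: int, run=subprocess.call) -> int:
    """Run sync -> profile -> scan, stopping at the first failure."""
    for op in ("sync", "profile", "scan"):
        rc = run(["qualytics", "operations", op,
                  "--datastore-id", str(datastore_id)])
        if rc != 0:
            return rc  # propagate the failing step's exit code
    return 0
```

Ending a CI script with sys.exit(run_pipeline(42)) fails the job as soon as any step does.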

Background (good when you don't need to wait or you want parallelism)

qualytics operations sync     --datastore-id 42 --background
qualytics operations profile  --datastore-id 42 --background
qualytics operations scan     --datastore-id 42 --background

Then check status whenever you want:

qualytics operations list --datastore-id 42 --status running
qualytics operations get  --id 9876

Across many datastores in one command

operations sync/profile/scan accept a comma-separated list of datastore IDs:

qualytics operations sync    --datastore-id 42,43,44
qualytics operations profile --datastore-id 42,43,44 --ai-effort medium
qualytics operations scan    --datastore-id 42,43,44
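
Under the hood, the comma-separated flag maps onto a single API call whose body carries a datastore_ids array (see Behind the scenes). A hedged sketch of that translation; the helper name is ours, not part of the CLI:

```python
def build_run_payload(op_type: str, datastore_ids: str) -> dict:
    """Turn a '42,43,44'-style flag value into an operations/run body."""
    if op_type not in ("sync", "profile", "scan"):
        raise ValueError(f"unknown operation type: {op_type!r}")
    ids = [int(part) for part in datastore_ids.split(",") if part.strip()]
    return {"type": op_type, "datastore_ids": ids}
```

POSTing the returned dict to /api/operations/run starts one operation covering every listed datastore.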

Behind the scenes

| CLI step | Method | Path | Notes |
| --- | --- | --- | --- |
| operations sync/profile/scan (start) | POST | /api/operations/run | Body includes type (sync, profile, or scan) and datastore_ids. Returns the operation ID. |
| Poll status | GET | /api/operations/{operation_id} | The CLI polls every --poll-interval seconds (default 5) up to --timeout (default 1800). |
| List anomalies | GET | /api/anomalies | Filterable by datastore, status, container, tag, and dates. |

Each operation runs asynchronously on the Qualytics side; the CLI is just polling its status.
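
The poll loop itself is simple. A generic sketch of the pattern implied by the CLI's --poll-interval and --timeout flags (not the CLI's actual source; the injectable sleep/clock parameters are our additions for testability):

```python
import time

def poll_until_done(fetch_status, poll_interval=5.0, timeout=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the operation reaches a terminal result."""
    terminal = {"success", "failure", "aborted"}
    deadline = clock() + timeout
    while clock() < deadline:
        op = fetch_status()  # e.g. GET /api/operations/{operation_id}
        if op.get("result") in terminal:
            return op
        sleep(poll_interval)
    raise TimeoutError(f"operation still running after {timeout}s")
```

Pass a closure over your HTTP client as fetch_status and the same helper serves sync, profile, and scan.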

Python equivalent

import os
import time
import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN    = os.environ["QUALYTICS_TOKEN"]
HEADERS  = {"Authorization": f"Bearer {TOKEN}"}
DATASTORE_ID = 42

def run_op(client, op_type: str, timeout: int = 1800) -> dict:
    r = client.post(
        f"{BASE_URL}/api/operations/run",
        json={"type": op_type, "datastore_ids": [DATASTORE_ID]},
    )
    r.raise_for_status()
    op_id = r.json()["id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        r = client.get(f"{BASE_URL}/api/operations/{op_id}")
        r.raise_for_status()
        op = r.json()
        if op["result"] in ("success", "failure", "aborted"):
            return op
        time.sleep(5)
    raise TimeoutError(f"{op_type} did not finish within {timeout}s")

with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    for op_type in ("sync", "profile", "scan"):
        result = run_op(client, op_type)
        print(f"{op_type:>8}: {result['result']}")

    r = client.get(
        f"{BASE_URL}/api/anomalies",
        params={"datastore_id": DATASTORE_ID, "status": "Active"},
    )
    r.raise_for_status()
    anomalies = r.json()
    print(f"{len(anomalies)} active anomalies")

Variations and advanced usage

AI-assisted profiling

qualytics operations profile --datastore-id 42 --ai-effort high
qualytics operations profile --datastore-id 42 --ai-effort high --infer-as-draft

--infer-as-draft means new AI Managed checks land in Draft status, awaiting human review. Use it whenever a person should sign off on the generated rules before they start producing anomalies. See AI Managed Checks for the full conceptual model.

Auto-resolve fixed anomalies

When a previously failing check passes again, optionally close out its anomaly automatically:

qualytics operations scan --datastore-id 42 --auto-resolve-passed-anomalies

Long timeouts for huge datastores

The default operation timeout is 30 minutes. For warehouses with billions of rows:

qualytics operations scan --datastore-id 42 --timeout 7200

Or run with --background and don't tie up the shell at all.

Targeted runs

Don't profile or scan the whole datastore if only a few tables changed:

qualytics operations scan --datastore-id 42 --container-names "orders,customers"
qualytics operations scan --datastore-id 42 --container-tags "critical"

See Targeted scans by container or tag.

Crontab on Linux/macOS

# Every day at 3 AM, run the full pipeline on datastore 42
# Load credentials from a restricted file instead of inline:
#   echo 'export QUALYTICS_TOKEN=...' > /etc/qualytics-secrets && chmod 600 /etc/qualytics-secrets
0 3 * * * . /etc/qualytics-secrets && QUALYTICS_NO_BANNER=1 /usr/local/bin/qualytics operations sync --datastore-id 42 && /usr/local/bin/qualytics operations profile --datastore-id 42 && /usr/local/bin/qualytics operations scan --datastore-id 42 >> /var/log/qualytics.log 2>&1
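
If the one-line entry gets unwieldy, the pipeline can be factored into a small wrapper that cron calls instead. A sketch (the QUALYTICS_BIN override is our addition for testability, not a CLI feature):

```shell
# run_pipeline DATASTORE_ID — sync, then profile, then scan; any
# failing step short-circuits the rest (via &&).
run_pipeline() {
    bin="${QUALYTICS_BIN:-/usr/local/bin/qualytics}"
    "$bin" operations sync    --datastore-id "$1" &&
    "$bin" operations profile --datastore-id "$1" &&
    "$bin" operations scan    --datastore-id "$1"
}
```

Source this from the cron entry (after loading /etc/qualytics-secrets) and call run_pipeline 42.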

For more sophisticated scheduling and exports, see Scheduled metadata exports.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Operation times out at 30 minutes | Default timeout reached | Add --timeout 7200 or run with --background. |
| 403 Forbidden on operations run | Missing Editor team permission | Confirm team membership; ask a team admin to grant Editor. |
| scan reports "no checks to run" | The datastore has no active checks | Confirm with qualytics checks list --datastore-id 42 --status Active; if empty, profile with --ai-effort high or create checks manually. |
| profile runs forever | No record limit is set and the table is huge | Set --max-records-analyzed-per-partition; see Operations. |
| Anomaly count is huge after the first scan | The first scan after profile inference can produce many anomalies | Triage with Bulk anomaly triage. |