Onboard a single datastore

Goal

You have a database (Postgres, Snowflake, BigQuery, etc.) you want Qualytics to start monitoring. This page walks through the full sequence: create a connection, create a datastore, run the first sync, profile, and scan, and review the first set of anomalies.

Permissions

| Step | Endpoint | Role | Team permission |
| --- | --- | --- | --- |
| Create connection | POST /api/connections | Manager | N/A (system-wide) |
| Test connection | POST /api/connections/{id}/test | Manager | N/A |
| Create datastore | POST /api/datastores | Manager | N/A |
| Run sync / profile / scan | POST /api/operations/run | Member | Editor on the datastore's team |
| List anomalies | GET /api/anomalies | Member | Reporter |

You need the Manager role to create connections and datastores

A plain Member cannot onboard a new source. Either escalate temporarily or have a Manager run the first three steps and a Member run the operations afterwards.

Prerequisites

  • The CLI is installed and authenticated (see Installing Qualytics CLI).
  • Network connectivity from the Qualytics instance to the source database (test from a VM in the same network if unsure).
  • Database credentials with at least SELECT on the schemas you plan to monitor. See Available Connectors for per-database minimum permissions.
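A quick way to confirm the network prerequisite is a plain TCP connect attempted from a machine in the same network as the Qualytics instance. A minimal sketch (the host and port are placeholders for your own database):

```python
# Minimal TCP reachability check; run it from the same network as the
# Qualytics instance. Host and port are placeholders for your database.
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("warehouse.example.com", 5432)` should return True before you attempt `connections create`; if it doesn't, fix the network path first.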

CLI workflow

graph LR
    C[1. Create connection] --> T[2. Test connection]
    T --> D[3. Create datastore]
    D --> S[4. Sync]
    S --> P[5. Profile]
    P --> SC[6. Scan]
    SC --> A[7. Review anomalies]

1. Create the connection

Put credentials in environment variables so they aren't recorded in shell history:

export DB_HOST=warehouse.example.com
export DB_USER=qualytics_reader
export DB_PASSWORD='S3cur3p@ss'

qualytics connections create \
    --type postgresql \
    --name "warehouse-prod-db" \
    --host '${DB_HOST}' \
    --port 5432 \
    --username '${DB_USER}' \
    --password '${DB_PASSWORD}'

The single quotes around '${DB_HOST}' are intentional. They prevent the local shell from expanding the variable; the CLI does the expansion at runtime. Secrets never land on disk in plaintext.
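Conceptually, this deferred expansion works like Python's `string.Template`. A hypothetical sketch of the mechanism (not the CLI's actual code):

```python
# Hypothetical sketch of deferred ${VAR} expansion, not the CLI's real code:
# the shell passes '${DB_HOST}' through literally, and the receiving process
# resolves the placeholder from its own environment at runtime.
import os
from string import Template

def expand(value: str) -> str:
    # Substitute ${NAME} placeholders from the process environment.
    return Template(value).substitute(os.environ)
```

With `DB_HOST=warehouse.example.com` exported, `expand("${DB_HOST}")` resolves to the hostname inside the process, while the literal placeholder is all that ever appears in shell history.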

2. Test the connection

qualytics connections test --id 17

A successful test returns the database version and the reachable schemas. Replace 17 with the connection id returned when you created the connection.

3. Create the datastore

qualytics datastores create \
    --name "warehouse-prod" \
    --connection-name "warehouse-prod-db" \
    --database "analytics" \
    --schema "public" \
    --tags "production,warehouse"

4. Sync the catalog

qualytics operations sync --datastore-id 42

sync populates the list of tables and schemas Qualytics knows about. It must succeed before profile or scan will return useful results.

5. Profile the data

qualytics operations profile --datastore-id 42 --ai-effort medium

Profile inspects each container, infers per-field statistics, and, when --ai-effort is set, suggests quality checks based on observed patterns.

6. Run the first scan

qualytics operations scan --datastore-id 42

scan runs every active check and creates an anomaly for each violation.

7. Review the anomalies

qualytics anomalies list --datastore-id 42 --status Active

Behind the scenes

| CLI step | Method | Path |
| --- | --- | --- |
| connections create | POST | /api/connections |
| connections test | POST | /api/connections/{connection_id}/test |
| datastores create | POST | /api/datastores |
| operations sync/profile/scan | POST | /api/operations/run (with type: sync \| profile \| scan in the body) |
| operations get (polling) | GET | /api/operations/{operation_id} |
| anomalies list | GET | /api/anomalies |

Python equivalent

import os
import time
import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN    = os.environ["QUALYTICS_TOKEN"]
HEADERS  = {"Authorization": f"Bearer {TOKEN}"}

def run_operation(client, datastore_id: int, op_type: str) -> dict:
    """Trigger an operation and poll until it completes."""
    r = client.post(
        f"{BASE_URL}/api/operations/run",
        json={"type": op_type, "datastore_id": datastore_id},
    )
    r.raise_for_status()
    op_id = r.json()["id"]

    while True:
        r = client.get(f"{BASE_URL}/api/operations/{op_id}")
        r.raise_for_status()
        op = r.json()
        if op["result"] in ("success", "failure", "aborted"):
            return op
        time.sleep(5)

with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    # 1. Connection
    r = client.post(f"{BASE_URL}/api/connections", json={
        "type": "postgresql",
        "name": "warehouse-prod-db",
        "host": os.environ["DB_HOST"],
        "port": 5432,
        "username": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
    })
    r.raise_for_status()
    connection_id = r.json()["id"]

    # 2. Test
    client.post(f"{BASE_URL}/api/connections/{connection_id}/test").raise_for_status()

    # 3. Datastore
    r = client.post(f"{BASE_URL}/api/datastores", json={
        "name": "warehouse-prod",
        "connection_id": connection_id,
        "database": "analytics",
        "schema": "public",
        "tags": ["production", "warehouse"],
    })
    r.raise_for_status()
    datastore_id = r.json()["id"]

    # 4-6. Operations pipeline; each step must succeed before the next runs
    for op_type in ("sync", "profile", "scan"):
        op = run_operation(client, datastore_id, op_type)
        print(f"{op_type:>8}: {op['result']}")
        if op["result"] != "success":
            raise SystemExit(f"{op_type} did not succeed; aborting pipeline")

Variations and advanced usage

DFS (S3 / GCS / Azure) instead of JDBC

The pattern is identical; only the connection type and fields change:

qualytics connections create \
    --type amazon-s3 \
    --name "data-lake-prod" \
    --uri 's3://my-bucket/data/' \
    --access-key '${AWS_ACCESS_KEY}' \
    --secret-key '${AWS_SECRET_KEY}'

IAM role authentication

For S3, Athena, and Redshift, prefer an IAM role over static credentials:

qualytics connections create \
    --type amazon-s3 \
    --name "data-lake-prod" \
    --uri 's3://my-bucket/data/' \
    --authentication-type IAM_ROLE \
    --role-arn arn:aws:iam::123456789012:role/QualyticsReader \
    --external-id qualytics-prod-external-id

See IAM Role Authentication.

Dry-run before creating

qualytics connections create --type postgresql ... --dry-run
qualytics datastores create --name ... --connection-name ... --dry-run

--dry-run prints the exact API payload without making the call. Useful for review or to copy into Python.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| connections test fails with Connection refused | Network path missing (no VPN, security group, or peering) | Test connectivity from the same network the Qualytics instance runs in. |
| connections test fails with authentication failed | Wrong username/password, or a missing privilege | Verify the credentials with a SQL client outside Qualytics first. |
| sync succeeds but no containers appear | Schema name typo, or no tables in the schema | Confirm with qualytics datastores get --id 42; the schema is part of the datastore record. |
| profile runs forever on huge tables | Default per-partition record limit is too low or unset | Set --max-records-analyzed-per-partition; see Operations. |
| scan returns "no checks to run" | Profile produced zero AI Managed checks | Re-run with qualytics operations profile --ai-effort high, or bulk-create quality checks manually. |