# Onboard a single datastore

## Goal

You have a database (Postgres, Snowflake, BigQuery, etc.) that you want Qualytics to start monitoring. This page walks through the full sequence: create a connection, create a datastore, run the first sync, profile, and scan, then review the first set of anomalies.
## Permissions

| Step | Endpoint | Role | Team permission |
|---|---|---|---|
| Create connection | `POST /api/connections` | Manager | N/A (system-wide) |
| Test connection | `POST /api/connections/{id}/test` | Manager | N/A |
| Create datastore | `POST /api/datastores` | Manager | N/A |
| Run sync / profile / scan | `POST /api/operations/run` | Member | Editor on the datastore's team |
| List anomalies | `GET /api/anomalies` | Member | Reporter |
> **You need Manager to create connections and datastores.**
> A plain Member cannot onboard a new source. Either escalate temporarily, or have a Manager run the first three steps and a Member run the operations afterwards.
## Prerequisites

- The CLI is installed and authenticated (see Installing Qualytics CLI).
- Network connectivity from the Qualytics instance to the source database (test from a VM in the same network if unsure).
- Database credentials with at least `SELECT` on the schemas you plan to monitor; see Available Connectors for per-database minimum permissions, and the grant sketch below this list.
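For a Postgres source, a minimal read-only grant might look like the following (run as a database admin; the `admin` role name is illustrative, and `qualytics_reader` matches the user used in step 1):

```bash
# Grant schema visibility plus read access on every existing table.
psql -h warehouse.example.com -U admin -d analytics \
  -c 'GRANT USAGE ON SCHEMA public TO qualytics_reader;' \
  -c 'GRANT SELECT ON ALL TABLES IN SCHEMA public TO qualytics_reader;'
```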
## CLI workflow

```mermaid
graph LR
    C[1. Create connection] --> T[2. Test connection]
    T --> D[3. Create datastore]
    D --> S[4. Sync]
    S --> P[5. Profile]
    P --> SC[6. Scan]
    SC --> A[7. Review anomalies]
```
### 1. Create the connection

Put the credentials in environment variables so the command itself doesn't embed them:
```bash
export DB_HOST=warehouse.example.com
export DB_USER=qualytics_reader
export DB_PASSWORD='S3cur3p@ss'

qualytics connections create \
  --type postgresql \
  --name "warehouse-prod-db" \
  --host '${DB_HOST}' \
  --port 5432 \
  --username '${DB_USER}' \
  --password '${DB_PASSWORD}'
```
The single quotes around `'${DB_HOST}'` are intentional. They prevent the local shell from expanding the variable; the CLI does the expansion at runtime, so secrets never land on disk in plaintext.
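A quick local illustration of the quoting difference (not part of the workflow):

```bash
echo '${DB_HOST}'   # single quotes: the literal text ${DB_HOST} is passed through
echo "${DB_HOST}"   # double quotes: the local shell expands it to warehouse.example.com
```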
### 2. Test the connection

A successful test returns the database version and the reachable schemas.
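A minimal sketch of the call (assuming `connections test` takes the connection name via `--name`; check `qualytics connections test --help` for the exact flag):

```bash
qualytics connections test --name "warehouse-prod-db"
```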
### 3. Create the datastore

```bash
qualytics datastores create \
  --name "warehouse-prod" \
  --connection-name "warehouse-prod-db" \
  --database "analytics" \
  --schema "public" \
  --tags "production,warehouse"
```
### 4. Sync the catalog

`sync` populates the list of tables and schemas Qualytics knows about. It must succeed before `profile` or `scan` will return useful results.
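A sketch of the trigger (the `--datastore` flag is an assumption; the datastore selector may be named differently in your CLI version):

```bash
qualytics operations sync --datastore "warehouse-prod"
```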
### 5. Profile the data

`profile` inspects each container, infers per-field statistics, and (when `--ai-effort` is enabled) suggests quality checks based on observed patterns.
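Same caveat about the datastore flag; `--ai-effort high` is the flag shown in Troubleshooting below:

```bash
qualytics operations profile \
  --datastore "warehouse-prod" \
  --ai-effort high
```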
### 6. Run the first scan

`scan` runs every active check and creates an anomaly for each violation.
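The trigger follows the same shape as the previous two operations (same caveat about the datastore flag):

```bash
qualytics operations scan --datastore "warehouse-prod"
```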
### 7. Review the anomalies
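Each violation from the scan is now an anomaly. A sketch of listing them from the CLI (the `anomalies list` subcommand maps to `GET /api/anomalies` per the table below; the filter flag is an assumption):

```bash
qualytics anomalies list --datastore "warehouse-prod"
```

Review each anomaly and decide whether it reflects a real data issue or a check that needs tuning before you schedule recurring scans.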
## Behind the scenes

| CLI step | Method | Path |
|---|---|---|
| `connections create` | POST | `/api/connections` |
| `connections test` | POST | `/api/connections/{connection_id}/test` |
| `datastores create` | POST | `/api/datastores` |
| `operations sync` / `profile` / `scan` | POST | `/api/operations/run` (with `type: sync \| profile \| scan` in the body) |
| `operations get` (polling) | GET | `/api/operations/{operation_id}` |
| `anomalies list` | GET | `/api/anomalies` |
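Since the CLI maps directly onto these endpoints, you can drive any step with plain HTTP. For example, triggering a scan with curl (the request body shape is taken from the Python example below; `42` is a placeholder datastore id):

```bash
curl -X POST "$QUALYTICS_URL/api/operations/run" \
  -H "Authorization: Bearer $QUALYTICS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "scan", "datastore_id": 42}'
```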
## Python equivalent

```python
import os
import time

import httpx

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN = os.environ["QUALYTICS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def run_operation(client, datastore_id: int, op_type: str) -> dict:
    """Trigger an operation and poll until it completes."""
    r = client.post(
        f"{BASE_URL}/api/operations/run",
        json={"type": op_type, "datastore_id": datastore_id},
    )
    r.raise_for_status()
    op_id = r.json()["id"]
    while True:
        r = client.get(f"{BASE_URL}/api/operations/{op_id}")
        r.raise_for_status()
        op = r.json()
        if op["result"] in ("success", "failure", "aborted"):
            return op
        time.sleep(5)


with httpx.Client(headers=HEADERS, timeout=60.0) as client:
    # 1. Connection
    r = client.post(f"{BASE_URL}/api/connections", json={
        "type": "postgresql",
        "name": "warehouse-prod-db",
        "host": os.environ["DB_HOST"],
        "port": 5432,
        "username": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
    })
    r.raise_for_status()
    connection_id = r.json()["id"]

    # 2. Test
    client.post(f"{BASE_URL}/api/connections/{connection_id}/test").raise_for_status()

    # 3. Datastore
    r = client.post(f"{BASE_URL}/api/datastores", json={
        "name": "warehouse-prod",
        "connection_id": connection_id,
        "database": "analytics",
        "schema": "public",
        "tags": ["production", "warehouse"],
    })
    r.raise_for_status()
    datastore_id = r.json()["id"]

    # 4-6. Operations pipeline
    for op_type in ("sync", "profile", "scan"):
        op = run_operation(client, datastore_id, op_type)
        print(f"{op_type:>8}: {op['result']}")
```
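To run it end to end (the script name and URL are placeholders):

```bash
export QUALYTICS_URL=https://qualytics.example.com
export QUALYTICS_TOKEN='<api-token>'
export DB_HOST=warehouse.example.com DB_USER=qualytics_reader DB_PASSWORD='S3cur3p@ss'
python onboard.py
```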
## Variations and advanced usage

### DFS (S3 / GCS / Azure) instead of JDBC

The pattern is identical; only the connection type and fields change:

```bash
qualytics connections create \
  --type amazon-s3 \
  --name "data-lake-prod" \
  --uri 's3://my-bucket/data/' \
  --access-key '${AWS_ACCESS_KEY}' \
  --secret-key '${AWS_SECRET_KEY}'
```
### IAM role authentication

For S3, Athena, and Redshift, prefer an IAM role over static credentials:

```bash
qualytics connections create \
  --type amazon-s3 \
  --name "data-lake-prod" \
  --uri 's3://my-bucket/data/' \
  --authentication-type IAM_ROLE \
  --role-arn arn:aws:iam::123456789012:role/QualyticsReader \
  --external-id qualytics-prod-external-id
```
### Dry-run before creating

```bash
qualytics connections create --type postgresql ... --dry-run
qualytics datastores create --name ... --connection-name ... --dry-run
```

`--dry-run` prints the exact API payload without making the call. Useful for review or to copy into Python.
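For example, to capture the datastore payload for review (assuming `--dry-run` writes the payload to stdout):

```bash
qualytics datastores create \
  --name "warehouse-prod" \
  --connection-name "warehouse-prod-db" \
  --database "analytics" \
  --schema "public" \
  --dry-run > datastore-payload.json
```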
## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `connections test` fails with `Connection refused` | Network path missing (no VPN, security group, or peering) | Test connectivity from the same network the Qualytics instance runs in. |
| `connections test` fails with `authentication failed` | Wrong username/password, or a missing privilege | Verify the credentials with a SQL client outside Qualytics first. |
| `sync` succeeds but no containers appear | Schema name typo, or no tables in the schema | Confirm with `qualytics datastores get --id 42`; the schema is part of the datastore record. |
| `profile` runs forever on huge tables | Per-partition record limit unset or too high, so profile reads every record | Set `--max-records-analyzed-per-partition`; see Operations. |
| `scan` returns "no checks to run" | Profile produced zero AI Managed checks | Re-profile with `qualytics operations profile --ai-effort high`, or Bulk-create quality checks manually. |
## Related
- Connections command reference
- Datastores command reference
- Operations command reference
- Bulk datastore onboarding: when you have many to onboard at once.
- Daily sync, profile, and scan: the recurring follow-up to this one-time setup.