Daily triage automation
Goal
Automate the parts of anomaly triage that are clearly mechanical: archiving stale Active anomalies, auto-acknowledging known-good patterns, and routing fresh anomalies to the right on-call person. The goal is to keep the queue small enough that human attention is reserved for the cases that need it.
Permissions
| Step | Endpoint | Role | Team permission |
|---|---|---|---|
| List anomalies | GET /api/anomalies | Member | N/A |
| Bulk update | PATCH /api/anomalies | Member | Editor |
| Bulk archive | DELETE /api/anomalies?ids=... | Member | Author |
| Look up users (for assignment) | GET /api/users | Member | N/A |
Prerequisites
- The CLI is installed and authenticated as a service account with the permissions above.
- A scheduled runner (cron, systemd timer, or a CI scheduled workflow).
- A clear policy for what's automatic vs. human (otherwise you'll automate away signal you needed).
Automate carefully
Automatically archiving "stale" anomalies is convenient until it hides a real but unattended problem. Pair every auto-archive rule with an alert that triggers if archive volume spikes.
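One way to implement that alert is a guard placed just before the archive call in the auto-archive script below; OLD_IDS comes from that script, and the threshold and $SLACK_WEBHOOK value are placeholders to tune:

```bash
# Hypothetical guard: if today's stale count is abnormally high, warn and skip archiving.
STALE_COUNT=$(echo "$OLD_IDS" | tr ',' '\n' | wc -l)
THRESHOLD=50  # placeholder; set from your normal daily archive volume
if (( STALE_COUNT > THRESHOLD )); then
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"Auto-archive skipped: $STALE_COUNT stale anomalies (threshold $THRESHOLD)\"}" \
    "$SLACK_WEBHOOK"
  exit 0
fi
```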
CLI workflow
graph LR
  Cron[Daily cron] --> Stale["Archive stale Active >30d"]
  Cron --> Routine[Auto-acknowledge known patterns]
  Cron --> Fresh[Assign fresh ones to on-call]
  Cron --> Report[Email summary]
Auto-archive stale Active anomalies
#!/usr/bin/env bash
set -euo pipefail

THIRTY_DAYS_AGO=$(date -u -d '30 days ago' +%Y-%m-%d 2>/dev/null \
  || date -u -v-30d +%Y-%m-%d)  # macOS fallback

OLD_IDS=$(qualytics anomalies list \
  --datastore-id 42 \
  --status Active \
  --end-date "$THIRTY_DAYS_AGO" \
  --format json \
  | jq -r '.[].id' | paste -sd, -)

if [[ -n "$OLD_IDS" ]]; then
  qualytics anomalies archive --ids "$OLD_IDS" --status Discarded
  echo "Archived $(echo "$OLD_IDS" | tr ',' '\n' | wc -l) stale anomalies"
fi
Auto-acknowledge a known-good pattern
If a specific check generates anomalies you always investigate but never resolve until much later, auto-acknowledge them so they leave the unattended queue but stay visible:
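# ONCALL_USER_ID is assumed to be set in the environment (for example, exported from the same secrets file the cron entry sources).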
ACK_IDS=$(qualytics anomalies list \
  --datastore-id 42 \
  --check-id 555 \
  --status Active \
  --format json | jq -r '.[].id' | paste -sd, -)

if [[ -n "$ACK_IDS" ]]; then
  qualytics anomalies update \
    --ids "$ACK_IDS" \
    --status Acknowledged \
    --description "Auto-acknowledged for check 555 (latency-sensitive; investigated weekly)" \
    --assignee-ids "$ONCALL_USER_ID" \
    --tags "auto-ack"
fi
Route fresh anomalies to the current on-call
TODAY=$(date -u +%Y-%m-%d)
ONCALL=$(./scripts/lookup-oncall-user-id.sh)  # however you resolve this; one sketch below

NEW_IDS=$(qualytics anomalies list \
  --datastore-id 42 \
  --status Active \
  --start-date "$TODAY" \
  --format json | jq -r '.[].id' | paste -sd, -)

if [[ -n "$NEW_IDS" ]]; then
  qualytics anomalies update \
    --ids "$NEW_IDS" \
    --status Active \
    --assignee-ids "$ONCALL"
fi
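The lookup script is yours to define, since on-call rotations live in different places. A minimal sketch of scripts/lookup-oncall-user-id.sh, assuming the rota hands you a display name in a hypothetical ONCALL_NAME variable and that GET /api/users returns a JSON array with id and name fields (check your instance's actual response shape):

```bash
#!/usr/bin/env bash
set -euo pipefail
# Map an on-call display name to a Qualytics user id via GET /api/users.
# ONCALL_NAME and the id/name fields are assumptions; adapt to your rota and API response.
curl -s -H "Authorization: Bearer $QUALYTICS_TOKEN" "$QUALYTICS_URL/api/users" \
  | jq -r --arg name "$ONCALL_NAME" '.[] | select(.name == $name) | .id' \
  | head -n1
```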
Cron entry
# Run daily at 6 AM
# Load credentials from a restricted file instead of inline:
# echo 'export QUALYTICS_TOKEN=...' > /etc/qualytics-secrets && chmod 600 /etc/qualytics-secrets
0 6 * * * cd /opt/qualytics && . /etc/qualytics-secrets && QUALYTICS_NO_BANNER=1 ./triage.sh >> /var/log/qualytics-triage.log 2>&1
Behind the scenes
Same endpoints as Bulk anomaly triage, called on a schedule:
| Action | Method | Path |
|---|---|---|
| List by date / status | GET | /api/anomalies?... |
| Bulk acknowledge / reassign | PATCH | /api/anomalies |
| Bulk archive | DELETE | /api/anomalies?ids=...&archive=true&status=... |
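If you would rather call those endpoints directly (for example from a runner without the CLI), the equivalent curl calls look roughly like this, reusing the QUALYTICS_URL and QUALYTICS_TOKEN variables from the Python example below; the anomaly ids and user id are placeholders:

```bash
# List Active anomalies older than the cutoff
curl -s -H "Authorization: Bearer $QUALYTICS_TOKEN" \
  "$QUALYTICS_URL/api/anomalies?datastore_id=42&status=Active&end_date=$THIRTY_DAYS_AGO"

# Bulk acknowledge / reassign
curl -s -X PATCH -H "Authorization: Bearer $QUALYTICS_TOKEN" -H 'Content-Type: application/json' \
  -d '{"ids": [101, 102], "status": "Acknowledged", "assignee_ids": [18]}' \
  "$QUALYTICS_URL/api/anomalies"

# Bulk archive
curl -s -X DELETE -H "Authorization: Bearer $QUALYTICS_TOKEN" \
  "$QUALYTICS_URL/api/anomalies?ids=101,102&archive=true&status=Discarded"
```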
Python equivalent
A more sophisticated triage script is much easier to write in Python than in Bash. This one auto-archives stale anomalies and routes fresh ones to the on-call, with a final summary:
import os
import httpx
from datetime import datetime, timedelta, timezone
BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN = os.environ["QUALYTICS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
DATASTORE_ID = 42
ONCALL_USER_ID = 18
today = datetime.now(timezone.utc).date()
stale_cutoff = (today - timedelta(days=30)).isoformat()
today_iso = today.isoformat()
with httpx.Client(headers=HEADERS, timeout=30.0) as client:
    # Stale archive
    stale = client.get(f"{BASE_URL}/api/anomalies", params={
        "datastore_id": DATASTORE_ID, "status": "Active",
        "end_date": stale_cutoff,
    }).json()
    stale_ids = [a["id"] for a in stale]
    if stale_ids:
        client.request(
            "DELETE", f"{BASE_URL}/api/anomalies",
            params={"ids": ",".join(map(str, stale_ids)),
                    "archive": "true", "status": "Discarded"},
        ).raise_for_status()

    # Route fresh
    fresh = client.get(f"{BASE_URL}/api/anomalies", params={
        "datastore_id": DATASTORE_ID, "status": "Active",
        "start_date": today_iso,
    }).json()
    fresh_ids = [a["id"] for a in fresh]
    if fresh_ids:
        client.patch(f"{BASE_URL}/api/anomalies", json={
            "ids": fresh_ids, "status": "Active",
            "assignee_ids": [ONCALL_USER_ID],
        }).raise_for_status()

print(f"archived {len(stale_ids)} stale, assigned {len(fresh_ids)} fresh to user {ONCALL_USER_ID}")
Variations and advanced usage
Slack the daily summary
Pipe a humanized summary into a webhook:
SUMMARY=$(qualytics anomalies list --datastore-id 42 --status Active --format json \
  | jq -r 'group_by(.container.name) | map("\(.[0].container.name): \(length)") | join("\\n")')

curl -s -X POST -H 'Content-Type: application/json' \
  -d "{\"text\": \"Active anomalies by container:\n$SUMMARY\"}" \
  "$SLACK_WEBHOOK"
Open Jira tickets for fresh high-severity anomalies
If you've configured Jira via the Ticketing integration, tagged anomalies can flow into Jira automatically through Flows. The CLI's job is then just routing and tagging.
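For example, a sketch that tags today's anomalies from a check you treat as high severity so the Flow can open tickets for them; check id 777 and the jira tag name are placeholders for your own configuration:

```bash
# Hypothetical: tag fresh anomalies from a critical check (777) for the Jira Flow to pick up.
TODAY=$(date -u +%Y-%m-%d)
JIRA_IDS=$(qualytics anomalies list \
  --datastore-id 42 \
  --check-id 777 \
  --status Active \
  --start-date "$TODAY" \
  --format json | jq -r '.[].id' | paste -sd, -)
if [[ -n "$JIRA_IDS" ]]; then
  qualytics anomalies update --ids "$JIRA_IDS" --status Active --tags "jira"
fi
```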
Per-team triage rules
Run the same script with different filters per team, each with its own assignees and on-call user. Use --container-tags to scope.
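A sketch of one per-team pass, assuming containers carry a team:<name> tag and using placeholder on-call user ids:

```bash
# Hypothetical per-team triage: scope each pass with --container-tags, assign that team's on-call.
declare -A TEAM_ONCALL=([payments]=18 [ingestion]=22 [reporting]=31)  # placeholder user ids
for team in "${!TEAM_ONCALL[@]}"; do
  TEAM_IDS=$(qualytics anomalies list \
    --datastore-id 42 \
    --status Active \
    --container-tags "team:$team" \
    --format json | jq -r '.[].id' | paste -sd, -)
  if [[ -n "$TEAM_IDS" ]]; then
    qualytics anomalies update --ids "$TEAM_IDS" --status Active --assignee-ids "${TEAM_ONCALL[$team]}"
  fi
done
```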
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Cron job runs but produces empty output | The CLI banner is being captured before the script logic | Set QUALYTICS_NO_BANNER=1 and CI=true in the cron env. |
| Daily summary always shows zeros | Wrong timezone in date (cron uses system TZ) | Anchor to UTC explicitly: date -u +%Y-%m-%d. |
| Auto-archive surge after a deploy | Real anomalies got swept up | Tighten the filter: only archive ones with --check-id in a known-noisy list, not all stale. |
| Bash script fails on macOS but works on Linux | date -d '30 days ago' is GNU-specific | Use date -u -v-30d +%Y-%m-%d on macOS, or run in a Linux container. |
Related
- Bulk anomaly triage: the manual version of this scenario.
- Anomalies command reference
- Scheduled metadata exports: a different kind of automation around anomalies (snapshots for audit/DR).