Daily triage automation

Goal

Automate the parts of anomaly triage that are clearly mechanical: archiving stale Active anomalies, auto-acknowledging known-good patterns, and routing fresh anomalies to the right on-call person. The goal is to keep the queue small enough that human attention is reserved for the cases that need it.

Permissions

Step                            Endpoint                       Role    Team permission
List anomalies                  GET /api/anomalies             Member  N/A
Bulk update                     PATCH /api/anomalies           Member  Editor
Bulk archive                    DELETE /api/anomalies?ids=...  Member  Author
Look up users (for assignment)  GET /api/users                 Member  N/A

Prerequisites

  • The CLI is installed and authenticated as a service account with the permissions above.
  • A scheduled runner (cron, systemd timer, or a CI scheduled workflow).
  • A clear policy for what's automatic vs. human (otherwise you'll automate away signal you needed).

Automate carefully

Auto-archiving "stale" anomalies is convenient until it hides a real but unattended problem. Pair every auto-archive rule with an alert that fires when archive volume spikes.
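One way to wire that guard, as a sketch: count the candidate ids before archiving and stop past a threshold. `OLD_IDS` refers to the comma-separated list built by the archive script below; the threshold of 50 is an assumption to tune per datastore.

```shell
#!/usr/bin/env bash
# Guard sketch: bail out (and alert) instead of archiving a suspiciously
# large batch. THRESHOLD is an assumed per-datastore tuning knob.
THRESHOLD=50

count_ids() {                      # count a comma-separated id list; "" is 0
    [[ -z "${1:-}" ]] && { echo 0; return; }
    echo "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

N=$(count_ids "${OLD_IDS:-}")
if (( N > THRESHOLD )); then
    echo "refusing to auto-archive $N anomalies (> $THRESHOLD); page a human" >&2
    # exit 1   # or post to your alerting webhook here instead
fi
```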

CLI workflow

graph LR
    Cron[Daily cron] --> Stale[Archive stale Active >30d]
    Cron --> Routine[Auto-acknowledge known patterns]
    Cron --> Fresh[Assign fresh ones to on-call]
    Cron --> Report[Email summary]

Auto-archive stale Active anomalies

#!/usr/bin/env bash
set -euo pipefail

THIRTY_DAYS_AGO=$(date -u -d '30 days ago' +%Y-%m-%d 2>/dev/null \
                || date -u -v-30d +%Y-%m-%d)  # macOS fallback

OLD_IDS=$(qualytics anomalies list \
    --datastore-id 42 \
    --status Active \
    --end-date "$THIRTY_DAYS_AGO" \
    --format json \
  | jq -r '.[].id' | paste -sd, -)

if [[ -n "$OLD_IDS" ]]; then
    qualytics anomalies archive --ids "$OLD_IDS" --status Discarded
    echo "Archived $(echo "$OLD_IDS" | tr ',' '\n' | wc -l) stale anomalies"
fi

Auto-acknowledge a known-good pattern

If a specific check generates anomalies you always investigate but never resolve until much later, auto-acknowledge them so they leave the unattended queue but stay visible:

: "${ONCALL_USER_ID:?export the user id that should own auto-acked anomalies}"

ACK_IDS=$(qualytics anomalies list \
    --datastore-id 42 \
    --check-id 555 \
    --status Active \
    --format json | jq -r '.[].id' | paste -sd, -)

if [[ -n "$ACK_IDS" ]]; then
    qualytics anomalies update \
        --ids "$ACK_IDS" \
        --status Acknowledged \
        --description "Auto-acknowledged for check 555 (latency-sensitive; investigated weekly)" \
        --assignee-ids "$ONCALL_USER_ID" \
        --tags "auto-ack"
fi

Route fresh anomalies to the current on-call

TODAY=$(date -u +%Y-%m-%d)
ONCALL=$(./scripts/lookup-oncall-user-id.sh)  # however you resolve this

NEW_IDS=$(qualytics anomalies list \
    --datastore-id 42 \
    --status Active \
    --start-date "$TODAY" \
    --format json | jq -r '.[].id' | paste -sd, -)

if [[ -n "$NEW_IDS" ]]; then
    # Status stays Active; this update only changes the assignee.
    qualytics anomalies update \
        --ids "$NEW_IDS" \
        --status Active \
        --assignee-ids "$ONCALL"
fi

Cron entry

# Run daily at 6 AM
# Load credentials from a restricted file instead of inline:
#   echo 'export QUALYTICS_TOKEN=...' > /etc/qualytics-secrets && chmod 600 /etc/qualytics-secrets
0 6 * * * cd /opt/qualytics && . /etc/qualytics-secrets && QUALYTICS_NO_BANNER=1 ./triage.sh >> /var/log/qualytics-triage.log 2>&1

Behind the scenes

Same endpoints as Bulk anomaly triage, called on a schedule:

Action                       Method  Path
List by date / status        GET     /api/anomalies?...
Bulk acknowledge / reassign  PATCH   /api/anomalies
Bulk archive                 DELETE  /api/anomalies?ids=...&archive=true&status=...
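To make the wire format concrete, the archive call can be sketched with curl. The base URL fallback, the token variable, and the example ids are assumptions; nothing is actually sent here.

```shell
# Construct the same DELETE the CLI issues behind the scenes.
BASE="${QUALYTICS_URL:-https://acme.qualytics.io}"
URL="$BASE/api/anomalies?ids=101,102&archive=true&status=Discarded"
echo "DELETE $URL"
# To actually send it:
#   curl -s -X DELETE -H "Authorization: Bearer $QUALYTICS_TOKEN" "$URL"
```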

Python equivalent

A more sophisticated triage script is much easier in Python than Bash. This one auto-archives stale anomalies and routes fresh ones to the on-call, with a final summary:

import os
import httpx
from datetime import datetime, timedelta, timezone

BASE_URL = os.environ["QUALYTICS_URL"].rstrip("/")
TOKEN    = os.environ["QUALYTICS_TOKEN"]
HEADERS  = {"Authorization": f"Bearer {TOKEN}"}
DATASTORE_ID = 42
ONCALL_USER_ID = 18

today        = datetime.now(timezone.utc).date()
stale_cutoff = (today - timedelta(days=30)).isoformat()
today_iso    = today.isoformat()

with httpx.Client(headers=HEADERS, timeout=30.0) as client:
    # Stale archive
    resp = client.get(f"{BASE_URL}/api/anomalies", params={
        "datastore_id": DATASTORE_ID, "status": "Active",
        "end_date":     stale_cutoff,
    })
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    stale_ids = [a["id"] for a in resp.json()]
    if stale_ids:
        client.request(
            "DELETE", f"{BASE_URL}/api/anomalies",
            params={"ids": ",".join(map(str, stale_ids)),
                    "archive": "true", "status": "Discarded"},
        ).raise_for_status()

    # Route fresh
    resp = client.get(f"{BASE_URL}/api/anomalies", params={
        "datastore_id": DATASTORE_ID, "status": "Active",
        "start_date":   today_iso,
    })
    resp.raise_for_status()
    fresh_ids = [a["id"] for a in resp.json()]
    if fresh_ids:
        client.patch(f"{BASE_URL}/api/anomalies", json={
            "ids": fresh_ids, "status": "Active",
            "assignee_ids": [ONCALL_USER_ID],
        }).raise_for_status()

    print(f"archived {len(stale_ids)} stale, assigned {len(fresh_ids)} fresh to user {ONCALL_USER_ID}")

Variations and advanced usage

Slack the daily summary

Pipe a humanized summary into a webhook:

SUMMARY=$(qualytics anomalies list --datastore-id 42 --status Active --format json \
          | jq -r 'group_by(.container.name) | map("\(.[0].container.name): \(length)") | join("\n")')

curl -s -X POST -H 'Content-Type: application/json' \
     -d "$(jq -n --arg s "$SUMMARY" '{text: ("Active anomalies by container:\n" + $s)}')" \
     "$SLACK_WEBHOOK"

Open Jira tickets for fresh high-severity anomalies

If you've configured Jira via the Ticketing integration, tagged anomalies can flow into Jira automatically through Flows. The CLI's job is then just routing and tagging.
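A hedged sketch of the CLI half: pick out the ids worth a ticket, then tag them so the Flow reacts. The `weight` field, the threshold of 8, and the `jira-intake` tag name are all assumptions to adapt to your instance.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Filter a JSON anomaly list down to the ids at or above a severity floor.
high_severity_ids() {
    jq -r --argjson floor "$1" '.[] | select(.weight >= $floor) | .id' \
        | paste -sd, -
}

# FRESH_JSON=$(qualytics anomalies list --datastore-id 42 --status Active \
#     --start-date "$(date -u +%Y-%m-%d)" --format json)
# IDS=$(printf '%s' "$FRESH_JSON" | high_severity_ids 8)
# [[ -n "$IDS" ]] && qualytics anomalies update --ids "$IDS" --tags "jira-intake"
```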

Per-team triage rules

Run the same script with different filters per team, each with its own assignees and on-call user. Use --container-tags to scope.
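A minimal sketch of that pattern; the team tags (payments, ingest) and the on-call user ids are placeholders for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# team:on-call-user-id pairs; both sides are placeholders.
TEAMS="payments:18 ingest:27"

triage_team() {   # $1 = container tag, $2 = assignee user id
    echo "triaging containers tagged '$1' -> assignee $2"
    # qualytics anomalies list --datastore-id 42 --status Active \
    #     --container-tags "$1" --format json | jq -r '.[].id' | paste -sd, -
    # ...then update/archive with the team's own rules, as in the scripts above.
}

for pair in $TEAMS; do
    triage_team "${pair%%:*}" "${pair##*:}"
done
```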

Troubleshooting

  • Cron job runs but produces empty output. Likely cause: the CLI banner is being captured ahead of the script's real output. Fix: set QUALYTICS_NO_BANNER=1 and CI=true in the cron environment.
  • Daily summary always shows zeros. Likely cause: wrong timezone in date (cron uses the system TZ). Fix: anchor to UTC explicitly with date -u +%Y-%m-%d.
  • Auto-archive surge after a deploy. Likely cause: real anomalies got swept up. Fix: tighten the filter to archive only anomalies whose --check-id is in a known-noisy list, not everything stale.
  • Bash script fails on macOS but works on Linux. Likely cause: date -d '30 days ago' is GNU-specific. Fix: use date -u -v-30d +%Y-%m-%d on macOS, or run in a Linux container.