Incident Response

Procedures for identifying, triaging, and resolving production incidents on the Sartiq platform.


Severity Levels

Level  Name      Description                                      Response Time           Examples
S1     Critical  Service fully down, data loss risk               Immediate (< 30 min)    Database corruption, full outage
S2     Major     Core feature broken or degraded for all users    < 2 hours               Generation pipeline stuck, auth broken
S3     Minor     Non-critical feature broken, workaround exists   Next business day       Export failures, styling errors
S4     Low       Cosmetic or low-impact issue                     Current or next sprint  UI glitch, non-blocking warning
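
For quick reference in scripts, the response-time column can be encoded as a lookup. A minimal sketch; response_time is a hypothetical helper, not part of any existing Sartiq tooling:

```shell
# Hypothetical lookup: map a severity level to its target response time
# from the table above.
response_time() {
  case "$1" in
    S1) echo "immediate (< 30 min)" ;;
    S2) echo "< 2 hours" ;;
    S3) echo "next business day" ;;
    S4) echo "current or next sprint" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```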

Triage Checklist

When an incident is detected (alert, user report, or manual observation), follow this order:

1. Identify scope

  • Which environment? (production, staging, dev)
  • Which service? (backend, compute, webapp)
  • How many users affected?

2. Check service health

# Backend
curl -f https://api.sartiq.com/api/v1/utils/health-check/

# Compute
curl -f https://compute-api.sartiq.com/health

# Check container status (SSH into VM first)
docker compose ps
docker compose logs --tail=50 backend
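
The endpoint checks above can be wrapped into a one-shot status summary. A minimal sketch: check is a hypothetical helper (not part of the platform tooling) that runs any probe command and prints PASS or FAIL for it.

```shell
# Hypothetical helper: run a probe command and print PASS/FAIL for it.
check() {
  name=$1; shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
  fi
}

# Usage, with the health-check endpoints from this step:
#   check backend curl -fsS --max-time 10 https://api.sartiq.com/api/v1/utils/health-check/
#   check compute curl -fsS --max-time 10 https://compute-api.sartiq.com/health
```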

3. Check database

# Verify PostgreSQL is responding
docker compose exec db pg_isready -U postgres -d app

# Check active connections
docker compose exec db psql -U postgres -d app -c "SELECT count(*) FROM pg_stat_activity;"

# Check for long-running queries
docker compose exec db psql -U postgres -d app -c "
  SELECT pid, now() - pg_stat_activity.query_start AS duration, query
  FROM pg_stat_activity
  WHERE state != 'idle' AND query_start < now() - interval '1 minute'
  ORDER BY duration DESC;
"
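
During recovery it is often useful to wait for the database to come back rather than re-running pg_isready by hand. A sketch, assuming the same compose service names as above; wait_for_db is a hypothetical helper that retries any command once per second:

```shell
# Hypothetical helper: retry a command once per second, up to a limit.
wait_for_db() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" > /dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timeout"
  return 1
}

# Usage, with the pg_isready check from this step:
#   wait_for_db 30 docker compose exec db pg_isready -U postgres -d app
```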

4. Check logs

  • Better Stack: Check the relevant source for error spikes
  • Discord: Check error/warning webhook channels
  • Container logs: docker compose logs --tail=100 --timestamps <service>
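
When eyeballing container logs for a spike, a quick count is faster than scrolling. A minimal sketch; error_count is a hypothetical filter, not an existing tool:

```shell
# Hypothetical filter: count case-insensitive "error" lines on stdin.
# grep exits non-zero when the count is 0, so "|| true" keeps the
# pipeline's exit status clean.
error_count() {
  grep -ci "error" || true
}

# Usage with the container logs command above:
#   docker compose logs --tail=100 backend | error_count
```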

Communication

Channel            Purpose
Discord #errors    Automated error notifications (S1/S2)
Discord #warnings  Automated warning notifications (S3)
Discord #ops       Operator signals, manual coordination

For S1/S2 incidents, post a brief update in #ops immediately:

[INCIDENT] S1 - Backend DB unreachable on production. Investigating.
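
The same update can be posted from a script via a Discord webhook. A sketch, assuming the #ops webhook URL is available in an OPS_WEBHOOK_URL environment variable (that variable name is an assumption, not defined in this doc); incident_update only builds the JSON payload:

```shell
# Hypothetical helper: build the Discord JSON payload for an incident update.
# Note: the message must not itself contain double quotes, since this
# sketch does no JSON escaping.
incident_update() {
  sev=$1; shift
  printf '{"content": "[INCIDENT] %s - %s"}' "$sev" "$*"
}

# Usage (OPS_WEBHOOK_URL is an assumed environment variable):
#   incident_update S1 "Backend DB unreachable on production. Investigating." \
#     | curl -fsS -H 'Content-Type: application/json' -d @- "$OPS_WEBHOOK_URL"
```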


Recovery Runbooks

Runbook      When to use
DB Recovery  Database corruption, accidental data loss, need to restore from snapshot

Post-Incident Review

After resolving any S1 or S2 incident, document:

  1. Timeline -- When detected, when resolved, total downtime
  2. Root cause -- What failed and why
  3. Impact -- Users affected, data lost (if any)
  4. Resolution -- Steps taken to restore service
  5. Action items -- What to change to prevent recurrence
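
The five items above can be scaffolded into a review file so none get skipped. A minimal sketch; the helper name, file name, and markdown heading style are assumptions:

```shell
# Hypothetical helper: emit an empty post-incident review template
# covering the five items listed above.
new_review() {
  cat <<'EOF'
# Post-Incident Review

## Timeline
(when detected, when resolved, total downtime)

## Root cause
(what failed and why)

## Impact
(users affected, data lost, if any)

## Resolution
(steps taken to restore service)

## Action items
(what to change to prevent recurrence)
EOF
}

# Usage (file name is an assumption):
#   new_review > post-incident-review.md
```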