# Incident Response
Procedures for identifying, triaging, and resolving production incidents on the Sartiq platform.
## Severity Levels
| Level | Name | Description | Response Time | Examples |
|---|---|---|---|---|
| S1 | Critical | Service fully down, data loss risk | Immediate (< 30 min) | Database corruption, full outage |
| S2 | Major | Core feature broken, degraded for all users | < 2 hours | Generation pipeline stuck, auth broken |
| S3 | Minor | Non-critical feature broken, workaround exists | Next business day | Export failures, styling errors |
| S4 | Low | Cosmetic or low-impact issue | Current or next sprint | UI glitch, non-blocking warning |
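The table above can be encoded as a small helper so scripts and bots report a consistent deadline. This is a hypothetical sketch (`response_deadline` is not part of the platform tooling):

```shell
# Hypothetical helper: map a severity level to its target response time,
# mirroring the severity table above.
response_deadline() {
  case "$1" in
    S1) echo "immediate (< 30 min)" ;;
    S2) echo "< 2 hours" ;;
    S3) echo "next business day" ;;
    S4) echo "current or next sprint" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```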
## Triage Checklist
When an incident is detected (alert, user report, or manual observation), follow this order:
### 1. Identify scope
- Which environment? (production, staging, dev)
- Which service? (backend, compute, webapp)
- How many users affected?
### 2. Check service health
```shell
# Backend
curl -f https://api.sartiq.com/api/v1/utils/health-check/

# Compute
curl -f https://compute-api.sartiq.com/health

# Check container status (SSH into VM first)
docker compose ps
docker compose logs --tail=50 backend
```
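During a hectic triage it helps to run the health probes in one pass and get a pass/fail summary. A minimal sketch (`probe` is a hypothetical helper, not existing tooling; it wraps any command and reports OK/FAIL):

```shell
# Hypothetical sweep helper: run a probe command, print OK/FAIL by label.
probe() {  # probe <label> <command...>
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Usage, with the endpoints from the checklist above:
# probe backend curl -fsS --max-time 10 https://api.sartiq.com/api/v1/utils/health-check/
# probe compute curl -fsS --max-time 10 https://compute-api.sartiq.com/health
```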
### 3. Check database
```shell
# Verify PostgreSQL is responding
docker compose exec db pg_isready -U postgres -d app

# Check active connections
docker compose exec db psql -U postgres -d app -c "SELECT count(*) FROM pg_stat_activity;"

# Check for long-running queries
docker compose exec db psql -U postgres -d app -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND query_start < now() - interval '1 minute'
ORDER BY duration DESC;
"
```
### 4. Check logs
- Better Stack: Check the relevant source for error spikes
- Discord: Check error/warning webhook channels
- Container logs:
```shell
docker compose logs --tail=100 --timestamps <service>
```
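Raw container logs are noisy during an incident; filtering for error-level lines narrows them quickly. A minimal sketch, assuming the service names from step 1 (`errlines` is a hypothetical helper, and the grep pattern is a guess at the log format):

```shell
# Hypothetical helper: surface error-level lines from a service's recent logs.
errlines() {  # errlines <service>
  docker compose logs --since=1h --timestamps "$1" 2>&1 \
    | grep -iE 'error|exception|traceback'
}

# Usage: errlines backend
```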
## Communication
| Channel | Purpose |
|---|---|
| Discord #errors | Automated error notifications (S1/S2) |
| Discord #warnings | Automated warning notifications (S3) |
| Discord #ops | Operator signals, manual coordination |
For S1/S2 incidents, post a brief update in #ops immediately:
```
[INCIDENT] S1 - Backend DB unreachable on production. Investigating.
```
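The update can also be posted from a shell if you have the channel's webhook. A sketch using Discord's webhook API (POST a JSON body with a `content` field); `OPS_WEBHOOK_URL` is an assumption, and `payload` does no JSON escaping, so keep messages to plain text:

```shell
# Hypothetical: post an incident line to #ops via a Discord webhook.
# payload builds the JSON body Discord's webhook endpoint expects.
payload() { printf '{"content":"%s"}' "$1"; }

# Usage (OPS_WEBHOOK_URL is assumed to hold the real webhook URL):
# curl -fsS -X POST -H 'Content-Type: application/json' \
#   -d "$(payload '[INCIDENT] S1 - Backend DB unreachable on production. Investigating.')" \
#   "$OPS_WEBHOOK_URL"
```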
## Recovery Runbooks
| Runbook | When to use |
|---|---|
| DB Recovery | Database corruption, accidental data loss, need to restore from snapshot |
## Post-Incident Review
After resolving any S1 or S2 incident, document:
- Timeline -- When detected, when resolved, total downtime
- Root cause -- What failed and why
- Impact -- Users affected, data lost (if any)
- Resolution -- Steps taken to restore service
- Action items -- What to change to prevent recurrence
## Related Documentation
- Backup -- How backups are configured
- Monitoring -- Alerting and observability
- Deployment -- How to deploy fixes
- Server Fleet -- SSH access to VMs