# Incident Response
Procedures for identifying, triaging, and resolving production incidents on the Sartiq platform.
## Severity Levels
| Level | Name | Description | Response Time | Examples |
|---|---|---|---|---|
| S1 | Critical | Service fully down, data loss risk | Immediate (< 30 min) | Database corruption, full outage |
| S2 | Major | Core feature broken, degraded for all users | < 2 hours | Generation pipeline stuck, auth broken |
| S3 | Minor | Non-critical feature broken, workaround exists | Next business day | Export failures, styling errors |
| S4 | Low | Cosmetic or low-impact issue | Current or next sprint | UI glitch, non-blocking warning |
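The table above can be encoded as a small helper so scripts and bots report a consistent deadline. This is a hypothetical sketch (`response_deadline` is not part of the platform tooling):

```shell
# Hypothetical helper: map a severity level to its target response time,
# mirroring the severity table above.
response_deadline() {
  case "$1" in
    S1) echo "immediate (< 30 min)" ;;
    S2) echo "< 2 hours" ;;
    S3) echo "next business day" ;;
    S4) echo "current or next sprint" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```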
## Triage Checklist
When an incident is detected (alert, user report, or manual observation), follow this order:
### 1. Identify scope
- Which environment? (production, staging, dev)
- Which service? (backend, compute, webapp)
- How many users affected?
### 2. Check service health
```shell
# Backend
curl -f https://api.sartiq.com/api/v1/utils/health-check/

# Compute
curl -f https://compute-api.sartiq.com/health

# Check container status (SSH into VM first)
docker compose ps
docker compose logs --tail=50 backend
```
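During a hectic triage it helps to run the health probes in one pass and get a pass/fail summary. A minimal sketch (`probe` is a hypothetical helper, not existing tooling; it wraps any command and reports OK/FAIL):

```shell
# Hypothetical sweep helper: run a probe command, print OK/FAIL by label.
probe() {  # probe <label> <command...>
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Usage, with the endpoints from the checklist above:
# probe backend curl -fsS --max-time 10 https://api.sartiq.com/api/v1/utils/health-check/
# probe compute curl -fsS --max-time 10 https://compute-api.sartiq.com/health
```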
### 3. Check database
```shell
# Verify PostgreSQL is responding
docker compose exec db pg_isready -U postgres -d app

# Check active connections
docker compose exec db psql -U postgres -d app -c "SELECT count(*) FROM pg_stat_activity;"

# Check for long-running queries
docker compose exec db psql -U postgres -d app -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND query_start < now() - interval '1 minute'
ORDER BY duration DESC;
"
```
### 4. Check logs
- Better Stack: Check the relevant source for error spikes
- Discord: Check error/warning webhook channels
- Container logs:
```shell
docker compose logs --tail=100 --timestamps <service>
```
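Raw container logs are noisy during an incident; filtering for error-level lines narrows them quickly. A minimal sketch, assuming the service names from step 1 (`errlines` is a hypothetical helper, and the grep pattern is a guess at the log format):

```shell
# Hypothetical helper: surface error-level lines from a service's recent logs.
errlines() {  # errlines <service>
  docker compose logs --since=1h --timestamps "$1" 2>&1 \
    | grep -iE 'error|exception|traceback'
}

# Usage: errlines backend
```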
## Communication
| Channel | Purpose |
|---|---|
| Discord #errors | Automated error notifications (S1/S2) |
| Discord #warnings | Automated warning notifications (S3) |
| Discord #ops | Operator signals, manual coordination |
For S1/S2 incidents, post a brief update in #ops immediately:
```
[INCIDENT] S1 - Backend DB unreachable on production. Investigating.
```
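The update can also be posted from a shell if you have the channel's webhook. A sketch using Discord's webhook API (POST a JSON body with a `content` field); `OPS_WEBHOOK_URL` is an assumption, and `payload` does no JSON escaping, so keep messages to plain text:

```shell
# Hypothetical: post an incident line to #ops via a Discord webhook.
# payload builds the JSON body Discord's webhook endpoint expects.
payload() { printf '{"content":"%s"}' "$1"; }

# Usage (OPS_WEBHOOK_URL is assumed to hold the real webhook URL):
# curl -fsS -X POST -H 'Content-Type: application/json' \
#   -d "$(payload '[INCIDENT] S1 - Backend DB unreachable on production. Investigating.')" \
#   "$OPS_WEBHOOK_URL"
```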
## Recovery Runbooks
| Runbook | When to use |
|---|---|
| DB Recovery | Database corruption, accidental data loss, need to restore from snapshot |
## Post-Incident Review
After resolving any S1 or S2 incident, document:
- Timeline -- When detected, when resolved, total downtime
- Root cause -- What failed and why
- Impact -- Users affected, data lost (if any)
- Resolution -- Steps taken to restore service
- Action items -- What to change to prevent recurrence
## Related Documentation
- Backup -- How backups are configured
- Monitoring -- Alerting and observability
- Deployment -- How to deploy fixes
- Server Fleet -- SSH access to VMs