# Alerting

Design effective alerts and respond to incidents.
## Alert Design Principles

### Good Alerts Are

- **Actionable** — Someone can do something about it
- **Urgent** — Needs attention now
- **Unique** — Not duplicated by other alerts
- **Understandable** — Clear what's wrong

### Bad Alerts

- Fire frequently but are routinely ignored
- Require no action
- Fire on internal causes rather than user-visible symptoms
- Wake people up for non-urgent issues
## Alert Severity Levels

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| Critical | Service down, data loss | Immediate | Database unreachable |
| High | Major degradation | < 15 min | Error rate > 10% |
| Medium | Noticeable impact | < 1 hour | Latency > 2x normal |
| Low | Minor issue | Next business day | Disk 80% full |
## Prometheus Alerting Rules

### Basic Rules

```yaml
# alerts.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate has been above 5% for the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "High latency (p95: {{ $value | printf \"%.2f\" }}s)"
          description: "95th percentile latency has been above 1 second"

      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "The API service has been unreachable for 1 minute"
```
### Infrastructure Alerts

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: medium
        annotations:
          summary: "High CPU usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 15m
        labels:
          severity: high
        annotations:
          summary: "High memory usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        # tmpfs/overlay mounts excluded to avoid false positives
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        for: 30m
        labels:
          severity: medium
        annotations:
          summary: "Disk space low ({{ $value | printf \"%.1f\" }}% used)"
```
### Database Alerts

```yaml
groups:
  - name: database
    rules:
      - alert: DatabaseConnectionsHigh
        expr: |
          pg_stat_activity_count > pg_settings_max_connections * 0.8
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database connections high ({{ $value | printf \"%.0f\" }} in use)"

      - alert: SlowQueries
        expr: |
          rate(pg_stat_statements_seconds_total[5m]) / rate(pg_stat_statements_calls_total[5m]) > 0.1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "Slow database queries detected (mean > 100ms)"

      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database replication lag ({{ $value | printf \"%.0f\" }}s)"
```
## Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: high
      receiver: 'slack-high'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<pagerduty-key>'
        severity: critical
```
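Routing decides where alerts go; inhibition can additionally suppress lower-severity alerts while a related critical alert is already firing, reducing noise during an outage. A minimal sketch in the same match-style syntax as the config above (the `instance` label is an assumption about your labeling):

```yaml
# Suppress high/medium alerts for an instance while a critical
# alert with the same alertname is already firing there
inhibit_rules:
  - source_match:
      severity: critical
    target_match_re:
      severity: 'high|medium'
    equal: ['alertname', 'instance']
```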
## SLO-Based Alerting

### Define SLOs

```yaml
# Service Level Objectives
- name: API Availability
  target: 99.9%
  indicator: http_requests_total{status!~"5.."}

- name: API Latency
  target: 99% of requests < 500ms
  indicator: http_request_duration_seconds
```
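The error budget implied by an availability target is simply (1 − target) × window. A quick sketch of the arithmetic:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) over the window for a given SLO target."""
    return (1 - slo_target) * window_days * 24 * 60

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(0.999), 1))  # 43.2
```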
### Burn Rate Alerts

```yaml
# Alert when burning error budget too fast
- alert: HighBurnRate
  expr: |
    # Burn rate over the last hour
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
      )
    ) / (1 - 0.999)  # SLO target
    > 14.4           # 14.4x burn exhausts a 30-day budget in ~2 days
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "At the current rate, the monthly error budget will be exhausted in ~2 days"
```
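The 14.4x threshold falls out of simple window arithmetic: a burn rate of r consumes a 30-day budget in 30/r days. A small sketch of that relationship (the 6x figure is an illustrative slower-burn threshold, not from the rule above):

```python
def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the window's error budget is fully consumed, assuming
    errors keep arriving at `burn_rate` times the budgeted rate."""
    return window_days / burn_rate

print(round(days_to_exhaustion(14.4), 2))  # ~2.08 days, matching the alert's comment
print(days_to_exhaustion(6.0))             # a 6x burn exhausts the budget in 5 days
```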
## Alert Response

### Runbook Template

```markdown
# Alert: HighErrorRate

## Summary
Error rate exceeds 5% of requests

## Impact
Users experiencing errors, potential revenue loss

## Investigation Steps
1. Check error logs: `kubectl logs -l app=api --tail=100`
2. Check recent deployments: `kubectl rollout history deployment/api`
3. Check database: `pgcli -c "SELECT * FROM pg_stat_activity"`
4. Check external dependencies: [Status page links]

## Resolution Steps
1. If a recent deployment caused it, roll back: `kubectl rollout undo deployment/api`
2. If it is a database issue, check connections and slow queries
3. If an external service is failing, enable the circuit breaker

## Escalation
- L1: On-call engineer
- L2: Platform team (#platform-oncall)
- L3: Engineering manager

## Related
- Dashboard: [Grafana link]
- Logs: [Kibana link]
- Previous incidents: [Incident links]
```
## On-Call Best Practices

- **Hand off clearly** — Document ongoing issues
- **Acknowledge quickly** — Even if still investigating
- **Communicate broadly** — Status page, Slack
- **Document everything** — What you tried, what worked
- **Hold post-mortems** — Learn from incidents
## Silencing Alerts

### Planned Maintenance

```yaml
# Silence payload for the Alertmanager silences API
matchers:
  - name: alertname
    value: ServiceDown
  - name: instance
    value: api-1
startsAt: "2024-01-15T10:00:00Z"
endsAt: "2024-01-15T12:00:00Z"
createdBy: admin
comment: "Planned maintenance window"
```
### Snoozing

```bash
# Silence via the amtool CLI
amtool silence add alertname=HighCPU --duration=1h --comment="Investigating"
```
## Testing Alerts

```bash
# Send a test alert to Alertmanager (v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
```
## Best Practices Summary

| Do | Don't |
|----|-------|
| Alert on symptoms | Alert on causes |
| Set appropriate thresholds | Use static thresholds blindly |
| Include runbooks | Assume the responder knows everything |
| Test alert routing | Deploy without testing |
| Review and prune regularly | Let stale alerts accumulate |
| Use `for` to avoid flapping | Alert on instant values |