
Alerting

Design effective alerts and respond to incidents.

Alert Design Principles

Good Alerts Are

  1. Actionable — Someone can do something about it
  2. Urgent — Needs attention now
  3. Unique — Not duplicated by other alerts
  4. Understandable — Clear what's wrong

Bad Alerts

  • Fire frequently but are always ignored
  • Require no action
  • Fire on internal causes rather than user-visible symptoms
  • Wake people up for non-urgent issues

Alert Severity Levels

Level      Description               Response Time      Example
--------   -----------------------   ----------------   --------------------
Critical   Service down, data loss   Immediate          Database unreachable
High       Major degradation         < 15 min           Error rate > 10%
Medium     Noticeable impact         < 1 hour           Latency > 2x normal
Low        Minor issue               Next business day  Disk 80% full

Prometheus Alerting Rules

Basic Rules

# alerts.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "High latency (p95: {{ $value | printf \"%.2f\" }}s)"
          description: "95th percentile latency is above 1 second"

      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "The API service has been unreachable for 1 minute"

Infrastructure Alerts

groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: medium
        annotations:
          summary: "High CPU usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 15m
        labels:
          severity: high
        annotations:
          summary: "High memory usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        expr: |
          (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 30m
        labels:
          severity: medium
        annotations:
          summary: "Disk space low ({{ $value | printf \"%.1f\" }}% used)"

Database Alerts

groups:
  - name: database
    rules:
      - alert: DatabaseConnectionsHigh
        expr: |
          sum(pg_stat_activity_count)
          / max(pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database connections at {{ $value | humanizePercentage }} of max_connections"

      - alert: SlowQueries
        expr: |
          rate(pg_stat_statements_seconds_total[5m]) / rate(pg_stat_statements_calls_total[5m]) > 0.1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "Slow database queries detected"

      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database replication lag ({{ $value | printf \"%.0f\" }}s)"

Alertmanager Configuration

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: critical
      receiver: 'slack-critical'

    - match:
        severity: high
      receiver: 'slack-high'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<pagerduty-key>'
        severity: critical
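
To keep alerts unique (principle 3 above), Alertmanager inhibition rules can suppress lower-severity alerts while a critical one for the same issue is firing. A sketch using the severity labels from this config — the labels in `equal` are assumptions and should match whatever labels your alerts share:

```yaml
inhibit_rules:
  # While a critical alert fires, mute high/medium alerts
  # that carry the same alertname and instance labels.
  - source_match:
      severity: critical
    target_match_re:
      severity: 'high|medium'
    equal: ['alertname', 'instance']
```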

SLO-Based Alerting

Define SLOs

# Service Level Objectives
- name: API Availability
  target: 99.9%
  indicator: http_requests_total{status!~"5.."}

- name: API Latency
  target: 99% of requests < 500ms
  indicator: http_request_duration_seconds
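
SLIs like the availability ratio above are often precomputed with Prometheus recording rules, so dashboards and burn-rate alerts share a single definition. A sketch — the recording rule name is an assumption, following the common `level:metric:operations` convention:

```yaml
groups:
  - name: slo-recording
    rules:
      # Fraction of non-5xx requests over the last 5 minutes
      - record: sli:request_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
```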

Burn Rate Alerts

# Alert when burning error budget too fast
- alert: HighBurnRate
  expr: |
    # Burn rate over last hour
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
      )
    ) / (1 - 0.999)  # SLO target
    > 14.4  # 14.4x = will exhaust 30-day budget in 2 days
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "At current rate, will exhaust monthly error budget in 2 days"
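
A single one-hour window can keep firing long after the burn has stopped. A common refinement (from the Google SRE workbook) pairs the long window with a short one, so the alert only fires while the budget is actively burning; a sketch reusing the same metrics and SLO target:

```yaml
- alert: HighBurnRateMultiWindow
  expr: |
    # Budget burning fast over the last hour...
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) / (1 - 0.999) > 14.4
    and
    # ...and still burning fast right now (last 5 minutes)
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) / (1 - 0.999) > 14.4
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at >14.4x in both 1h and 5m windows"
```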

Alert Response

Runbook Template

# Alert: HighErrorRate

## Summary
Error rate exceeds 5% of requests

## Impact
Users experiencing errors, potential revenue loss

## Investigation Steps
1. Check error logs: `kubectl logs -l app=api --tail=100`
2. Check recent deployments: `kubectl rollout history deployment/api`
3. Check database: `psql -c "SELECT * FROM pg_stat_activity;"`
4. Check external dependencies: [Status page links]

## Resolution Steps
1. If recent deployment, rollback: `kubectl rollout undo deployment/api`
2. If database issue, check connections and queries
3. If external service, enable circuit breaker

## Escalation
- L1: On-call engineer
- L2: Platform team (#platform-oncall)
- L3: Engineering manager

## Related
- Dashboard: [Grafana link]
- Logs: [Kibana link]
- Previous incidents: [Incident links]

On-Call Best Practices

  1. Handoff clearly — Document ongoing issues
  2. Acknowledge quickly — Even if still investigating
  3. Communicate broadly — Status page, Slack
  4. Document everything — What you tried, what worked
  5. Post-mortem — Learn from incidents

Silencing Alerts

Planned Maintenance

# Create silence in Alertmanager
- matchers:
    - name: alertname
      value: ServiceDown
    - name: instance
      value: api-1
  startsAt: 2024-01-15T10:00:00Z
  endsAt: 2024-01-15T12:00:00Z
  createdBy: admin
  comment: "Planned maintenance window"

Snoozing

# Silence via API
amtool silence add alertname=HighCPU --duration=1h --comment="Investigating"

Testing Alerts

# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
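
Alerting rules can also be unit-tested offline with `promtool test rules`, without a running Alertmanager. A sketch exercising the ServiceDown rule from the alerts.yml above (the test file name is an assumption):

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up reports 0 for 10 minutes
      - series: 'up{job="api"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
            exp_annotations:
              summary: "API service is down"
              description: "The API service has been unreachable for 1 minute"
```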

Best Practices Summary

Do                            Don't
---------------------------   ---------------------------------
Alert on symptoms             Alert on causes
Set appropriate thresholds    Use static thresholds blindly
Include runbooks              Assume responder knows everything
Test alert routing            Deploy without testing
Review and prune regularly    Let stale alerts accumulate
Use `for` to avoid flapping   Alert on instant values