
Alerting

Design effective alerts and respond to incidents.

Alert Design Principles

Good Alerts Are

  1. Actionable — Someone can do something about it
  2. Urgent — Needs attention now
  3. Unique — Not duplicated by other alerts
  4. Understandable — Clear what's wrong

Bad Alerts

  • Fire frequently but are always ignored
  • Require no action
  • Fire on internal causes rather than user-visible symptoms
  • Wake people up for non-urgent issues

Alert Severity Levels

Level      Description               Response Time      Example
--------   -----------------------   ----------------   --------------------
Critical   Service down, data loss   Immediate          Database unreachable
High       Major degradation         < 15 min           Error rate > 10%
Medium     Noticeable impact         < 1 hour           Latency > 2x normal
Low        Minor issue               Next business day  Disk 80% full

Prometheus Alerting Rules

Basic Rules

# alerts.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "High latency (p95: {{ $value | printf \"%.2f\" }}s)"
          description: "95th percentile latency is above 1 second"

      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "The API service has been unreachable for 1 minute"

Infrastructure Alerts

groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: medium
        annotations:
          summary: "High CPU usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 15m
        labels:
          severity: high
        annotations:
          summary: "High memory usage ({{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        expr: |
          (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 30m
        labels:
          severity: medium
        annotations:
          summary: "Disk space low ({{ $value | printf \"%.1f\" }}% used)"

Database Alerts

groups:
  - name: database
    rules:
      - alert: DatabaseConnectionsHigh
        expr: |
          sum(pg_stat_activity_count)
          / max(pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database connections at {{ $value | humanizePercentage }} of max_connections"

      - alert: SlowQueries
        expr: |
          rate(pg_stat_statements_seconds_total[5m]) / rate(pg_stat_statements_calls_total[5m]) > 0.1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "Slow database queries detected"

      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Database replication lag ({{ $value | printf \"%.0f\" }}s)"

Alertmanager Configuration

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: critical
      receiver: 'slack-critical'

    - match:
        severity: high
      receiver: 'slack-high'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<pagerduty-key>'
        severity: critical
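
To keep alerts unique (principle 3 above), Alertmanager inhibition rules can suppress lower-severity alerts while a critical one for the same issue is firing. A sketch using the severity labels from this config — the labels in `equal` are assumptions and should match whatever labels your alerts share:

```yaml
inhibit_rules:
  # While a critical alert fires, mute high/medium alerts
  # that carry the same alertname and instance labels.
  - source_match:
      severity: critical
    target_match_re:
      severity: 'high|medium'
    equal: ['alertname', 'instance']
```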

SLO-Based Alerting

Define SLOs

# Service Level Objectives
- name: API Availability
  target: 99.9%
  indicator: http_requests_total{status!~"5.."}

- name: API Latency
  target: 99% of requests < 500ms
  indicator: http_request_duration_seconds
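
SLIs like the availability ratio above are often precomputed with Prometheus recording rules, so dashboards and burn-rate alerts share a single definition. A sketch — the recording rule name is an assumption, following the common `level:metric:operations` convention:

```yaml
groups:
  - name: slo-recording
    rules:
      # Fraction of non-5xx requests over the last 5 minutes
      - record: sli:request_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
```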

Burn Rate Alerts

# Alert when burning error budget too fast
- alert: HighBurnRate
  expr: |
    # Burn rate over last hour
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
      )
    ) / (1 - 0.999)  # SLO target
    > 14.4  # 14.4x = will exhaust 30-day budget in 2 days
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "At current rate, will exhaust monthly error budget in 2 days"
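
A single one-hour window can keep firing long after the burn has stopped. A common refinement (from the Google SRE workbook) pairs the long window with a short one, so the alert only fires while the budget is actively burning; a sketch reusing the same metrics and SLO target:

```yaml
- alert: HighBurnRateMultiWindow
  expr: |
    # Budget burning fast over the last hour...
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) / (1 - 0.999) > 14.4
    and
    # ...and still burning fast right now (last 5 minutes)
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) / (1 - 0.999) > 14.4
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at >14.4x in both 1h and 5m windows"
```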

Alert Response

Runbook Template

# Alert: HighErrorRate

## Summary
Error rate exceeds 5% of requests

## Impact
Users experiencing errors, potential revenue loss

## Investigation Steps
1. Check error logs: `kubectl logs -l app=api --tail=100`
2. Check recent deployments: `kubectl rollout history deployment/api`
3. Check database: `psql -c "SELECT * FROM pg_stat_activity;"`
4. Check external dependencies: [Status page links]

## Resolution Steps
1. If recent deployment, rollback: `kubectl rollout undo deployment/api`
2. If database issue, check connections and queries
3. If external service, enable circuit breaker

## Escalation
- L1: On-call engineer
- L2: Platform team (#platform-oncall)
- L3: Engineering manager

## Related
- Dashboard: [Grafana link]
- Logs: [Kibana link]
- Previous incidents: [Incident links]

On-Call Best Practices

  1. Handoff clearly — Document ongoing issues
  2. Acknowledge quickly — Even if still investigating
  3. Communicate broadly — Status page, Slack
  4. Document everything — What you tried, what worked
  5. Post-mortem — Learn from incidents

Silencing Alerts

Planned Maintenance

# Create silence in Alertmanager
- matchers:
    - name: alertname
      value: ServiceDown
    - name: instance
      value: api-1
  startsAt: 2024-01-15T10:00:00Z
  endsAt: 2024-01-15T12:00:00Z
  createdBy: admin
  comment: "Planned maintenance window"

Snoozing

# Silence via API
amtool silence add alertname=HighCPU --duration=1h --comment="Investigating"

Testing Alerts

# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
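
Alerting rules can also be unit-tested offline with `promtool test rules`, without a running Alertmanager. A sketch exercising the ServiceDown rule from the alerts.yml above (the test file name is an assumption):

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up reports 0 for 10 minutes
      - series: 'up{job="api"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
            exp_annotations:
              summary: "API service is down"
              description: "The API service has been unreachable for 1 minute"
```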

Best Practices Summary

Do                            Don't
---------------------------   ---------------------------------
Alert on symptoms             Alert on causes
Set appropriate thresholds    Use static thresholds blindly
Include runbooks              Assume responder knows everything
Test alert routing            Deploy without testing
Review and prune regularly    Let stale alerts accumulate
Use `for` to avoid flapping   Alert on instant values