Deploy Monitoring¶

Observe deployments and detect issues early.

Deployment Observability¶

Key Metrics During Deploy¶

┌─────────────────────────────────────────────────────────────┐
│                    Deployment Timeline                       │
├─────────┬─────────┬─────────┬─────────┬─────────┬──────────┤
│ Deploy  │ Rollout │ Warm-up │ Verify  │ Monitor │ Stable   │
│ Start   │         │         │         │         │          │
├─────────┴─────────┴─────────┴─────────┴─────────┴──────────┤
│ Watch: Error rate, latency, pod health, traffic shift       │
└─────────────────────────────────────────────────────────────┘

What to Monitor¶

Metric	Normal	Alert Threshold
Error rate	< 0.1%	> 1%
P99 latency	< 500ms	> 2x baseline
Pod restarts	0	> 2
CPU usage	< 70%	> 90%
Memory usage	< 80%	> 95%
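
The thresholds in the table can be encoded as data so that a pipeline step evaluates them the same way every time. The following is a minimal sketch, assuming the metric values have already been collected elsewhere; `Snapshot` and `breached_thresholds` are hypothetical names that do not appear in any tool used on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Metric values sampled during a deploy (hypothetical structure)."""
    error_rate: float        # fraction of requests, e.g. 0.01 == 1%
    p99_latency_ms: float
    pod_restarts: int
    cpu_fraction: float      # 0.0 to 1.0
    memory_fraction: float   # 0.0 to 1.0


def breached_thresholds(current: Snapshot, baseline_p99_ms: float) -> list[str]:
    """Return the names of metrics that exceed the alert thresholds in the table."""
    breaches = []
    if current.error_rate > 0.01:
        breaches.append("error_rate")
    if current.p99_latency_ms > 2 * baseline_p99_ms:
        breaches.append("p99_latency")
    if current.pod_restarts > 2:
        breaches.append("pod_restarts")
    if current.cpu_fraction > 0.90:
        breaches.append("cpu")
    if current.memory_fraction > 0.95:
        breaches.append("memory")
    return breaches
```

An empty list means every metric is below its alert threshold. Values between "Normal" and "Alert Threshold" are not flagged in this sketch.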

Health Checks¶

Kubernetes Probes¶

# deployment.yaml
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3

      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3

      startupProbe:
        httpGet:
          path: /health/live
          port: 8000
        failureThreshold: 30
        periodSeconds: 10
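
The probe settings above translate into time budgets. As a rough worked example (it ignores per-probe timeouts and scheduling jitter, and `failure_window_seconds` is a hypothetical helper), the time Kubernetes tolerates consecutive failures is approximately `failureThreshold * periodSeconds`:

```python
def failure_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a probe may keep failing before Kubernetes acts on it."""
    return failure_threshold * period_seconds


# Startup probe: the container gets about 5 minutes to start before it is killed.
startup_budget = failure_window_seconds(failure_threshold=30, period_seconds=10)

# Liveness probe: roughly 30 seconds of consecutive failures triggers a restart.
liveness_budget = failure_window_seconds(failure_threshold=3, period_seconds=10)

# Readiness probe: roughly 15 seconds of failures removes the pod from Service endpoints.
readiness_budget = failure_window_seconds(failure_threshold=3, period_seconds=5)

print(startup_budget, liveness_budget, readiness_budget)
```

While a startup probe is configured, the liveness and readiness probes do not run until it has succeeded, so a slow-starting container is not restarted prematurely.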

Health Endpoint¶

import os
from datetime import datetime, timezone

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# Captured at import time, so this is effectively the process start time
deploy_time = datetime.now(timezone.utc)

# `db` and `redis` are assumed to be async clients created elsewhere in the application

@app.get("/health/live")
async def liveness():
    """Basic liveness check."""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Readiness with dependency checks."""
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Redis
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())

    # JSONResponse serializes the dict; a plain Response only accepts str or bytes
    return JSONResponse(
        content={"status": "ready" if all_ok else "not_ready", "checks": checks},
        status_code=200 if all_ok else 503,
    )

@app.get("/health/info")
async def info():
    """Deployment information."""
    return {
        "version": os.getenv("APP_VERSION", "unknown"),
        "commit": os.getenv("GIT_COMMIT", "unknown"),
        "deployed_at": deploy_time.isoformat(),
        "uptime_seconds": (datetime.now(timezone.utc) - deploy_time).total_seconds(),
    }
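
The readiness handler runs its dependency checks one after another, so a hung dependency can hold the response past the probe timeout (Kubernetes defaults `timeoutSeconds` to 1). One possible variation, sketched here under the assumption that each check is an async callable that raises on failure, runs the checks concurrently with a per-check timeout. `run_checks` and the `Check` alias are hypothetical names, not part of FastAPI.

```python
import asyncio
from typing import Awaitable, Callable

# A check is any zero-argument async callable that raises on failure (assumed shape).
Check = Callable[[], Awaitable[None]]


async def run_checks(checks: dict[str, Check], timeout: float = 2.0) -> tuple[bool, dict[str, str]]:
    """Run all checks concurrently; report "ok" or an error string per check."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout=timeout)
            return "ok"
        except asyncio.TimeoutError:
            return "error: timeout"
        except Exception as exc:
            return f"error: {exc}"

    names = list(checks)
    outcomes = await asyncio.gather(*(run_one(checks[name]) for name in names))
    results = dict(zip(names, outcomes))
    return all(value == "ok" for value in results.values()), results
```

The returned pair maps directly onto the handler's `all_ok` flag and `checks` dictionary, so the 200/503 decision stays the same.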

Smoke Tests¶

Post-Deploy Verification¶

#!/bin/bash
# smoke-tests.sh

BASE_URL=${1:-"https://api.example.com"}
FAILURES=0

echo "Running smoke tests against $BASE_URL"

# Health check
echo -n "Health check... "
if curl -sf "$BASE_URL/health/ready" > /dev/null; then
    echo "✓"
else
    echo "✗"
    FAILURES=$((FAILURES + 1))  # unlike ((FAILURES++)), never returns a non-zero status
fi

# API endpoint
echo -n "API response... "
RESPONSE=$(curl -sf "$BASE_URL/api/v1/status")
if [ $? -eq 0 ] && echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
    echo "✓"
else
    echo "✗"
    FAILURES=$((FAILURES + 1))  # unlike ((FAILURES++)), never returns a non-zero status
fi

# Auth flow
echo -n "Auth endpoint... "
if curl -sf -X POST "$BASE_URL/auth/health" > /dev/null; then
    echo "✓"
else
    echo "✗"
    FAILURES=$((FAILURES + 1))  # unlike ((FAILURES++)), never returns a non-zero status
fi

if [ $FAILURES -gt 0 ]; then
    echo "FAILED: $FAILURES smoke tests failed"
    exit 1
fi

echo "All smoke tests passed"
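
The same three checks can be written as a table-driven runner, which makes it easier to add endpoints and to test the logic without a live server. This is a sketch only: it reuses the endpoint paths from the script, and `fetch`, `status_is_ok`, and `run_smoke_tests` are hypothetical names.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch(url: str, method: str = "GET", timeout: float = 5.0) -> Optional[bytes]:
    """Return the response body, or None on any HTTP or network error (like curl -sf)."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError, ValueError):
        return None


def status_is_ok(body: Optional[bytes]) -> bool:
    """True when the body is a JSON object whose "status" field equals "ok"."""
    if body is None:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def run_smoke_tests(base_url: str, fetcher: Callable[..., Optional[bytes]] = fetch) -> list[str]:
    """Run every check and return the names of the ones that failed."""
    checks = {
        "Health check": lambda: fetcher(f"{base_url}/health/ready") is not None,
        "API response": lambda: status_is_ok(fetcher(f"{base_url}/api/v1/status")),
        "Auth endpoint": lambda: fetcher(f"{base_url}/auth/health", method="POST") is not None,
    }
    return [name for name, check in checks.items() if not check()]
```

A caller would exit non-zero when the returned list is non-empty, mirroring the `FAILURES` counter in the shell script.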

GitHub Actions Smoke Tests¶

- name: Run smoke tests
  run: |
    ./scripts/smoke-tests.sh "${{ vars.API_URL }}"

- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/api
    echo "Smoke tests failed, rolled back"

Deployment Annotations¶

Grafana Annotations¶

- name: Create Grafana annotation
  run: |
    curl -X POST "$GRAFANA_URL/api/annotations" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{
        "dashboardUID": "api-overview",
        "time": '$(date +%s000)',
        "tags": ["deployment", "${{ github.ref_name }}"],
        "text": "Deployed ${{ github.sha }} by ${{ github.actor }}"
      }'
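
Building the JSON body by splicing values into a shell string works for simple inputs, but it produces invalid JSON if a value ever contains a double quote or backslash. A minimal sketch of the same payload built with a JSON serializer follows; the field names match the request above, while `build_annotation` and its parameters are hypothetical.

```python
import json
import time
from typing import Optional


def build_annotation(ref_name: str, sha: str, actor: str,
                     now_seconds: Optional[float] = None) -> str:
    """Return the annotation request body as a JSON string."""
    current = time.time() if now_seconds is None else now_seconds
    return json.dumps({
        "dashboardUID": "api-overview",
        "time": int(current * 1000),  # epoch milliseconds, like `date +%s000`
        "tags": ["deployment", ref_name],
        "text": f"Deployed {sha} by {actor}",
    })
```

The resulting string can be written to a file and sent with `curl --data @file`, which keeps the shell out of the quoting entirely.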

Datadog Events¶

- name: Send Datadog event
  run: |
    curl -X POST "https://api.datadoghq.com/api/v1/events" \
      -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "Deployment to production",
        "text": "Version ${{ github.sha }} deployed",
        "tags": ["env:production", "service:api"],
        "alert_type": "info"
      }'

Error Rate Monitoring¶

Check Error Rate After Deploy¶

- name: Deploy
  run: ./deploy.sh

- name: Wait for metrics
  run: sleep 120  # Wait for metrics to populate

- name: Check error rate
  run: |
    # An empty result or NaN (no traffic yet) falls back to 0 instead of breaking the comparison
    ERROR_RATE=$(curl -s "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
      | jq -r '(.data.result[0].value[1] // "0") | if . == "NaN" then "0" else . end')

    if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
      echo "Error rate too high: $ERROR_RATE"
      exit 1
    fi
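
The parsing step is where this check is most fragile: an instant query can return an empty result vector or the string `"NaN"` when there has been no traffic. A sketch of the same extraction in Python, assuming the standard Prometheus instant-query response shape (`error_rate_from_response` is a hypothetical name):

```python
import json
import math


def error_rate_from_response(body: str) -> float:
    """Extract the error-rate scalar from an instant-query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return 0.0  # no series matched, for example no requests recorded yet
    value = float(result[0]["value"][1])  # sample values arrive as strings
    return 0.0 if math.isnan(value) else value  # 0/0 in PromQL yields NaN
```

Treating "no data" as a zero error rate is a policy choice in this sketch. A stricter pipeline might fail the step instead, since a deploy that receives no traffic has not been verified.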

Prometheus Alert Rules¶

# alerts.yml
groups:
  - name: deployment
    rules:
      - alert: HighErrorRateAfterDeploy
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected after deployment"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatencyAfterDeploy
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected after deployment"

Rollback Triggers¶

Automatic Rollback¶

- name: Deploy and monitor
  id: deploy
  run: |
    ./deploy.sh

    # Monitor for 5 minutes
    for i in {1..30}; do
      sleep 10

      # Check error rate
      ERROR_RATE=$(./check-error-rate.sh)
      if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
        echo "error_rate_exceeded=true" >> $GITHUB_OUTPUT
        exit 1
      fi

      # Check pod health
      UNHEALTHY=$(kubectl get pods -l app=api --no-headers | grep -v Running | wc -l)
      if [ "$UNHEALTHY" -gt 0 ]; then
        echo "unhealthy_pods=true" >> $GITHUB_OUTPUT
        exit 1
      fi
    done

- name: Rollback on failure
  if: failure()
  run: |
    echo "Deployment monitoring failed, rolling back..."
    kubectl rollout undo deployment/api

    # Notify
    ./notify-slack.sh "Deployment rolled back due to monitoring failure"
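
The monitoring loop in the deploy step is easier to test when the two checks are passed in as functions. The sketch below mirrors the same logic (30 checks, 10 seconds apart, a 5% error-rate limit, and any unhealthy pod counts as a failure); `monitor_rollout` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


def monitor_rollout(
    get_error_rate: Callable[[], float],
    count_unhealthy_pods: Callable[[], int],
    checks: int = 30,
    interval_seconds: float = 10.0,
    max_error_rate: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll both checks; return a rollback reason, or None if the window passes cleanly."""
    for _ in range(checks):
        sleep(interval_seconds)

        rate = get_error_rate()
        if rate > max_error_rate:
            return f"error rate {rate:.2%} exceeded {max_error_rate:.2%}"

        unhealthy = count_unhealthy_pods()
        if unhealthy > 0:
            return f"{unhealthy} unhealthy pod(s)"
    return None
```

A non-`None` return value corresponds to the `exit 1` paths in the workflow step, and the reason string can be passed on to the rollback notification.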

Deployment Dashboard¶

Key Visualizations¶

## Deployment Dashboard Panels

### Row 1: Health
- Pod status (ready/not ready)
- Container restarts (last hour)
- Deployment rollout status

### Row 2: Traffic
- Request rate (before/after deploy)
- Error rate with deployment annotations
- Latency percentiles (p50, p95, p99)

### Row 3: Resources
- CPU usage by pod
- Memory usage by pod
- Network I/O

### Row 4: Business
- Active users
- Transactions/minute
- Revenue (if applicable)

Grafana Dashboard JSON¶

{
  "annotations": {
    "list": [
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "changes(kube_deployment_status_observed_generation{deployment=\"api\"}[1m]) > 0",
        "name": "Deployments",
        "tagKeys": "deployment"
      }
    ]
  },
  "panels": [
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "Error Rate"
        }
      ]
    }
  ]
}

Deployment Notifications¶

Slack Integration¶

- name: Notify deployment start
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":rocket: Deployment started",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Started*\n• Environment: ${{ inputs.environment }}\n• Version: `${{ github.sha }}`\n• Author: ${{ github.actor }}"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

- name: Notify deployment complete
  if: success()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":white_check_mark: Deployment successful",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Successful*\n<${{ vars.API_URL }}|View in browser>"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

- name: Notify deployment failed
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":x: Deployment failed",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Failed*\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View logs>"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Post-Deploy Checklist¶

- name: Post-deploy verification
  run: |
    echo "Running post-deploy checklist..."

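    # 1. Health check
    curl -f "$API_URL/health/ready"

    # 2. Version verification
    DEPLOYED_VERSION=$(curl -sf "$API_URL/health/info" | jq -r '.commit')
    if [ "$DEPLOYED_VERSION" != "${{ github.sha }}" ]; then
      echo "Version mismatch: expected ${{ github.sha }}, got $DEPLOYED_VERSION"
      exit 1
    fi

    # 3. Smoke tests
    ./smoke-tests.sh "$API_URL"

    # 4. Check logs for errors
    #    --tail=-1 is needed because kubectl shows only the last 10 lines per pod when a selector is used
    if kubectl logs -l app=api --since=5m --tail=-1 | grep -i error; then
      echo "Errors found in recent logs"
      exit 1
    fi

    # 5. Verify metrics flowing
    #    -G with --data-urlencode keeps curl from treating the braces as a URL glob
    curl -sfG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=up{job="api"}' \
      | jq -e '.data.result | length > 0' > /dev/null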

    echo "All post-deploy checks passed!"

Best Practices Summary¶

Practice	Benefit
Health endpoints	Kubernetes integration
Smoke tests	Verify functionality
Deployment annotations	Correlate metrics
Error rate monitoring	Detect issues early
Auto-rollback	Quick recovery
Deployment dashboard	Visibility
Notifications	Team awareness
Post-deploy checklist	Comprehensive verification