Deploy Monitoring
Observe deployments and detect issues early.
Deployment Observability
Key Metrics During Deploy
┌────────────────────────────────────────────────────────────┐
│                    Deployment Timeline                     │
├─────────┬─────────┬─────────┬─────────┬─────────┬──────────┤
│ Deploy  │ Rollout │ Warm-up │ Verify  │ Monitor │ Stable   │
│ Start   │         │         │         │         │          │
├─────────┴─────────┴─────────┴─────────┴─────────┴──────────┤
│ Watch: Error rate, latency, pod health, traffic shift      │
└────────────────────────────────────────────────────────────┘
What to Monitor
| Metric | Normal | Alert Threshold |
| --- | --- | --- |
| Error rate | < 0.1% | > 1% |
| P99 latency | < 500ms | > 2x baseline |
| Pod restarts | 0 | > 2 |
| CPU usage | < 70% | > 90% |
| Memory usage | < 80% | > 95% |
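
The thresholds in the table can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the `Sample` structure, its field names, and `breached_thresholds` are hypothetical, and it assumes the error rate is given as a fraction (0.01 means 1%) and that a pre-deploy P99 baseline is available for the 2x comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One snapshot of the metrics from the table (hypothetical structure)."""
    error_rate: float       # fraction of requests, 0.01 == 1%
    p99_latency_ms: float   # current P99 latency
    baseline_p99_ms: float  # pre-deploy P99, used for the 2x comparison
    pod_restarts: int
    cpu_percent: float
    memory_percent: float


def breached_thresholds(s: Sample) -> list:
    """Return the names of the metrics that cross their alert threshold."""
    checks = {
        "error_rate": s.error_rate > 0.01,
        "p99_latency": s.p99_latency_ms > 2 * s.baseline_p99_ms,
        "pod_restarts": s.pod_restarts > 2,
        "cpu_usage": s.cpu_percent > 90,
        "memory_usage": s.memory_percent > 95,
    }
    return [name for name, breached in checks.items() if breached]
```

An empty list means every metric is below its alert threshold; any non-empty result is a candidate reason to pause or roll back the rollout.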
Health Checks
Kubernetes Probes
# deployment.yaml
# Pod spec fragment (spec.template.spec in a Deployment)
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      # Liveness and readiness probes only start once the startup probe succeeds.
      # 30 failures x 10s allows up to 300s for startup.
      startupProbe:
        httpGet:
          path: /health/live
          port: 8000
        failureThreshold: 30
        periodSeconds: 10
Health Endpoint
import os
from datetime import datetime, timezone

from fastapi import FastAPI
from fastapi.responses import JSONResponse

# `db` and `redis` are assumed to be the application's async database and
# Redis clients, created elsewhere (for example at startup).

app = FastAPI()
deploy_time = datetime.now(timezone.utc)  # process start, used as deploy time


@app.get("/health/live")
async def liveness():
    """Basic liveness check."""
    return {"status": "alive"}


@app.get("/health/ready")
async def readiness():
    """Readiness with dependency checks."""
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Redis
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())
    # JSONResponse serializes the dict; a plain Response cannot take a dict body.
    return JSONResponse(
        content={"status": "ready" if all_ok else "not_ready", "checks": checks},
        status_code=200 if all_ok else 503,
    )


@app.get("/health/info")
async def info():
    """Deployment information."""
    return {
        "version": os.getenv("APP_VERSION", "unknown"),
        "commit": os.getenv("GIT_COMMIT", "unknown"),
        "deployed_at": deploy_time.isoformat(),
        "uptime_seconds": (datetime.now(timezone.utc) - deploy_time).total_seconds(),
    }
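
A readiness endpoint that awaits each dependency in turn can hang for as long as the slowest dependency does, which may exceed the probe's own timeout. One possible variation, sketched below under stated assumptions, runs the checks concurrently and bounds each with a timeout. The `run_checks` helper, its signature, and the 2-second default are hypothetical and not part of the endpoint above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# A check is any zero-argument coroutine function that raises on failure.
Check = Callable[[], Awaitable[object]]


async def run_checks(
    checks: Dict[str, Check], timeout: float = 2.0
) -> Tuple[bool, Dict[str, str]]:
    """Run all checks concurrently; return (all_ok, per-check status)."""

    async def run_one(check: Check) -> str:
        try:
            # Bound each dependency so one hung connection cannot stall the probe.
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except asyncio.TimeoutError:
            return "error: timed out"
        except Exception as exc:
            return f"error: {exc}"

    names = list(checks)
    outcomes = await asyncio.gather(*(run_one(checks[name]) for name in names))
    results = dict(zip(names, outcomes))
    return all(status == "ok" for status in results.values()), results
```

Inside the readiness handler this could be called as `all_ok, checks = await run_checks({"database": check_db, "redis": check_redis})`, where `check_db` and `check_redis` are small wrappers around the existing `db.execute` and `redis.ping` calls.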
Smoke Tests
Post-Deploy Verification
#!/bin/bash
# smoke-tests.sh
# Usage: ./smoke-tests.sh [BASE_URL]

BASE_URL=${1:-"https://api.example.com"}
FAILURES=0

echo "Running smoke tests against $BASE_URL"

# Health check
echo -n "Health check... "
if curl -sf "$BASE_URL/health/ready" > /dev/null; then
  echo "✓"
else
  echo "✗"
  # Plain arithmetic assignment: ((FAILURES++)) returns a non-zero status
  # when FAILURES is 0, which aborts the script if it is run with `set -e`.
  FAILURES=$((FAILURES + 1))
fi

# API endpoint: request must succeed and report status "ok"
echo -n "API response... "
if RESPONSE=$(curl -sf "$BASE_URL/api/v1/status") \
  && echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
  echo "✓"
else
  echo "✗"
  FAILURES=$((FAILURES + 1))
fi

# Auth flow
echo -n "Auth endpoint... "
if curl -sf -X POST "$BASE_URL/auth/health" > /dev/null; then
  echo "✓"
else
  echo "✗"
  FAILURES=$((FAILURES + 1))
fi

if [ "$FAILURES" -gt 0 ]; then
  echo "FAILED: $FAILURES smoke tests failed"
  exit 1
fi

echo "All smoke tests passed"
GitHub Actions Smoke Tests
- name: Run smoke tests
  run: |
    ./scripts/smoke-tests.sh "${{ vars.API_URL }}"
- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/api
    echo "Smoke tests failed, rolled back"
Deployment Annotations
Grafana Annotations
- name: Create Grafana annotation
  run: |
    # `date +%s000` appends three zeros to epoch seconds to get milliseconds
    curl -X POST "$GRAFANA_URL/api/annotations" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{
        "dashboardUID": "api-overview",
        "time": '$(date +%s000)',
        "tags": ["deployment", "${{ github.ref_name }}"],
        "text": "Deployed ${{ github.sha }} by ${{ github.actor }}"
      }'
Datadog Events
- name: Send Datadog event
  run: |
    # The body is JSON, so set Content-Type explicitly (curl -d defaults to form encoding)
    curl -X POST "https://api.datadoghq.com/api/v1/events" \
      -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "Deployment to production",
        "text": "Version ${{ github.sha }} deployed",
        "tags": ["env:production", "service:api"],
        "alert_type": "info"
      }'
Error Rate Monitoring
Check Error Rate After Deploy
- name: Deploy
  run: ./deploy.sh
- name: Wait for metrics
  run: sleep 120  # Wait for metrics to populate
- name: Check error rate
  run: |
    # An empty result (no matching series) falls back to "0" instead of failing in jq
    ERROR_RATE=$(curl -s "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
      | jq -r '.data.result[0].value[1] // "0"')
    # awk compares as floating point; "+ 0" forces a numeric comparison
    if awk -v rate="$ERROR_RATE" 'BEGIN { exit !(rate + 0 > 0.01) }'; then
      echo "Error rate too high: $ERROR_RATE"
      exit 1
    fi
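
The same check can be written against the JSON that the Prometheus instant-query endpoint returns, where each sample value is a `[timestamp, "string"]` pair. The sketch below is illustrative only: `error_ratio` and `deploy_is_healthy` are hypothetical helpers, and treating an empty or NaN result (for example, no traffic yet) as a ratio of 0.0 is an assumption. Some teams may prefer to fail the deploy when data is missing.

```python
import math


def error_ratio(response: dict) -> float:
    """Extract the error ratio from a Prometheus instant-query response.

    Returns 0.0 when the result vector is empty or the value is NaN,
    which happens when there has been no traffic in the rate window.
    """
    results = response.get("data", {}).get("result", [])
    if not results:
        return 0.0
    # Each sample is [unix_timestamp, "value-as-string"].
    value = float(results[0]["value"][1])
    return 0.0 if math.isnan(value) else value


def deploy_is_healthy(response: dict, threshold: float = 0.01) -> bool:
    """True when the error ratio is at or below the threshold (1% by default)."""
    return error_ratio(response) <= threshold
```

Keeping the parsing separate from the threshold comparison makes the missing-data policy explicit and easy to change in one place.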
Prometheus Alert Rules
# alerts.yml
groups:
  - name: deployment
    rules:
      - alert: HighErrorRateAfterDeploy
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected after deployment"
          # $value is a ratio (0.01 = 1%); humanizePercentage renders it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatencyAfterDeploy
        # P99 above 2 seconds, aggregated across all series by bucket boundary (le)
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected after deployment"
Rollback Triggers
Automatic Rollback
- name: Deploy and monitor
  id: deploy
  run: |
    ./deploy.sh
    # Monitor for 5 minutes (30 checks, 10 seconds apart)
    for i in {1..30}; do
      sleep 10
      # Check error rate
      ERROR_RATE=$(./check-error-rate.sh)
      if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
        echo "error_rate_exceeded=true" >> "$GITHUB_OUTPUT"
        exit 1
      fi
      # Check pod health
      UNHEALTHY=$(kubectl get pods -l app=api --no-headers | grep -v Running | wc -l)
      if [ "$UNHEALTHY" -gt 0 ]; then
        echo "unhealthy_pods=true" >> "$GITHUB_OUTPUT"
        exit 1
      fi
    done
- name: Rollback on failure
  if: failure()
  run: |
    echo "Deployment monitoring failed, rolling back..."
    kubectl rollout undo deployment/api
    # Notify
    ./notify-slack.sh "Deployment rolled back due to monitoring failure"
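
The monitoring loop above reduces to a small decision function: poll two signals for a fixed window and stop at the first breach. The sketch below shows that logic in isolation so it can be unit-tested without a cluster. `monitor_deploy` and its parameters are hypothetical; the two callables stand in for `./check-error-rate.sh` and the `kubectl get pods` count, and the defaults mirror the 30 checks, 10-second interval, and 5% limit used in the workflow.

```python
import time
from typing import Callable, Optional


def monitor_deploy(
    get_error_rate: Callable[[], float],
    count_unhealthy_pods: Callable[[], int],
    checks: int = 30,
    interval_seconds: float = 10.0,
    max_error_rate: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll both signals; return a rollback reason, or None if the window passes."""
    for _ in range(checks):
        sleep(interval_seconds)
        if get_error_rate() > max_error_rate:
            return "error_rate_exceeded"
        if count_unhealthy_pods() > 0:
            return "unhealthy_pods"
    return None
```

Passing `sleep` as a parameter is a deliberate choice: tests can substitute a no-op and run the full window instantly, while production callers keep the real delay.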
Deployment Dashboard
Key Visualizations
## Deployment Dashboard Panels
### Row 1: Health
- Pod status (ready/not ready)
- Container restarts (last hour)
- Deployment rollout status
### Row 2: Traffic
- Request rate (before/after deploy)
- Error rate with deployment annotations
- Latency percentiles (p50, p95, p99)
### Row 3: Resources
- CPU usage by pod
- Memory usage by pod
- Network I/O
### Row 4: Business
- Active users
- Transactions/minute
- Revenue (if applicable)
Grafana Dashboard JSON
{
  "annotations": {
    "list": [
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "changes(kube_deployment_status_observed_generation{deployment=\"api\"}[1m]) > 0",
        "name": "Deployments",
        "tagKeys": "deployment"
      }
    ]
  },
  "panels": [
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "Error Rate"
        }
      ]
    }
  ]
}
Deployment Notifications
Slack Integration
# Every Slack step needs the webhook environment variables, not only the first.
# SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK lets v1 of the action send the block payload as-is.
- name: Notify deployment start
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":rocket: Deployment started",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Started*\n• Environment: ${{ inputs.environment }}\n• Version: `${{ github.sha }}`\n• Author: ${{ github.actor }}"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
- name: Notify deployment complete
  if: success()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":white_check_mark: Deployment successful",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Successful*\n<${{ vars.API_URL }}|View in browser>"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
- name: Notify deployment failed
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":x: Deployment failed",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Deployment Failed*\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View logs>"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
Post-Deploy Checklist
- name: Post-deploy verification
  run: |
    echo "Running post-deploy checklist..."
    # 1. Health check
    curl -sf "$API_URL/health/ready"
    # 2. Version verification
    DEPLOYED_VERSION=$(curl -s "$API_URL/health/info" | jq -r '.commit')
    if [ "$DEPLOYED_VERSION" != "${{ github.sha }}" ]; then
      echo "Version mismatch!"
      exit 1
    fi
    # 3. Smoke tests
    ./smoke-tests.sh "$API_URL"
    # 4. Check logs for errors (any line containing "error" fails the step)
    if kubectl logs -l app=api --since=5m | grep -i error; then
      exit 1
    fi
    # 5. Verify metrics flowing
    # -G with --data-urlencode keeps curl from treating {...} in the query as a URL glob
    curl -s -G "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=up{job="api"}' \
      | jq -e '.data.result | length > 0'
    echo "All post-deploy checks passed!"
Best Practices Summary
| Practice | Benefit |
| --- | --- |
| Health endpoints | Kubernetes integration |
| Smoke tests | Verify functionality |
| Deployment annotations | Correlate metrics |
| Error rate monitoring | Detect issues early |
| Auto-rollback | Quick recovery |
| Deployment dashboard | Visibility |
| Notifications | Team awareness |
| Post-deploy checklist | Comprehensive verification |