CI/CD Troubleshooting¶
Common issues and solutions for CI/CD pipelines.
GitHub Actions Issues¶
Workflow Not Triggering¶
Problem: Push doesn't trigger workflow
Solutions:
# Check the trigger configuration
on:
  push:
    branches:
      - main
      - 'feature/**'  # Quote glob patterns
    paths:
      - 'src/**'
      - '!src/**/*.md'  # Exclude Markdown files
# Check the commit message for [skip ci]:
# commits containing it don't trigger push or pull_request workflows
Debug:
# Check if workflow file is valid
gh workflow list
# View workflow runs
gh run list --workflow=ci.yml
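To rule out a skip directive as the reason a push did not trigger, the commit message can be checked with a small script. The following is a minimal sketch: the function name `has_skip_directive` is hypothetical, and the exact substring match is an assumption about how strictly the directives are matched.

```python
# Skip directives that GitHub Actions honors in commit messages
# for push and pull_request events.
SKIP_DIRECTIVES = (
    "[skip ci]",
    "[ci skip]",
    "[no ci]",
    "[skip actions]",
    "[actions skip]",
)


def has_skip_directive(commit_message: str) -> bool:
    """Return True if the message contains any known skip directive.

    Assumption: an exact, case-sensitive substring match is good enough
    for a quick local check.
    """
    return any(directive in commit_message for directive in SKIP_DIRECTIVES)


if __name__ == "__main__":
    import subprocess

    # Read the message of the most recent commit in the current repository.
    message = subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("skip directive found" if has_skip_directive(message) else "no skip directive")
```

Run it from the repository root before digging into the trigger configuration.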
Permission Denied¶
Problem: Permission denied or 403 errors
Solutions:
# Add permissions to the job
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write  # For OIDC
# For GITHUB_TOKEN
permissions:
  contents: write       # For pushing tags
  pull-requests: write  # For PR comments
Secret Not Available¶
Problem: Secret not found or empty value
Check:
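# Secret names are case-insensitive, but environment variable names
# are case-sensitive on Linux runners
env:
  API_KEY: ${{ secrets.API_KEY }}  # Read in scripts as $API_KEY
  api_key: ${{ secrets.api_key }}  # Same secret, but a different variable: $api_key
# Secrets are not passed to workflows triggered by pull requests from forks
- if: github.event.pull_request.head.repo.full_name == github.repository
  env:
    SECRET: ${{ secrets.MY_SECRET }}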
Cache Not Working¶
Problem: Cache miss every run
Solutions:
# Verify the cache key is consistent
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    # Key must be deterministic
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    # Fallback keys
    restore-keys: |
      ${{ runner.os }}-pip-
# Check that the path exists
- run: ls -la ~/.cache/pip || echo "Cache directory doesn't exist"
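A deterministic key means the same inputs always produce the same key, and any change to a dependency file produces a new one. The sketch below illustrates that property locally. It is not the algorithm behind `hashFiles`; the function `cache_key` and its parameters are hypothetical.

```python
import hashlib
from pathlib import Path


def cache_key(os_name: str, root: str, pattern: str = "**/requirements.txt") -> str:
    """Build a key from the OS name and the contents of matching files.

    Illustrative only: mimics the shape of the workflow key above,
    not GitHub's actual hashFiles implementation.
    """
    digest = hashlib.sha256()
    # Sort the paths so the result does not depend on filesystem order.
    for path in sorted(Path(root).glob(pattern)):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return f"{os_name}-pip-{digest.hexdigest()}"


if __name__ == "__main__":
    print(cache_key("Linux", "."))
```

If a key changes between runs with no dependency change, look for inputs such as timestamps or run IDs in the key expression.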
Timeout Issues¶
Problem: Job times out
Solutions:
jobs:
  build:
    timeout-minutes: 30  # Job-level timeout; default is 360
    steps:
      - name: Long running step
        timeout-minutes: 10  # Step-level timeout
        run: ./long-script.sh
Docker Build Issues¶
Build Fails with OOM¶
Problem: Docker build killed due to memory
Solutions:
# Use a larger runner
jobs:
  build:
    runs-on: ubuntu-latest-4-cores  # Or custom runner
# Optimize the Dockerfile
# Use slim images
FROM python:3.12-slim
# Skip pip's download cache to keep the layer small (pip has no -j option)
RUN pip install --no-cache-dir -r requirements.txt
Layer Caching Not Working¶
Problem: Docker rebuilds all layers
Solutions:
# Order Dockerfile for better caching
# Dependencies first (changes less often)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Application code last (changes more often)
COPY . .
# Enable BuildKit caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max
Multi-Platform Build Fails¶
Problem: ARM build fails or is slow
Solutions:
# Use QEMU to emulate non-native architectures
- uses: docker/setup-qemu-action@v3
# Build platforms separately if one fails
strategy:
  matrix:
    platform: [linux/amd64, linux/arm64]
steps:
  - uses: docker/build-push-action@v5
    with:
      platforms: ${{ matrix.platform }}
Test Failures¶
Flaky Tests¶
Problem: Tests pass locally but fail in CI
Debug:
- name: Run tests with verbose output
  run: pytest -v --tb=long -x  # Verbose, long tracebacks, stop on first failure
- name: Upload test artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: test-logs
    path: |
      pytest.log
      screenshots/
Solutions:
# Add retries for flaky tests (requires the pytest-rerunfailures plugin)
import pytest

@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_external_api():
    pass
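When adding a plugin is not an option, a retry can be applied to just the unreliable call inside a test. The decorator below is a minimal sketch of that idea; `retry` and its parameters are hypothetical and are not part of pytest. Note that `times` counts total attempts, whereas `reruns=3` above means up to three extra runs.

```python
import functools
import time


def retry(times: int = 3, delay: float = 1.0, exceptions=(Exception,)):
    """Call the wrapped function up to `times` times before giving up."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Re-raise on the final attempt so the failure stays visible.
                    if attempt == times:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@retry(times=3, delay=1.0, exceptions=(ConnectionError,))
def fetch_status():
    # Placeholder for a call to an external service.
    return "ok"
```

Restricting `exceptions` to network errors keeps real assertion failures from being retried and hidden.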
Database Connection Refused¶
Problem: Can't connect to service container
Solutions:
services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: postgres  # The image refuses to start without a password
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
steps:
  # Wait for the service to be ready
  - name: Wait for Postgres
    run: |
      for i in {1..30}; do
        pg_isready -h localhost -p 5432 && break
        sleep 1
      done
  - name: Run tests
    env:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/postgres
    run: pytest
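If `pg_isready` is not installed on the runner, the wait loop can be approximated from the test code itself. The following sketch polls a TCP port until it accepts connections; `wait_for_port` is a hypothetical helper. An open port is a weaker signal than `pg_isready`, since the server may accept connections before it is ready for queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    if not wait_for_port("localhost", 5432):
        raise SystemExit("Postgres did not become reachable in time")
```

Unlike the shell loop above, this exits with a non-zero status when the service never comes up, so the job fails at the wait step instead of in the tests.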
Out of Disk Space¶
Problem: No space left on device
Solutions:
- name: Free disk space
  run: |
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /opt/ghc
    sudo rm -rf /usr/local/share/boost
    docker system prune -af
    df -h
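A pre-flight check can fail the job early with a clear message instead of a late "No space left on device". This is a minimal sketch using the standard library; the helper names and the 10 GiB threshold are assumptions, not recommended values.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Return the free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def require_free_space(min_gib: float, path: str = "/") -> None:
    """Exit with an error message if less than `min_gib` GiB is free."""
    available = free_gib(path)
    if available < min_gib:
        raise SystemExit(
            f"Only {available:.1f} GiB free on {path}; need at least {min_gib} GiB"
        )


if __name__ == "__main__":
    # Hypothetical threshold; tune it to the size of your build.
    require_free_space(10)
```

Run it as a step right after the cleanup to confirm that enough space was actually reclaimed.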
Deployment Issues¶
Kubernetes Deployment Stuck¶
Problem: Rollout never completes
Debug:
- name: Check deployment status
  if: failure()
  run: |
    kubectl describe deployment api
    kubectl get pods -l app=api
    kubectl logs -l app=api --tail=100
Solutions:
# Add proper health checks
spec:
  containers:
    - readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
      # Check for resource limits
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"
Image Pull Errors¶
Problem: ImagePullBackOff or ErrImagePull
Solutions:
# Check that the image exists
- name: Verify image
  run: docker pull myregistry/myapp:${{ github.sha }}
# Check registry credentials
- name: Login to registry
  uses: docker/login-action@v3
  with:
    registry: myregistry.com
    username: ${{ secrets.REGISTRY_USER }}
    password: ${{ secrets.REGISTRY_PASSWORD }}
# Create an image pull secret in Kubernetes
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.com \
  --docker-username=$USER \
  --docker-password=$PASSWORD
Migration Failures¶
Problem: Database migration fails during deploy
Solutions:
# Run migrations separately
- name: Run migrations
  run: alembic upgrade head
  continue-on-error: false  # The default: a failed migration fails the job
- name: Verify migration
  run: alembic current
# Only deploy if migrations succeed
- name: Deploy
  if: success()
  run: ./deploy.sh
Debugging Techniques¶
Enable Debug Logging¶
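# Set a repository secret or variable: ACTIONS_STEP_DEBUG = true
# (setting it under env: in the workflow does not enable debug logging)
# Emit your own debug messages from a step:
- name: Debug step
  run: echo "::debug::This is a debug message"  # Shown only when step debug logging is on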
SSH into Runner¶
Dump Context¶
- name: Dump GitHub context
  env:
    GITHUB_CONTEXT: ${{ toJson(github) }}
  run: echo "$GITHUB_CONTEXT"
- name: Dump job context
  env:
    JOB_CONTEXT: ${{ toJson(job) }}
  run: echo "$JOB_CONTEXT"
Artifact Debug Logs¶
- name: Upload debug logs
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: debug-logs
    path: |
      /var/log/
      ~/.npm/_logs/
      pytest.log
    retention-days: 5
Performance Issues¶
Slow Checkout¶
Problem: Checkout takes too long
Solutions:
- uses: actions/checkout@v4
  with:
    fetch-depth: 1  # Shallow clone (the default)
    # Or, when full history is needed (e.g. for versioning):
    # fetch-depth: 0
Slow Dependency Install¶
Solutions:
# Use caching
- uses: actions/setup-python@v5
  with:
    python-version: '3.12'
    cache: 'pip'
# Use faster package managers (for JavaScript projects)
- uses: oven-sh/setup-bun@v2
# Install only production deps in CI
# (pip has no --no-dev flag; keep dev deps in a separate requirements file)
- run: pip install -r requirements.txt
Parallel Jobs Not Running¶
Problem: Jobs run sequentially instead of parallel
Check:
# Jobs run in parallel by default
jobs:
  job1:
    runs-on: ubuntu-latest
  job2:
    runs-on: ubuntu-latest
  # job1 and job2 run in parallel
  # needs: creates a dependency
  job3:
    needs: [job1, job2]  # Waits for both
Quick Reference¶
Common Error Messages¶
| Error | Cause | Fix |
|---|---|---|
| Permission denied | Missing permissions | Add permissions: block |
| Secret not found | Wrong secret name | Check name, environment |
| Context deadline exceeded | Timeout | Increase timeout or optimize |
| No space left on device | Disk full | Clean up or use larger runner |
| Connection refused | Service not ready | Add health check, wait |
| ImagePullBackOff | Can't pull image | Check registry auth, image tag |
| OOMKilled | Out of memory | Increase limits or optimize |
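The table above can also be applied mechanically to a downloaded log, for example the output of `gh run view <run-id> --log`. The sketch below assumes a case-insensitive substring match is good enough for a first pass; `HINTS` and `suggest_fixes` are hypothetical names, and the hints simply restate the table.

```python
# Error text from the table above, mapped to its suggested fix.
HINTS = {
    "Permission denied": "Add a permissions: block",
    "Secret not found": "Check the secret name and environment",
    "Context deadline exceeded": "Increase the timeout or optimize",
    "No space left on device": "Clean up or use a larger runner",
    "Connection refused": "Add a health check and wait for the service",
    "ImagePullBackOff": "Check registry auth and the image tag",
    "OOMKilled": "Increase memory limits or optimize",
}


def suggest_fixes(log_text: str) -> list[str]:
    """Return one hint per known error message found in the log text."""
    lowered = log_text.lower()
    return [
        f"{error}: {fix}"
        for error, fix in HINTS.items()
        if error.lower() in lowered
    ]


if __name__ == "__main__":
    import sys

    # Usage: gh run view <run-id> --log | python suggest_fixes.py
    for hint in suggest_fixes(sys.stdin.read()):
        print(hint)
```

This narrows down where to look; it does not replace reading the failing step's full output.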
Useful Commands¶
# GitHub CLI debugging
gh run list --workflow=ci.yml
gh run view <run-id>
gh run view <run-id> --log
# Kubernetes debugging
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'
# Docker debugging
docker logs <container-id>
docker inspect <image-name>
docker system df