CI/CD Troubleshooting

Common issues and solutions for CI/CD pipelines.

GitHub Actions Issues

Workflow Not Triggering

Problem: Push doesn't trigger workflow

Solutions:

# Check trigger configuration
on:
  push:
    branches:
      - main
      - 'feature/**'  # Quote patterns that contain *, ! or other special characters
    paths:
      - 'src/**'
      - '!src/**/*.md'  # Exclude markdown

# Check for [skip ci] in commit message
# Commits with this won't trigger
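
To check a commit message for skip keywords before pushing, a small helper can do the scan. This is a minimal sketch: the keyword list reflects the skip strings GitHub documents for push and pull_request events, while the function name and the case-insensitive comparison are assumptions that err on the side of flagging a message.

```python
# Skip keywords GitHub Actions recognizes in commit messages
# for push and pull_request events.
SKIP_DIRECTIVES = (
    "[skip ci]",
    "[ci skip]",
    "[no ci]",
    "[skip actions]",
    "[actions skip]",
)


def has_skip_directive(commit_message: str) -> bool:
    """Return True if the message contains any skip keyword.

    Assumption: the message is lowercased before comparing, so this
    may flag messages that GitHub itself would not skip.
    """
    lowered = commit_message.lower()
    return any(directive in lowered for directive in SKIP_DIRECTIVES)
```

For example, `has_skip_directive("docs: fix typo [skip ci]")` returns `True`.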

Debug:

# Check that GitHub has registered the workflow
gh workflow list

# View workflow runs
gh run list --workflow=ci.yml

Permission Denied

Problem: Permission denied or 403 errors

Solutions:

# Add permissions to job
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write  # For OIDC

# Or at workflow level (applies to GITHUB_TOKEN in every job)
permissions:
  contents: write  # For pushing tags
  pull-requests: write  # For PR comments

Secret Not Available

Problem: Secret not found or empty value

Check:

# Secret names are not case-sensitive; check spelling and scope instead
env:
  API_KEY: ${{ secrets.API_KEY }}  # Resolves the secret named API_KEY
  api_key: ${{ secrets.api_key }}  # Same secret, not a different one

# Secrets are not passed to workflows triggered by pull requests from forks
- if: github.event.pull_request.head.repo.full_name == github.repository
  env:
    SECRET: ${{ secrets.MY_SECRET }}
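
An unavailable secret reaches the step as an empty string, so a script can fail fast instead of running with a blank value. A minimal sketch, assuming the secrets are exposed as environment variables; `missing_secrets` is a hypothetical helper, not part of any GitHub tooling.

```python
import os


def missing_secrets(names, environ=None):
    """Return the names that are unset or empty in the environment.

    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


def require_secrets(names, environ=None):
    """Raise a clear error early if any required value is missing."""
    missing = missing_secrets(names, environ)
    if missing:
        raise RuntimeError("Missing or empty secrets: " + ", ".join(missing))
```

Calling `require_secrets(["API_KEY"])` at the top of a deploy script turns a silent empty value into an explicit failure.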

Cache Not Working

Problem: Cache miss every run

Solutions:

# Verify cache key is consistent
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    # Key must be deterministic
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    # Fallback keys
    restore-keys: |
      ${{ runner.os }}-pip-

# Check if path exists
- run: ls -la ~/.cache/pip || echo "Cache directory doesn't exist"
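
The idea behind a deterministic key can be sketched in a few lines: the key changes only when the contents of the dependency files change. This illustrates the principle only; it does not reproduce the exact algorithm of `hashFiles`, and `cache_key` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def cache_key(prefix, files):
    """Build a key from a prefix and the SHA-256 of the files' contents.

    Files are processed in sorted order so the result does not depend
    on the order in which they were listed.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(f) for f in files):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()}"
```

A key that embeds a timestamp or a run ID would differ on every run and always miss; a content hash only changes when the inputs do.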

Timeout Issues

Problem: Job times out

Solutions:

jobs:
  build:
    timeout-minutes: 30  # Default is 360

    steps:
      - name: Long running step
        timeout-minutes: 10  # Step-level timeout
        run: ./long-script.sh
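
When a script launches its own long-running commands, the same limit can be enforced inside the script so the failure points at the slow command rather than the whole job. A minimal sketch using the standard library; `run_with_timeout` is a hypothetical helper.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit code, or None if it exceeded the limit.

    subprocess.run kills the child process when the timeout expires
    and raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, `run_with_timeout(["./long-script.sh"], 600)` returns `None` if the script runs longer than ten minutes.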

Docker Build Issues

Build Fails with OOM

Problem: Docker build killed due to memory

Solutions:

# Use larger runner
jobs:
  build:
    runs-on: ubuntu-latest-4-cores  # Example larger-runner label (names are set per organization), or a self-hosted runner

# Optimize Dockerfile
# Use slim images (Dockerfile comments must be on their own line)
FROM python:3.12-slim

# Keep pip's download cache out of the image layer (pip has no -j flag)
RUN pip install --no-cache-dir -r requirements.txt

Layer Caching Not Working

Problem: Docker rebuilds all layers

Solutions:

# Order Dockerfile for better caching
# Dependencies first (changes less often)
COPY requirements.txt .
RUN pip install -r requirements.txt

# Application code last (changes more often)
COPY . .

# Enable BuildKit caching (the gha cache backend needs a Buildx builder)
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

Multi-Platform Build Fails

Problem: ARM build fails or is slow

Solutions:

# Use QEMU to emulate non-native platforms (slower than a native build)
- uses: docker/setup-qemu-action@v3

# Build platforms separately if one fails
strategy:
  matrix:
    platform: [linux/amd64, linux/arm64]
steps:
  - uses: docker/build-push-action@v5
    with:
      platforms: ${{ matrix.platform }}

Test Failures

Flaky Tests

Problem: Tests pass locally but fail in CI

Debug:

- name: Run tests with verbose output
  run: pytest -v --tb=long -x  # Stop on first failure, long traceback

- name: Upload test artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: test-logs
    path: |
      pytest.log
      screenshots/

Solutions:

# Add retry for flaky tests (requires the pytest-rerunfailures plugin)
import pytest

@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_external_api():
    pass
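
Where installing a plugin is not an option, the same retry idea can be sketched as a plain decorator. This is an illustration, not the plugin's implementation: `retry` and its parameters are hypothetical, and `times` counts total attempts rather than reruns.

```python
import functools
import time


def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Call the wrapped function up to `times` times before giving up."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Re-raise the last failure so the test still fails loudly.
                    if attempt == times:
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator
```

Retries hide the symptom, not the cause; narrowing `exceptions` to the errors a flaky dependency actually raises keeps real regressions visible.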

Database Connection Refused

Problem: Can't connect to service container

Solutions:

services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: postgres  # Required by the image; matches DATABASE_URL below
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

steps:
  # Wait for service to be ready
  - name: Wait for Postgres
    run: |
      for i in {1..30}; do
        pg_isready -h localhost -p 5432 && break
        sleep 1
      done

  - name: Run tests
    env:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/postgres
    run: pytest
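
If `pg_isready` is not installed on the runner, the wait loop can be approximated with a TCP check from the test setup. A minimal sketch with hypothetical names; note that an open port only shows the server is listening, not that it is ready to run queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Closing the connection immediately is enough for a liveness check.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Unlike the shell loop, this returns `False` when the service never comes up, so the caller can fail the job with a clear message.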

Out of Disk Space

Problem: No space left on device

Solutions:

- name: Free disk space
  run: |
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /opt/ghc
    sudo rm -rf /usr/local/share/boost
    docker system prune -af
    df -h
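
A pre-flight check can report low disk space before a large build starts instead of failing halfway through. A minimal sketch using the standard library; the function names and the 5 GiB default are assumptions to adjust per project.

```python
import shutil


def free_gib(path="/"):
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_space(path="/", required_gib=5.0):
    """True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib
```

Calling `has_space(".", 10)` before a Docker build gives an early, readable failure when the runner is already nearly full.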

Deployment Issues

Kubernetes Deployment Stuck

Problem: Rollout never completes

Debug:

- name: Check deployment status
  if: failure()
  run: |
    kubectl describe deployment api
    kubectl get pods -l app=api
    kubectl logs -l app=api --tail=100

Solutions:

# Add proper health checks
spec:
  containers:
    - readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5

# Check for resource limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Image Pull Errors

Problem: ImagePullBackOff or ErrImagePull

Solutions:

# Check image exists
- name: Verify image
  run: docker pull myregistry.com/myapp:${{ github.sha }}

# Check registry credentials
- name: Login to registry
  uses: docker/login-action@v3
  with:
    registry: myregistry.com
    username: ${{ secrets.REGISTRY_USER }}
    password: ${{ secrets.REGISTRY_PASSWORD }}

# Create image pull secret in Kubernetes
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.com \
  --docker-username=$USER \
  --docker-password=$PASSWORD

Migration Failures

Problem: Database migration fails during deploy

Solutions:

# Run migrations separately
- name: Run migrations
  run: alembic upgrade head
  continue-on-error: false

- name: Verify migration
  run: alembic current

# Only deploy if migrations succeed
- name: Deploy
  if: success()
  run: ./deploy.sh

Debugging Techniques

Enable Debug Logging

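# Set a repository secret or variable: ACTIONS_STEP_DEBUG = true
# (setting it as a step-level env value does not enable debug logging)

# Debug messages appear in the log only when debug logging is enabled
- name: Debug step
  run: echo "::debug::This is a debug message"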

SSH into Runner

- name: Setup tmate session
  if: failure()
  uses: mxschmitt/action-tmate@v3
  timeout-minutes: 15

Dump Context

- name: Dump GitHub context
  env:
    GITHUB_CONTEXT: ${{ toJson(github) }}
  run: echo "$GITHUB_CONTEXT"

- name: Dump job context
  env:
    JOB_CONTEXT: ${{ toJson(job) }}
  run: echo "$JOB_CONTEXT"
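
The event payload is also available to scripts as a JSON file whose path is given in the `GITHUB_EVENT_PATH` environment variable. A minimal sketch for inspecting it from Python; `load_event` and `summarize` are hypothetical helpers.

```python
import json
import os


def load_event(path=None):
    """Load the webhook payload for the current run.

    Falls back to the file named by GITHUB_EVENT_PATH when no path is given.
    """
    path = path or os.environ["GITHUB_EVENT_PATH"]
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(event):
    """Return the sorted top-level keys, a quick overview of what the payload holds."""
    return sorted(event)
```

Printing `summarize(load_event())` in a step is a lighter alternative to dumping the whole context when only the payload shape is in question.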

Artifact Debug Logs

- name: Upload debug logs
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: debug-logs
    path: |
      /var/log/
      ~/.npm/_logs/
      pytest.log
    retention-days: 5

Performance Issues

Slow Checkout

Problem: Checkout takes too long

Solutions:

- uses: actions/checkout@v4
  with:
    fetch-depth: 1  # Shallow clone (the default)
    # Or, when full history is needed (e.g. for versioning):
    # fetch-depth: 0

Slow Dependency Install

Solutions:

# Use caching
- uses: actions/setup-python@v5
  with:
    python-version: '3.12'
    cache: 'pip'

# Use faster package managers
- uses: oven-sh/setup-bun@v2

# Install only production deps in CI
# (pip has no --no-dev flag; keep dev-only packages in a separate requirements file)
- run: pip install -r requirements.txt

Parallel Jobs Not Running

Problem: Jobs run sequentially instead of parallel

Check:

# Jobs are parallel by default
jobs:
  job1:
    runs-on: ubuntu-latest
  job2:
    runs-on: ubuntu-latest
  # These run in parallel

# needs: creates dependency
  job3:
    needs: [job1, job2]  # Waits for both
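
To see which jobs can run side by side, the `needs:` relationships can be treated as a dependency graph and grouped into waves. This is a sketch of the scheduling idea, not of how GitHub schedules jobs internally; `execution_waves` is a hypothetical helper and requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def execution_waves(needs):
    """Group jobs into waves; jobs in the same wave have no unmet dependencies.

    `needs` maps each job name to the list of jobs it waits for.
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # Raises graphlib.CycleError on circular needs
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the workflow above, `execution_waves({"job1": [], "job2": [], "job3": ["job1", "job2"]})` returns `[["job1", "job2"], ["job3"]]`. A chain of single-job waves means the `needs:` entries are forcing sequential execution.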

Quick Reference

Common Error Messages

Error                      Cause                Fix
-------------------------  -------------------  -------------------------------
Permission denied          Missing permissions  Add a permissions: block
Secret not found           Wrong secret name    Check spelling and scope/environment
Context deadline exceeded  Timeout              Increase timeout or optimize
No space left on device    Disk full            Clean up or use larger runner
Connection refused         Service not ready    Add health check, wait
ImagePullBackOff           Can't pull image     Check registry auth, image tag
OOMKilled                  Out of memory        Increase limits or optimize
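
The table can double as a lookup when scanning logs. A minimal sketch that maps a log line to the matching cause and fix by substring; the entries are taken from the table above, while `diagnose` and the matching strategy are assumptions.

```python
# (cause, fix) pairs keyed by a lowercase fragment of the error message.
KNOWN_ERRORS = {
    "permission denied": ("Missing permissions", "Add a permissions: block"),
    "secret not found": ("Wrong secret name", "Check spelling and scope/environment"),
    "context deadline exceeded": ("Timeout", "Increase timeout or optimize"),
    "no space left on device": ("Disk full", "Clean up or use larger runner"),
    "connection refused": ("Service not ready", "Add health check, wait"),
    "imagepullbackoff": ("Can't pull image", "Check registry auth, image tag"),
    "oomkilled": ("Out of memory", "Increase limits or optimize"),
}


def diagnose(log_line):
    """Return (cause, fix) for the first known error found in the line, else None."""
    lowered = log_line.lower()
    for fragment, hint in KNOWN_ERRORS.items():
        if fragment in lowered:
            return hint
    return None
```

For example, `diagnose("write /tmp/x: no space left on device")` returns `("Disk full", "Clean up or use larger runner")`.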

Useful Commands

# GitHub CLI debugging
gh run list --workflow=ci.yml
gh run view <run-id>
gh run view <run-id> --log

# Kubernetes debugging
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'

# Docker debugging
docker logs <container-id>
docker inspect <image-name>
docker system df