# Monitoring

Overview of Sartiq's observability infrastructure, including logging, metrics, and alerting.


## Overview

All logs are centralized in Better Stack for unified observability across services.

```mermaid
flowchart LR
    subgraph Applications
        Backend[Backend API]
        Compute[Compute Server]
    end

    subgraph BetterStack["Better Stack"]
        Logs[(Log Storage)]
        Search[Search & Query]
        Alerts[Alerting]
        Dashboard[Dashboards]
    end

    Backend --> Logs
    Compute --> Logs
    Logs --> Search
    Logs --> Alerts
    Logs --> Dashboard
```

## Logging

### Log Aggregation

All application logs are shipped to Better Stack:

| Source | Log Types |
|---|---|
| Backend API | Request logs, errors, business events |
| Compute Server | Task execution, provider calls, workflow events |
| Celery Workers | Task processing, retries, failures |

### Log Format

Logs are emitted as structured JSON so they can be queried by field:

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "service": "backend",
  "message": "Generation completed",
  "generation_id": "uuid",
  "duration_ms": 15000,
  "trace_id": "abc123"
}
```
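A minimal sketch of producing logs in this shape with Python's stdlib `logging`. The `JsonFormatter` class and the exact set of copied extra fields are assumptions for illustration, not the actual implementation:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line matching the schema above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields passed via `extra=` land as attributes on the
        # record; copy over any that are present.
        for field in ("trace_id", "generation_id", "duration_ms", "user_id", "org_id"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("backend")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="backend"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Generation completed",
    extra={"generation_id": "uuid", "duration_ms": 15000, "trace_id": "abc123"},
)
```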

### Key Fields

| Field | Description |
|---|---|
| `timestamp` | ISO 8601 timestamp |
| `level` | Log level (`debug`, `info`, `warning`, `error`) |
| `service` | Service name (`backend`, `compute`) |
| `trace_id` | Request/workflow trace ID |
| `user_id` | Associated user (if applicable) |
| `org_id` | Associated organization (if applicable) |

## Alerting

### Alert Configuration

| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | > 5% errors in 5 min | Critical |
| Slow Responses | p99 latency > 10s | Warning |
| Queue Backlog | > 1000 pending tasks | Warning |
| Database Connections | > 80% capacity | Warning |
| Generation Failures | > 5% failure rate | Warning |
| Provider Errors | Provider error spike | Warning |
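The threshold rules above are simple ratio checks over a time window. A hypothetical sketch of one such rule in Python (the function and type names are illustrative, not the actual Better Stack alert configuration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    severity: str  # "critical" or "warning"


def evaluate_error_rate(
    total_requests: int,
    error_responses: int,
    threshold: float = 0.05,
) -> Optional[Alert]:
    """Fire the High Error Rate alert when more than 5% of requests
    in the 5-minute window returned errors."""
    if total_requests == 0:
        # No traffic in the window: nothing to alert on.
        return None
    rate = error_responses / total_requests
    if rate > threshold:
        return Alert(name="High Error Rate", severity="critical")
    return None
```

For example, 80 errors out of 1000 requests (8%) would trip the alert, while 10 out of 1000 (1%) would not.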

### Notification Channels

- Slack integration for team alerts
- Email for critical alerts
- PagerDuty integration (if configured)

## Key Metrics

### Application Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `request_latency_p99` | 99th percentile response time | > 10s |
| `error_rate` | Percentage of 5xx responses | > 5% |
| `generation_success_rate` | Successful generations | < 95% |
| `generation_latency_p99` | Generation completion time | > 60s |
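For reference, the p99 metrics above are the 99th percentile of a window of latency samples, which can be computed with the nearest-rank method (a sketch; the sample window itself is an assumption):

```python
import math


def p99(samples: list) -> float:
    """99th percentile of a latency window via the nearest-rank method:
    the ceil(0.99 * n)-th smallest sample (1-indexed)."""
    if not samples:
        raise ValueError("empty window")
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# One slow outlier dominates the p99 even when most requests are fast.
latencies_ms = [120, 95, 110, 15000, 130]
assert p99(latencies_ms) == 15000
```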

### Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `cpu_utilization` | Server CPU usage | > 80% |
| `memory_utilization` | Server memory usage | > 85% |
| `disk_utilization` | Disk space usage | > 80% |
| `db_connections` | Active database connections | > 80% capacity |

### Queue Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `queue_depth` | Tasks waiting in queue | > 100 |
| `task_processing_time` | Average task duration | > 30s |
| `task_failure_rate` | Failed tasks percentage | > 5% |

## Tracing

Each request and generation has a trace ID for end-to-end debugging:

```text
trace_id: gen-{generation_id}
├── backend.create_generation
├── compute.submit_workflow
├── compute.task.background_removal
├── compute.task.generation
├── compute.task.face_enhancement
├── compute.task.garment_enhancement
├── compute.task.upscale
└── backend.handle_completion
```
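One way a trace ID like this can follow a request through every step is a context variable that each log call reads. A minimal sketch using Python's `contextvars`; the helper names are hypothetical:

```python
import contextvars

# Holds the trace ID for the current request/workflow context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


def start_generation_trace(generation_id: str) -> str:
    """Derive the trace ID from the generation ID and bind it to the context."""
    trace_id = f"gen-{generation_id}"
    trace_id_var.set(trace_id)
    return trace_id


def log_span(name: str) -> dict:
    """Attach the current trace ID to a span record, so every step
    (backend.create_generation, compute.task.*, ...) shares one trace."""
    return {"span": name, "trace_id": trace_id_var.get()}


start_generation_trace("1234")
print(log_span("backend.create_generation"))
```

Because the variable is context-local, concurrent requests each see their own trace ID without passing it through every function signature.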

## Dashboards

### Key Dashboards

| Dashboard | Purpose |
|---|---|
| System Overview | High-level health metrics |
| Generation Pipeline | Generation success/failure rates, timing |
| API Performance | Request latency, error rates |
| Queue Health | Task queue depth, processing times |
| Provider Status | AI provider availability and errors |