# Monitoring

Overview of Sartiq's observability infrastructure, including logging, metrics, and alerting.


## Overview

All logs are centralized in Better Stack for unified observability across services.

```mermaid
flowchart LR
    subgraph Applications
        Backend[Backend API]
        Compute[Compute Server]
    end

    subgraph BetterStack["Better Stack"]
        Logs[(Log Storage)]
        Search[Search & Query]
        Alerts[Alerting]
        Dashboard[Dashboards]
    end

    Backend --> Logs
    Compute --> Logs
    Logs --> Search
    Logs --> Alerts
    Logs --> Dashboard
```

## Logging

### Log Aggregation

All application logs are shipped to Better Stack:

| Source | Log Types |
|---|---|
| Backend API | Request logs, errors, business events |
| Compute Server | Task execution, provider calls, workflow events |
| Celery Workers | Task processing, retries, failures |

### Log Format

Logs are emitted as structured JSON so they can be queried by field:

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "service": "backend",
  "message": "Generation completed",
  "generation_id": "uuid",
  "duration_ms": 15000,
  "trace_id": "abc123"
}
```
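A minimal sketch of producing logs in this shape with Python's stdlib `logging`. The `JsonFormatter` class and the exact set of copied extra fields are assumptions for illustration, not the actual implementation:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line matching the schema above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields passed via `extra=` land as attributes on the
        # record; copy over any that are present.
        for field in ("trace_id", "generation_id", "duration_ms", "user_id", "org_id"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("backend")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="backend"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Generation completed",
    extra={"generation_id": "uuid", "duration_ms": 15000, "trace_id": "abc123"},
)
```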

### Key Fields

| Field | Description |
|---|---|
| `timestamp` | ISO 8601 timestamp |
| `level` | Log level (`debug`, `info`, `warning`, `error`) |
| `service` | Service name (`backend`, `compute`) |
| `trace_id` | Request/workflow trace ID |
| `user_id` | Associated user (if applicable) |
| `org_id` | Associated organization (if applicable) |

## Alerting

### Alert Configuration

| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | > 5% errors in 5 min | Critical |
| Slow Responses | p99 latency > 10s | Warning |
| Queue Backlog | > 1000 pending tasks | Warning |
| Database Connections | > 80% capacity | Warning |
| Generation Failures | > 5% failure rate | Warning |
| Provider Errors | Provider error spike | Warning |
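The threshold rules above are simple ratio checks over a time window. A hypothetical sketch of one such rule in Python (the function and type names are illustrative, not the actual Better Stack alert configuration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    severity: str  # "critical" or "warning"


def evaluate_error_rate(
    total_requests: int,
    error_responses: int,
    threshold: float = 0.05,
) -> Optional[Alert]:
    """Fire the High Error Rate alert when more than 5% of requests
    in the 5-minute window returned errors."""
    if total_requests == 0:
        # No traffic in the window: nothing to alert on.
        return None
    rate = error_responses / total_requests
    if rate > threshold:
        return Alert(name="High Error Rate", severity="critical")
    return None
```

For example, 80 errors out of 1000 requests (8%) would trip the alert, while 10 out of 1000 (1%) would not.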

### Notification Channels

- Slack integration for team alerts
- Email for critical alerts
- PagerDuty integration (if configured)

## Key Metrics

### Application Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `request_latency_p99` | 99th percentile response time | > 10s |
| `error_rate` | Percentage of 5xx responses | > 5% |
| `generation_success_rate` | Successful generations | < 95% |
| `generation_latency_p99` | Generation completion time | > 60s |
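For reference, the p99 metrics above are the 99th percentile of a window of latency samples, which can be computed with the nearest-rank method (a sketch; the sample window itself is an assumption):

```python
import math


def p99(samples: list) -> float:
    """99th percentile of a latency window via the nearest-rank method:
    the ceil(0.99 * n)-th smallest sample (1-indexed)."""
    if not samples:
        raise ValueError("empty window")
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# One slow outlier dominates the p99 even when most requests are fast.
latencies_ms = [120, 95, 110, 15000, 130]
assert p99(latencies_ms) == 15000
```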

### Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `cpu_utilization` | Server CPU usage | > 80% |
| `memory_utilization` | Server memory usage | > 85% |
| `disk_utilization` | Disk space usage | > 80% |
| `db_connections` | Active database connections | > 80% capacity |

### Queue Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `queue_depth` | Tasks waiting in queue | > 100 |
| `task_processing_time` | Average task duration | > 30s |
| `task_failure_rate` | Failed tasks percentage | > 5% |

## Tracing

Each request and generation has a trace ID for end-to-end debugging:

```text
trace_id: gen-{generation_id}
├── backend.create_generation
├── compute.submit_workflow
├── compute.task.background_removal
├── compute.task.generation
├── compute.task.face_enhancement
├── compute.task.garment_enhancement
├── compute.task.upscale
└── backend.handle_completion
```
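One way a trace ID like this can follow a request through every step is a context variable that each log call reads. A minimal sketch using Python's `contextvars`; the helper names are hypothetical:

```python
import contextvars

# Holds the trace ID for the current request/workflow context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


def start_generation_trace(generation_id: str) -> str:
    """Derive the trace ID from the generation ID and bind it to the context."""
    trace_id = f"gen-{generation_id}"
    trace_id_var.set(trace_id)
    return trace_id


def log_span(name: str) -> dict:
    """Attach the current trace ID to a span record, so every step
    (backend.create_generation, compute.task.*, ...) shares one trace."""
    return {"span": name, "trace_id": trace_id_var.get()}


start_generation_trace("1234")
print(log_span("backend.create_generation"))
```

Because the variable is context-local, concurrent requests each see their own trace ID without passing it through every function signature.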

## Dashboards

### Key Dashboards

| Dashboard | Purpose |
|---|---|
| System Overview | High-level health metrics |
| Generation Pipeline | Generation success/failure rates, timing |
| API Performance | Request latency, error rates |
| Queue Health | Task queue depth, processing times |
| Provider Status | AI provider availability and errors |