# Monitoring

Overview of Sartiq's observability infrastructure, including logging, metrics, and alerting.
## Overview

All logs are centralized in Better Stack for unified observability across services.
```mermaid
flowchart LR
    subgraph Applications
        Backend[Backend API]
        Compute[Compute Server]
    end
    subgraph BetterStack["Better Stack"]
        Logs[(Log Storage)]
        Search[Search & Query]
        Alerts[Alerting]
        Dashboard[Dashboards]
    end
    Backend --> Logs
    Compute --> Logs
    Logs --> Search
    Logs --> Alerts
    Logs --> Dashboard
```
## Logging

### Log Aggregation

All application logs are shipped to Better Stack:
| Source | Log Types |
|---|---|
| Backend API | Request logs, errors, business events |
| Compute Server | Task execution, provider calls, workflow events |
| Celery Workers | Task processing, retries, failures |
Logs use structured JSON for easy querying:

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "service": "backend",
  "message": "Generation completed",
  "generation_id": "uuid",
  "duration_ms": 15000,
  "trace_id": "abc123"
}
```
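A log line in this shape can be produced with a small `logging.Formatter` subclass. The sketch below is illustrative, not Sartiq's actual logger configuration: the hard-coded `service` value and the set of forwarded extra fields are assumptions.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line matching the schema above."""

    # Structured extras forwarded when a log call supplies them via `extra=`.
    EXTRA_FIELDS = ("generation_id", "duration_ms", "trace_id", "user_id", "org_id")

    def format(self, record):
        entry = {
            # ISO 8601 timestamp in UTC with millisecond precision.
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
            + ".%03dZ" % record.msecs,
            "level": record.levelname.lower(),
            "service": "backend",  # assumption: set once per service at startup
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)
```

Attached to a `StreamHandler`, this lets call sites log structured fields naturally, e.g. `logger.info("Generation completed", extra={"generation_id": gid, "trace_id": tid})`.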
### Key Fields

| Field | Description |
|---|---|
| `timestamp` | ISO 8601 timestamp |
| `level` | Log level (`debug`, `info`, `warning`, `error`) |
| `service` | Service name (`backend`, `compute`) |
| `trace_id` | Request/workflow trace ID |
| `user_id` | Associated user (if applicable) |
| `org_id` | Associated organization |
## Alerting

### Alert Configuration
| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | > 5% errors in 5min | Critical |
| Slow Responses | p99 latency > 10s | Warning |
| Queue Backlog | > 1000 pending tasks | Warning |
| Database Connections | > 80% capacity | Warning |
| Generation Failures | > 5% failure rate | Warning |
| Provider Errors | Provider error spike | Warning |
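The "High Error Rate" condition above (> 5% errors in 5 minutes) amounts to a rolling-window ratio check. Better Stack evaluates these conditions server-side; the class below is a hypothetical in-process sketch of the same logic, useful for understanding what the alert measures:

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Flag when the error ratio in a rolling window exceeds a threshold,
    e.g. > 5% errors in the last 5 minutes."""

    def __init__(self, window_seconds=300, threshold=0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, is_error, now=None):
        now = time.time() if now is None else now
        self.events.append((now, is_error))
        self._evict(now)

    def _evict(self, now):
        # Drop events that have fallen out of the rolling window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def should_alert(self, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        if not self.events:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events) > self.threshold
```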
### Notification Channels
- Slack integration for team alerts
- Email for critical alerts
- PagerDuty integration (if configured)
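Slack alerts are typically delivered through incoming webhooks, which accept a JSON body with a `text` field. A sketch of building that payload from a row of the alert table above; the emoji mapping and message layout are illustrative assumptions, not Sartiq's actual Slack configuration:

```python
import json

# Assumed severity-to-emoji mapping for quick visual triage in Slack.
SEVERITY_EMOJI = {"Critical": ":rotating_light:", "Warning": ":warning:"}


def build_slack_payload(alert_name, condition, severity):
    """Build the JSON-serializable body for a Slack incoming-webhook POST."""
    emoji = SEVERITY_EMOJI.get(severity, ":bell:")
    return {"text": f"{emoji} *{severity}*: {alert_name} ({condition})"}
```

The returned dict would be serialized with `json.dumps` and POSTed to the team's webhook URL (kept in secrets, not shown here).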
## Key Metrics

### Application Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `request_latency_p99` | 99th percentile response time | > 10s |
| `error_rate` | Percentage of 5xx responses | > 5% |
| `generation_success_rate` | Percentage of generations that complete successfully | < 95% |
| `generation_latency_p99` | Generation completion time | > 60s |
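The p99 thresholds above are percentiles over a window of latency samples. A minimal nearest-rank reference computation (the helper names are hypothetical, and production monitoring systems use streaming estimates rather than sorting full sample sets):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of observations are <= it (pct=99 gives p99)."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-indexed nearest rank
    return ordered[rank - 1]


def latency_alert(samples_ms, threshold_ms=10_000):
    """True when request_latency_p99 breaches the > 10s threshold."""
    return percentile(samples_ms, 99) > threshold_ms
```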
### Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `cpu_utilization` | Server CPU usage | > 80% |
| `memory_utilization` | Server memory usage | > 85% |
| `disk_utilization` | Disk space usage | > 80% |
| `db_connections` | Active database connections | > 80% capacity |
### Queue Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `queue_depth` | Tasks waiting in queue | > 100 |
| `task_processing_time` | Average task duration | > 30s |
| `task_failure_rate` | Failed tasks percentage | > 5% |
## Tracing

Each request and generation has a trace ID for end-to-end debugging:
```text
trace_id: gen-{generation_id}
├── backend.create_generation
├── compute.submit_workflow
├── compute.task.background_removal
├── compute.task.generation
├── compute.task.face_enhancement
├── compute.task.garment_enhancement
├── compute.task.upscale
└── backend.handle_completion
```
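A `gen-{generation_id}` trace ID can be carried through a request with a context variable so every log line in that request picks it up without explicit plumbing. A minimal sketch; the helper names are hypothetical, not Sartiq's actual tracing API:

```python
import contextvars
import uuid

# Context-local trace ID: isolated per request/task even under concurrency.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace(generation_id=None):
    """Begin a trace, using the gen-{generation_id} convention for generations
    and a random short ID for other requests."""
    tid = f"gen-{generation_id}" if generation_id else uuid.uuid4().hex[:12]
    _trace_id.set(tid)
    return tid


def current_trace_id():
    """Read the active trace ID, e.g. from inside a logging formatter."""
    return _trace_id.get()
```

Each span in the tree above would then log with `current_trace_id()` attached, letting Better Stack group the full generation pipeline under one ID.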
## Dashboards

### Key Dashboards

| Dashboard | Purpose |
|---|---|
| System Overview | High-level health metrics |
| Generation Pipeline | Generation success/failure rates, timing |
| API Performance | Request latency, error rates |
| Queue Health | Task queue depth, processing times |
| Provider Status | AI provider availability and errors |