SRE Dashboard
The SRE Dashboard at /admin/sre provides visibility into system health using Google’s Four Golden Signals.
The Four Golden Signals
1. Latency
| Metric | Good | Concerning | Bad |
|---|---|---|---|
| P50 | <50ms | 50-200ms | >200ms |
| P95 | <200ms | 200-1000ms | >1000ms |
| P99 | <500ms | 500-2000ms | >2000ms |
Key insight: averages hide tail latency. P95 and P99 show what the slowest 5% and 1% of requests actually experience.
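To make the point about averages concrete, here is a minimal sketch that computes nearest-rank percentiles over a list of latency samples. The `percentile` helper and the sample data are hypothetical and are not taken from the dashboard's code; the dashboard may use a different percentile method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 still returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 94 fast requests (20 ms) and 6 slow ones (2500 ms).
latencies_ms = [20.0] * 94 + [2500.0] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p95={p95_ms}ms p99={p99_ms}ms")
```

With this data the mean is about 169 ms, which looks healthy on its own, while P95 and P99 are both 2500 ms and fall in the Bad column of the table above.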
2. Traffic (RPM)
Requests per minute. Watch for sudden spikes (viral content or an attack) and sudden drops (the site may be broken).
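One way to derive RPM and flag spikes or drops is to count request timestamps in a trailing window and compare the result with a baseline. This is a sketch only: the function names, the 3x spike factor, and the 0.3x drop factor are illustrative assumptions, not values used by the dashboard.

```python
def requests_per_minute(timestamps, now, window_s=60.0):
    """Count requests in the trailing window (now - window_s, now], scaled to a per-minute rate."""
    recent = sum(1 for t in timestamps if now - window_s < t <= now)
    return recent * 60.0 / window_s


def classify_traffic(current_rpm, baseline_rpm, spike_factor=3.0, drop_factor=0.3):
    """Compare the current rate with a baseline. The factors are hypothetical thresholds."""
    if baseline_rpm <= 0:
        raise ValueError("baseline_rpm must be positive")
    ratio = current_rpm / baseline_rpm
    if ratio >= spike_factor:
        return "spike"
    if ratio <= drop_factor:
        return "drop"
    return "normal"


# Hypothetical request timestamps in seconds; only those after t=60 count at now=120.
stamps = [0.0, 10.0, 50.0, 61.0, 100.0, 119.0, 120.0]
rpm = requests_per_minute(stamps, now=120.0)
status = classify_traffic(rpm, baseline_rpm=20.0)
print(rpm, status)
```

A baseline could come from the same hour on previous days; how to choose it is outside the scope of this sketch.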
3. Errors
| Error Rate | Assessment |
|---|---|
| <0.1% | Excellent |
| 0.1-1% | Good |
| 1-5% | Investigate |
| >5% | Something is broken |
4xx errors are often expected (bad password, 404). 5xx errors need immediate attention.
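The bands in the table can be applied mechanically. The sketch below splits status codes into 4xx and 5xx rates and maps a rate to an assessment. The function names are hypothetical. The table leaves the shared boundaries ambiguous, so this sketch assumes that exactly 1% counts as Investigate and exactly 5% still counts as Investigate.

```python
def error_rates(status_codes):
    """Return (client_error_pct, server_error_pct) for a list of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return 0.0, 0.0
    client = sum(1 for code in status_codes if 400 <= code < 500)
    server = sum(1 for code in status_codes if 500 <= code < 600)
    return 100.0 * client / total, 100.0 * server / total


def assess_error_rate(rate_pct):
    """Map an error rate in percent to the assessment bands from the table."""
    if rate_pct < 0.1:
        return "Excellent"
    if rate_pct < 1.0:
        return "Good"
    if rate_pct <= 5.0:  # assumption: the 5% boundary belongs to this band
        return "Investigate"
    return "Something is broken"


# Hypothetical sample: 970 successes, 20 not-found responses, 10 server errors.
codes = [200] * 970 + [404] * 20 + [500] * 10
client_pct, server_pct = error_rates(codes)
print(client_pct, server_pct, assess_error_rate(server_pct))
```

Tracking the 5xx rate separately keeps expected 4xx noise from masking a real server-side problem.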
4. Saturation
Database connection pool usage. If more than 80% of the pool's connections are checked out, or overflow > 0, you're at capacity.
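The saturation rule can be written as a small predicate. The `PoolStats` type and its field names below are hypothetical. The terms "checked out" and "overflow" resemble the counters exposed by SQLAlchemy's `QueuePool`, but that this project reads them from SQLAlchemy is an assumption; the sketch works on plain numbers.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical snapshot of a connection pool's counters."""
    size: int         # configured number of pooled connections
    checked_out: int  # connections currently in use
    overflow: int     # connections opened beyond the configured size


def is_saturated(stats, threshold=0.8):
    """True when more than `threshold` of the pool is checked out or any overflow is in use."""
    if stats.size <= 0:
        raise ValueError("pool size must be positive")
    usage = stats.checked_out / stats.size
    return usage > threshold or stats.overflow > 0


print(is_saturated(PoolStats(size=10, checked_out=8, overflow=0)))
print(is_saturated(PoolStats(size=10, checked_out=9, overflow=0)))
print(is_saturated(PoolStats(size=10, checked_out=2, overflow=1)))
```

Note that exactly 80% usage does not trip the check, since the rule says more than 80%.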
Metrics Architecture
Request → MetricsMiddleware (records timing) → MetricsService (in-memory deque, 10k items) → /api/v1/monitoring/sre-metrics → SRE Dashboard (polls every 10s)
Key Files
| File | Purpose |
|---|---|
| backend/app/services/metrics.py | Core metrics collection |
| backend/app/middleware/metrics.py | Request timing middleware |
| frontend/src/pages/admin/SREDashboard.tsx | Dashboard UI |
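The collection path described under Metrics Architecture (time each request, append to a bounded in-memory deque) can be sketched as follows. This is not the code in the files listed above: `RequestRecord`, `InMemoryMetrics`, and `timed` are hypothetical names, and the wrapper assumes a handler that returns an integer status code rather than any particular web framework's request and response objects.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float    # seconds since the epoch
    duration_ms: float  # wall-clock handling time
    status_code: int


class InMemoryMetrics:
    """Bounded store: once max_items is reached, the oldest record is dropped."""

    def __init__(self, max_items=10_000):
        self._records = deque(maxlen=max_items)

    def record(self, duration_ms, status_code, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._records.append(RequestRecord(ts, duration_ms, status_code))

    def snapshot(self):
        """Copy of the current records, oldest first."""
        return list(self._records)


def timed(metrics, handler):
    """Wrap a handler (assumed to return a status code) and record its duration."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = 500  # recorded if the handler raises
        try:
            status = handler(*args, **kwargs)
            return status
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            metrics.record(elapsed_ms, status)
    return wrapper


metrics = InMemoryMetrics(max_items=3)
for i in range(5):
    metrics.record(duration_ms=float(i), status_code=200, timestamp=float(i))
kept = [r.duration_ms for r in metrics.snapshot()]
print(kept)
```

Bounding the deque keeps memory use fixed, at the cost of losing history: under heavy traffic 10k records may cover only a short time span, and everything is lost on restart.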