SRE Dashboard

The SRE Dashboard at /admin/sre provides visibility into system health using Google’s Four Golden Signals: latency, traffic, errors, and saturation.

| Latency metric | Good | Concerning | Bad |
| --- | --- | --- | --- |
| P50 | <50ms | 50-200ms | >200ms |
| P95 | <200ms | 200-1000ms | >1000ms |
| P99 | <500ms | 500-2000ms | >2000ms |

Key insight: averages lie. A handful of very slow requests barely moves the mean, while P95 and P99 show what the slowest 5% and 1% of requests actually experience.
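
To make the point about averages concrete, here is a minimal sketch that compares the mean with nearest-rank percentiles and grades each percentile against the thresholds in the table above. The sample data, the `percentile` and `classify` helpers, and the choice of nearest-rank are assumptions for illustration; the dashboard's actual calculation may differ.

```python
import math
from statistics import mean

# Thresholds in milliseconds from the latency table: (good_below, bad_above).
THRESHOLDS_MS = {50: (50, 200), 95: (200, 1000), 99: (500, 2000)}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify(pct, value_ms):
    """Grade a percentile value; values exactly on a boundary count as Concerning (an assumption)."""
    good_below, bad_above = THRESHOLDS_MS[pct]
    if value_ms < good_below:
        return "Good"
    if value_ms > bad_above:
        return "Bad"
    return "Concerning"


# Hypothetical sample: 94 fast requests and 6 very slow ones.
latencies_ms = [40] * 94 + [2500] * 6

print("mean", mean(latencies_ms))
for pct in (50, 95, 99):
    value = percentile(latencies_ms, pct)
    print(f"P{pct}", value, classify(pct, value))
```

In this sample the mean is about 188 ms, which stays under every "Bad" threshold in the table, while P95 and P99 are both 2500 ms.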

Traffic is measured in requests per minute. Watch for spikes (viral content or attacks) and drops (the site might be broken).
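
As a rough sketch of how spikes and drops might be flagged, the functions below compute a per-minute rate from request timestamps and compare it with a baseline. The function names and the `spike_factor` and `drop_factor` values are hypothetical; the source does not define what counts as a spike or a drop.

```python
def requests_per_minute(timestamps, now, window_seconds=60):
    """Count requests inside the trailing window and scale the count to a per-minute rate."""
    recent = sum(1 for t in timestamps if now - window_seconds < t <= now)
    return recent * 60 / window_seconds


def traffic_alert(current_rpm, baseline_rpm, spike_factor=3.0, drop_factor=0.3):
    """Compare the current rate with a baseline; both factors are illustrative defaults."""
    if baseline_rpm <= 0:
        return "no-baseline"
    ratio = current_rpm / baseline_rpm
    if ratio >= spike_factor:
        return "spike"
    if ratio <= drop_factor:
        return "drop"
    return "normal"


# Timestamps are seconds since an arbitrary start; only 100 and 119 fall in the last minute.
print(requests_per_minute([0, 10, 50, 100, 119], now=120))
print(traffic_alert(current_rpm=900, baseline_rpm=200))
```

A baseline could be the same hour on the previous day or a rolling average; which one fits depends on how seasonal the traffic is.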

| Error Rate | Assessment |
| --- | --- |
| <0.1% | Excellent |
| 0.1-1% | Good |
| 1-5% | Investigate |
| >5% | Something is broken |

4xx errors are often expected (bad password, 404). 5xx errors need immediate attention.
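
The sketch below separates 4xx from 5xx responses and maps a rate onto the assessment table. Two things here are assumptions rather than documented behavior: the assessment is applied to the 5xx rate only, and a rate exactly on a boundary (1% or 5%) falls into the milder band. The `error_rates` and `assess` names are hypothetical.

```python
def error_rates(status_codes):
    """Return client (4xx) and server (5xx) error rates as percentages of all requests."""
    total = len(status_codes)
    if total == 0:
        return {"client_pct": 0.0, "server_pct": 0.0}
    client = sum(1 for s in status_codes if 400 <= s < 500)
    server = sum(1 for s in status_codes if 500 <= s < 600)
    return {"client_pct": 100 * client / total, "server_pct": 100 * server / total}


def assess(rate_pct):
    """Map an error rate (in percent) onto the bands of the assessment table."""
    if rate_pct < 0.1:
        return "Excellent"
    if rate_pct <= 1:
        return "Good"
    if rate_pct <= 5:
        return "Investigate"
    return "Something is broken"


# Hypothetical window: 9,900 successes, 80 not-found responses, 20 server errors.
statuses = [200] * 9900 + [404] * 80 + [500] * 20
rates = error_rates(statuses)
print(rates, assess(rates["server_pct"]))
```

Tracking the two classes separately keeps a burst of bad passwords or 404s from masking, or being mistaken for, a real server-side failure.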

Saturation is tracked as database connection pool usage. If more than 80% of the pool's connections are checked out, or overflow > 0, you’re at capacity.
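
The capacity rule can be written as a small pure function. This is a sketch: `pool_status` is a hypothetical helper, and the three inputs are assumed to come from whatever pool the backend uses. If that happens to be SQLAlchemy's `QueuePool` (an assumption, not stated in the source), its `checkedout()`, `size()`, and `overflow()` methods report these numbers, and `overflow()` is negative while the pool still has unopened slots.

```python
def pool_status(checked_out, pool_size, overflow):
    """Apply the rule: more than 80% of the pool checked out, or any overflow, means at capacity."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    utilization = checked_out / pool_size
    return {
        "utilization_pct": round(utilization * 100, 1),
        "at_capacity": utilization > 0.80 or overflow > 0,
    }


print(pool_status(checked_out=4, pool_size=10, overflow=0))
print(pool_status(checked_out=12, pool_size=10, overflow=2))
```

Exactly 80% is treated as still within capacity here, since the rule says "more than 80%".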

Request → MetricsMiddleware (records timing) → MetricsService (in-memory deque, 10k items)
→ /api/v1/monitoring/sre-metrics → SRE Dashboard (polls every 10s)

| File | Purpose |
| --- | --- |
| backend/app/services/metrics.py | Core metrics collection |
| backend/app/middleware/metrics.py | Request timing middleware |
| frontend/src/pages/admin/SREDashboard.tsx | Dashboard UI |
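
The request flow described above (a middleware records timing into an in-memory deque capped at 10k items) might look roughly like the sketch below. Everything except the deque and its 10,000-item cap is an assumption: the `RequestSample` fields, the method names, and `timed_call`, which stands in for the real middleware because the source does not name the web framework. It is not the contents of `backend/app/services/metrics.py`.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    path: str
    status: int
    duration_ms: float


class MetricsService:
    """Keeps the most recent samples in memory; the oldest are dropped once the cap is reached."""

    def __init__(self, max_items=10_000):
        self._samples = deque(maxlen=max_items)

    def record(self, sample):
        self._samples.append(sample)

    def __len__(self):
        return len(self._samples)

    def snapshot(self):
        """Summary that a polling endpoint could serialize for the dashboard."""
        samples = list(self._samples)
        return {
            "count": len(samples),
            "errors_5xx": sum(1 for s in samples if s.status >= 500),
            "max_ms": max((s.duration_ms for s in samples), default=0.0),
        }


def timed_call(service, path, handler):
    """Stand-in for the timing middleware: run the handler, then record status and duration."""
    start = time.perf_counter()
    status = 500  # recorded if the handler raises before returning a status
    try:
        status = handler()
        return status
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        service.record(RequestSample(path, status, elapsed_ms))


service = MetricsService()
timed_call(service, "/api/v1/example", lambda: 200)
print(service.snapshot()["count"])
```

A bounded deque keeps memory use fixed and needs no external store, at the cost of losing history on restart and reporting only the most recent 10k requests per process.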