# Model Evaluations
Evals are systematic tests for AI model outputs. Without them, you can’t compare models, measure prompt improvements, or catch quality regressions.
## Why Custom Evals?

- Domain-specific metrics — calorie error, protein error, hallucination rate (generic frameworks don't ship these)
- Zero platform cost — you pay only for API usage, not $50-500/mo platform fees
- Tight integration — evals exercise the same code paths as production
- No vendor lock-in — JSON datasets and plain Python, easy to migrate to Braintrust later
## Quick Start

```shell
cd backend
python run_evals.py --model openai                    # Single model
python run_evals.py --compare --models openai,claude  # Compare models
python run_evals.py --model openai --report           # HTML report
python run_evals.py --model openai --sample 3         # Quick test
```

## Metrics

| Metric | Target | Critical? |
|---|---|---|
| Accuracy (pass rate) | >90% | Yes |
| Hallucination Rate | <5% | Yes |
| Avg Cost | <$0.01/meal | Important |
| p95 Latency | <3s | Important |
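A minimal sketch of how these headline metrics could be aggregated from per-case results. The `case_results` shape and the `summarize` helper are illustrative assumptions, not the actual harness:

```python
import math

def summarize(case_results):
    """Aggregate per-case eval results into the headline metrics.

    Each result dict is assumed to look like:
    {"passed": bool, "hallucinated": bool, "cost_usd": float, "latency_s": float}
    """
    n = len(case_results)
    latencies = sorted(r["latency_s"] for r in case_results)
    # p95 latency via the nearest-rank method
    p95_idx = max(0, math.ceil(0.95 * n) - 1)
    return {
        "pass_rate": sum(r["passed"] for r in case_results) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in case_results) / n,
        "avg_cost": sum(r["cost_usd"] for r in case_results) / n,
        "p95_latency": latencies[p95_idx],
    }
```

Each summary value maps directly onto a row of the table above, so the same dict can drive both the CLI output and the HTML report.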
## Dataset Structure

```json
{
  "name": "meal_analysis_v1",
  "cases": [
    {
      "id": "meal_001",
      "image_url": "evals/datasets/images/meal_001.jpg",
      "ground_truth": {
        "foods": ["Grilled Chicken", "Brown Rice", "Broccoli"],
        "macros": { "calories": 450, "protein": 55 }
      },
      "tags": ["high-protein"],
      "difficulty": "easy"
    }
  ]
}
```
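A sketch of scoring one model prediction against a dataset case shaped like the JSON above. The `score_case` helper and the 15% calorie tolerance are assumptions for illustration, not the harness's actual scoring rule:

```python
import json

def score_case(case, predicted, tolerance=0.15):
    """Compare a model's predicted macros against a case's ground truth.

    Returns the relative calorie error and whether it falls within tolerance.
    """
    truth = case["ground_truth"]["macros"]
    calorie_error = abs(predicted["calories"] - truth["calories"]) / truth["calories"]
    return {"calorie_error": calorie_error, "passed": calorie_error <= tolerance}

# Loading a dataset file with the structure shown above:
# with open("evals/datasets/meal_analysis_v1.json") as f:
#     dataset = json.load(f)
#     cases = dataset["cases"]
```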
## Model Comparison Example

| Model | Pass Rate | Calorie Error | Cost/Meal | p95 Latency |
|---|---|---|---|---|
| GPT-4o-mini | 87.5% | 8.3% | $0.004 | 2.3s |
| Claude Sonnet | 92.0% | 6.1% | $0.016 | 3.1s |
| Gemini Flash | 78.3% | 11.2% | $0.001 | 1.8s |
## CI/CD Integration

Evals run automatically on every push to `main`, or when the commit message contains `[evals]`. Deploys are blocked if the pass rate drops below 85%.
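The pass-rate gate can be expressed as a small script that exits non-zero when the threshold is missed, which any CI step can use to block the deploy. The results-file path and JSON shape here are assumptions; wire it in however your pipeline expects:

```python
import json
import sys

PASS_RATE_THRESHOLD = 0.85  # deploys blocked below this pass rate

def check_gate(results_path):
    """Return True if the eval run's pass rate meets the deploy threshold."""
    with open(results_path) as f:
        results = json.load(f)
    return results["pass_rate"] >= PASS_RATE_THRESHOLD

if __name__ == "__main__" and len(sys.argv) > 1:
    ok = check_gate(sys.argv[1])
    print(f"pass rate gate: {'OK' if ok else 'FAILED'}")
    sys.exit(0 if ok else 1)  # non-zero exit fails the CI job, blocking deploy
```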