
# Model Evaluations

Evals are systematic tests for AI model outputs. Without them, you can’t compare models, measure prompt improvements, or catch quality regressions.

Why a custom harness instead of an off-the-shelf eval platform:

- Domain-specific metrics: calorie error, protein error, hallucination rate (generic frameworks don't ship these; see the sketch below)
- Zero platform cost: just API costs, no $50-500/mo platform fees
- Tight integration: uses the same code paths as production
- No vendor lock-in: JSON datasets and plain Python, easy to migrate to Braintrust later
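
As a concrete example of a domain-specific metric, here is a minimal sketch of calorie error and hallucination rate. The function names and dict shapes are assumptions that mirror the dataset format shown further down, not the project's actual API.

```python
# Sketch only: function names and dict shapes are assumptions that
# mirror the JSON dataset format shown below.

def calorie_error(predicted: dict, truth: dict) -> float:
    """Relative error between predicted and ground-truth calories."""
    actual = truth["macros"]["calories"]
    return abs(predicted["macros"]["calories"] - actual) / actual

def hallucination_rate(predicted: dict, truth: dict) -> float:
    """Fraction of predicted foods that do not appear in the ground truth."""
    real = {f.lower() for f in truth["foods"]}
    preds = [f.lower() for f in predicted["foods"]]
    return sum(f not in real for f in preds) / len(preds) if preds else 0.0
```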
Running the harness:

```bash
cd backend
python run_evals.py --model openai                    # Single model
python run_evals.py --compare --models openai,claude  # Compare
python run_evals.py --model openai --report           # HTML report
python run_evals.py --model openai --sample 3         # Quick test
```
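
For orientation, a rough sketch of how run_evals.py's flag handling might be wired with argparse; the flag names come from the commands above, but the structure and help text are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the commands above; help text and defaults are guesses.
    p = argparse.ArgumentParser(description="Run meal-analysis evals")
    p.add_argument("--model", help="single model to evaluate, e.g. openai")
    p.add_argument("--compare", action="store_true", help="compare several models")
    p.add_argument("--models", help="comma-separated model list for --compare")
    p.add_argument("--report", action="store_true", help="write an HTML report")
    p.add_argument("--sample", type=int, help="run only the first N cases")
    return p

if __name__ == "__main__":
    args = build_parser().parse_args()
    models = args.models.split(",") if args.compare and args.models else [args.model]
    print("Evaluating:", models)
```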
Each run is scored against these targets:

| Metric | Target | Critical? |
| --- | --- | --- |
| Accuracy (pass rate) | >90% | Yes |
| Hallucination Rate | <5% | Yes |
| Avg Cost | <$0.01/meal | Important |
| p95 Latency | <3s | Important |
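
These targets translate directly into a programmatic gate. A minimal sketch, assuming results arrive as a flat dict; the key names and function are hypothetical:

```python
# Thresholds from the table above; the result keys are assumptions.
TARGETS = {
    "pass_rate":          (lambda v: v > 0.90, "critical"),
    "hallucination_rate": (lambda v: v < 0.05, "critical"),
    "avg_cost_usd":       (lambda v: v < 0.01, "important"),
    "p95_latency_s":      (lambda v: v < 3.0,  "important"),
}

def failed_targets(results: dict) -> list[str]:
    """Names of metrics that missed their target."""
    return [name for name, (ok, _severity) in TARGETS.items()
            if not ok(results[name])]
```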
Eval cases live in versioned JSON datasets:

```json
{
  "name": "meal_analysis_v1",
  "cases": [
    {
      "id": "meal_001",
      "image_url": "evals/datasets/images/meal_001.jpg",
      "ground_truth": {
        "foods": ["Grilled Chicken", "Brown Rice", "Broccoli"],
        "macros": { "calories": 450, "protein": 55 }
      },
      "tags": ["high-protein"],
      "difficulty": "easy"
    }
  ]
}
```
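
Loading and sampling a dataset is then plain Python; a sketch assuming the layout above (the dataset path is hypothetical, inferred from the image_url convention):

```python
import json
from pathlib import Path

def load_cases(path: str, sample: int | None = None) -> list[dict]:
    """Read an eval dataset, optionally keeping only the first N cases
    (what --sample does)."""
    cases = json.loads(Path(path).read_text())["cases"]
    return cases[:sample] if sample else cases

# Hypothetical path; mirrors the image_url convention above.
for case in load_cases("evals/datasets/meal_analysis_v1.json", sample=3):
    print(case["id"], case["difficulty"], case["tags"])
```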
A `--compare` run produces a table like this:

| Model | Pass Rate | Calorie Error | Cost/Meal | p95 Latency |
| --- | --- | --- | --- | --- |
| GPT-4o-mini | 87.5% | 8.3% | $0.004 | 2.3s |
| Claude Sonnet | 92.0% | 6.1% | $0.016 | 3.1s |
| Gemini Flash | 78.3% | 11.2% | $0.001 | 1.8s |

Evals run automatically on every push to `main` and whenever a commit message contains `[evals]`. Deploys are blocked if the pass rate drops below 85%.
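
In CI, that gate can be a single exit-code check; a sketch assuming run_evals.py writes a summary JSON (the filename and key are assumptions):

```python
import json
import sys

# Hypothetical summary file written by run_evals.py.
summary = json.load(open("evals/results/summary.json"))
if summary["pass_rate"] < 0.85:  # the 85% deploy gate described above
    print(f"Pass rate {summary['pass_rate']:.1%} is below the 85% gate")
    sys.exit(1)  # a non-zero exit fails the CI job and blocks the deploy
```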