Evaluations Overview

DuraGraph Evals provides a comprehensive evaluation framework for AI agents. Test your agents automatically, collect human feedback, and monitor quality over time.

AI agents can fail in subtle ways. Evaluations help you:

  • Catch regressions before deploying to production
  • Compare versions to pick the best prompt or model
  • Monitor quality continuously in production
  • Collect feedback to improve over time

DuraGraph Evals supports four evaluation types:

  • Heuristic: rule-based checks such as exact match, contains, regex, and JSON validation (sketched below)
  • LLM Judge: use GPT-4 or Claude to evaluate subjective quality
  • Human Feedback: collect ratings, thumbs up/down, and comments
  • Comparison: A/B test between assistant versions
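
As a rough illustration of what the heuristic scorers do, the sketch below approximates contains, regex, and required-key checks locally in Python. It is for intuition only; the real scorers run inside DuraGraph Evals and are configured via the criteria shown later on this page.

import json
import re

# Illustrative only: local approximations of rule-based checks.
# The actual scorers are configured via criteria such as "json_schema"
# in the eval creation request below.

def check_contains(output: str, needle: str) -> bool:
    """Pass if the expected substring appears in the agent output."""
    return needle.lower() in output.lower()

def check_regex(output: str, pattern: str) -> bool:
    """Pass if the output matches the given regular expression."""
    return re.search(pattern, output) is not None

def check_required_keys(output: str, required: list[str]) -> bool:
    """Pass if the output parses as JSON and contains the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in required)

# Example: the password-reset case from the dataset below
print(check_contains('{"action": "password_reset", "reply": "Use the reset link"}', "reset"))  # True
print(check_required_keys('{"action": "password_reset"}', ["action"]))                         # True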

First, create a dataset with test cases:

curl -X POST http://localhost:8081/api/v1/datasets \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer_support_cases",
    "description": "Common customer support scenarios"
  }'
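
The same call from a script, if you prefer Python over curl; a minimal sketch using requests, and assuming the create response includes the new dataset's id (mirroring the id field shown on eval objects later on this page):

import requests

BASE_URL = "http://localhost:8081/api/v1"

resp = requests.post(
    f"{BASE_URL}/datasets",
    json={
        "name": "customer_support_cases",
        "description": "Common customer support scenarios",
    },
)
resp.raise_for_status()

# Assumption: the response body carries the new dataset's "id",
# like the "id" field shown on eval objects below.
dataset_id = resp.json()["id"]
print("created dataset", dataset_id)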

Then add input/expected output pairs to the dataset:

curl -X POST http://localhost:8081/api/v1/datasets/{dataset_id}/examples \
  -H "Content-Type: application/json" \
  -d '{
    "examples": [
      {
        "input": {"message": "How do I reset my password?"},
        "expected_output": {"contains": "reset", "action": "password_reset"},
        "tags": ["password", "account"]
      },
      {
        "input": {"message": "I want a refund"},
        "expected_output": {"action": "refund_request"},
        "tags": ["billing", "refund"]
      }
    ]
  }'
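
If your test cases already live in a file, you can post them in bulk through the same endpoint; a sketch assuming a hypothetical examples.jsonl with one {input, expected_output, tags} object per line:

import json
import requests

BASE_URL = "http://localhost:8081/api/v1"
dataset_id = "your-dataset-id"  # from the create-dataset step

# Hypothetical local file: one example object per line
with open("examples.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

resp = requests.post(
    f"{BASE_URL}/datasets/{dataset_id}/examples",
    json={"examples": examples},
)
resp.raise_for_status()
print(f"added {len(examples)} examples")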

Create and run an evaluation against your assistant:

curl -X POST http://localhost:8081/api/v1/evals \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support_agent_v1.2",
    "type": "heuristic",
    "assistant_id": "your-assistant-id",
    "dataset_id": "your-dataset-id",
    "criteria": [
      {
        "name": "contains_action",
        "scorer": "json_schema",
        "config": {
          "schema": {
            "type": "object",
            "required": ["action"]
          }
        }
      },
      {
        "name": "response_quality",
        "scorer": "llm_judge",
        "config": {
          "model": "gpt-4o",
          "rubric": "Rate helpfulness 1-5"
        }
      }
    ],
    "run_immediately": true
  }'
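
The same step from a script; this sketch assumes the eval payload is kept in a local file (eval_config.json here is hypothetical) and that the create response returns the eval object with its id, matching the shape the GET endpoint returns below:

import json
import requests

BASE_URL = "http://localhost:8081/api/v1"

# Hypothetical file holding the same JSON payload as the curl example above
with open("eval_config.json") as f:
    payload = json.load(f)

resp = requests.post(f"{BASE_URL}/evals", json=payload)
resp.raise_for_status()

# Assumption: the create response returns the eval object with its "id",
# like the GET response shown below.
eval_id = resp.json()["id"]
print("started eval", eval_id)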

Check evaluation results in the dashboard or via the API:

curl http://localhost:8081/api/v1/evals/{eval_id}

Response:

{
  "id": "eval-123",
  "name": "support_agent_v1.2",
  "status": "completed",
  "summary": {
    "total_examples": 50,
    "passed": 45,
    "failed": 5,
    "pass_rate": 0.9,
    "avg_score": 4.2,
    "by_criterion": {
      "contains_action": { "pass_rate": 0.96 },
      "response_quality": { "avg_score": 4.2 }
    }
  }
}
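
In CI or other scripts, you can poll the same endpoint until the run reaches the "completed" status shown above and gate on the summary; a minimal sketch with an arbitrary pass-rate threshold (the non-terminal status values are an assumption, adjust to what your server reports):

import sys
import time
import requests

BASE_URL = "http://localhost:8081/api/v1"
eval_id = "eval-123"           # from the create step
PASS_RATE_THRESHOLD = 0.9      # arbitrary gate for this sketch

# Poll until the eval reports "completed" (assumption: other statuses mean still running)
for _ in range(60):
    result = requests.get(f"{BASE_URL}/evals/{eval_id}").json()
    if result["status"] == "completed":
        break
    time.sleep(5)
else:
    sys.exit("eval did not finish in time")

summary = result["summary"]
print(f"pass rate: {summary['pass_rate']:.0%}, avg score: {summary['avg_score']}")

# Fail the pipeline if quality regressed below the threshold
if summary["pass_rate"] < PASS_RATE_THRESHOLD:
    sys.exit(1)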

The Evals Dashboard provides visualization and analysis:

  • Overview: Pass rate trends, score distribution, regressions
  • Eval Detail: Drill down into individual results
  • Comparison: Side-by-side A/B testing
  • Feedback: Human annotation queue

Scorers Reference

Learn about all available scorers