Evaluation Scorers
Scorers evaluate agent outputs and produce scores. DuraGraph supports both heuristic (rule-based) and LLM-based scorers.
Heuristic Scorers
Fast, deterministic scorers for objective criteria.
exact_match
Checks if output exactly matches the expected value.
{ "name": "answer_correct", "scorer": "exact_match", "config": { "field": "answer", "case_sensitive": false }}contains
contains
Checks if output contains a substring.
{ "name": "mentions_refund", "scorer": "contains", "config": { "substring": "refund", "case_sensitive": false }}Checks if output matches a regular expression.
{ "name": "valid_email", "scorer": "regex", "config": { "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$" }}json_valid
json_valid
Checks if output is valid JSON.
{ "name": "valid_json", "scorer": "json_valid"}json_schema
json_schema
Validates output against a JSON Schema.
{ "name": "valid_response", "scorer": "json_schema", "config": { "schema": { "type": "object", "required": ["action", "message"], "properties": { "action": { "type": "string", "enum": ["reply", "transfer", "escalate"] }, "message": { "type": "string", "minLength": 1 } } } }}length
length
Checks if output length is within bounds.
{ "name": "appropriate_length", "scorer": "length", "config": { "min": 50, "max": 500 }}LLM Judge Scorer
LLM Judge Scorer
Uses an LLM to evaluate subjective qualities.
Basic Usage
```json
{
  "name": "helpfulness",
  "scorer": "llm_judge",
  "config": {
    "model": "gpt-4o",
    "criteria": ["helpfulness", "accuracy", "clarity"]
  }
}
```
Custom Rubric
```json
{
  "name": "custom_quality",
  "scorer": "llm_judge",
  "config": {
    "model": "gpt-4o",
    "rubric": "Evaluate the response on:\n1. Does it address the user's question?\n2. Is the tone professional?\n3. Is the information accurate?\n\nScore 1-5 for each criterion."
  }
}
```
Multi-Criteria
```json
{
  "name": "comprehensive",
  "scorer": "llm_judge",
  "config": {
    "model": "claude-3-sonnet",
    "criteria": [
      { "name": "relevance", "description": "How relevant is the response to the query?" },
      { "name": "completeness", "description": "Does the response fully address the question?" },
      { "name": "safety", "description": "Is the response safe and appropriate?" }
    ]
  }
}
```
Using Scorers in Python SDK
```python
from duragraph.evals import (
    EvalRunner,
    ExactMatch,
    Contains,
    Regex,
    JSONSchema,
)

runner = EvalRunner(
    graph=my_agent,
    scorers=[
        ExactMatch(field="action"),
        Contains(substring="thank you", case_sensitive=False),
        JSONSchema(schema={"type": "object", "required": ["response"]}),
    ],
)

results = await runner.run(dataset)
```

The LLM judge plugs into the same runner:

```python
from duragraph.evals import EvalRunner, LLMJudge

runner = EvalRunner(
    graph=my_agent,
    scorers=[
        LLMJudge(
            model="gpt-4o",
            criteria=["helpfulness", "accuracy"],
            rubric="Rate each criterion 1-5",
        ),
    ],
)

results = await runner.run(dataset)
```
Score Output
All scorers produce a standardized score:
```python
from dataclasses import dataclass

@dataclass
class Score:
    criterion: str    # Name of the criterion
    value: float      # Score value (0-1 for pass/fail, 1-5 for ratings)
    passed: bool      # Whether the score passes threshold
    explanation: str  # Human-readable explanation
```
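For example, a per-criterion pass rate can be computed directly from these fields. This is a minimal sketch only: the exact shape of `results` returned by `EvalRunner.run` is not documented above, so treating it as one list of `Score` objects per dataset example is an assumption.

```python
from collections import defaultdict

# Sketch only: assumes `results` iterates as one list of Score objects per
# dataset example; the Score fields used here are the documented ones.
passed_counts: dict[str, int] = defaultdict(int)
total_counts: dict[str, int] = defaultdict(int)

for example_scores in results:
    for score in example_scores:
        total_counts[score.criterion] += 1
        passed_counts[score.criterion] += int(score.passed)

for criterion in sorted(total_counts):
    rate = passed_counts[criterion] / total_counts[criterion]
    print(f"{criterion}: {rate:.0%} passed")
```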
Custom Scorers
Create custom scorers by implementing the Scorer interface:
```python
from duragraph.evals import Scorer, Score

class SentimentScorer(Scorer):
    """Custom scorer that checks sentiment."""

    def __init__(self, target_sentiment: str = "positive"):
        self.target_sentiment = target_sentiment

    async def score(
        self,
        output: str,
        expected: str | None,
        config: dict | None,
    ) -> Score:
        # Your scoring logic here
        sentiment = analyze_sentiment(output)
        passed = sentiment == self.target_sentiment

        return Score(
            criterion="sentiment",
            value=1.0 if passed else 0.0,
            passed=passed,
            explanation=f"Detected sentiment: {sentiment}",
        )

# Use custom scorer
runner = EvalRunner(
    graph=my_agent,
    scorers=[SentimentScorer(target_sentiment="positive")],
)
```
Best Practices
- Start simple: Begin with heuristic scorers, add LLM judge for nuanced evaluation
- Be specific: Clear criteria produce more reliable scores
- Calibrate thresholds: Adjust pass/fail thresholds based on your quality bar (see the sketch after this list)
- Monitor drift: Track scores over time to catch regressions
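One way to make thresholds easy to calibrate is to expose the pass threshold as a constructor argument on a custom scorer, so the quality bar can be tuned without touching scoring logic. The sketch below builds on the Scorer interface and Score dataclass shown earlier; `similarity()` is a hypothetical helper, not part of the SDK.

```python
from duragraph.evals import Scorer, Score

class SimilarityScorer(Scorer):
    """Passes when output similarity to the expected value clears a tunable threshold."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # raise or lower to match your quality bar

    async def score(
        self,
        output: str,
        expected: str | None,
        config: dict | None,
    ) -> Score:
        value = similarity(output, expected)  # hypothetical 0-1 similarity function
        return Score(
            criterion="similarity",
            value=value,
            passed=value >= self.threshold,
            explanation=f"similarity={value:.2f} vs threshold {self.threshold}",
        )
```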