# Evaluators
Evaluators score agent outputs against expected behavior. AgentProbe includes several built-in evaluators and supports custom implementations.
## Built-in Evaluators

### Rules-Based Evaluator
The `RuleBasedEvaluator` applies declarative rules with weighted scoring. Each rule checks a specific aspect of the output and contributes to the overall score.
```python
from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(
    name="my-rules",
    rules=[
        RuleSpec(
            rule_type="contains_any",
            params={"substrings": ["hello", "hi", "hey"]},
            weight=1.0,
            description="Output should contain a greeting",
        ),
        RuleSpec(
            rule_type="not_contains",
            params={"substrings": ["error", "fail"]},
            weight=2.0,
            description="Output should not contain error terms",
        ),
        RuleSpec(
            rule_type="max_length",
            params={"max_length": 500},
            weight=0.5,
            description="Output should be concise",
        ),
    ],
)

result = await evaluator.evaluate(test_case, trace)
```
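Weights control how much each rule contributes. A minimal sketch of how the weighted score is presumably combined (the exact formula is an assumption, not confirmed by the API):

```python
# Hedged sketch: assuming each rule scores 1.0 (pass) or 0.0 (fail),
# the overall score would be the weight-normalized average.
weights = [1.0, 2.0, 0.5]   # weights from the rules above
checks  = [1.0, 1.0, 0.0]   # greeting found, no error terms, output too long
score = sum(w * c for w, c in zip(weights, checks)) / sum(weights)
print(round(score, 3))      # (1.0 + 2.0 + 0.0) / 3.5 = 0.857
```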
Available rule types:
| Rule Type | Parameters | Description |
|---|---|---|
| `contains_any` | `substrings: list[str]` | Output contains at least one of the substrings |
| `not_contains` | `substrings: list[str]` | Output contains none of the substrings |
| `max_length` | `max_length: int` | Output length does not exceed the limit |
| `regex` | `pattern: str` | Output matches a regex pattern |
| `json_valid` | (none) | Output is valid JSON |
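For example, the `regex` and `json_valid` rule types combine naturally when validating structured output. A sketch following the constructor shown above (the empty `params` dict for `json_valid` is an assumption):

```python
from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(
    name="json-output-rules",
    rules=[
        RuleSpec(
            rule_type="json_valid",
            params={},  # json_valid takes no parameters
            weight=2.0,
            description="Output must parse as JSON",
        ),
        RuleSpec(
            rule_type="regex",
            params={"pattern": r'"status":\s*"(ok|error)"'},
            weight=1.0,
            description="Output should report a status field",
        ),
    ],
)
```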
### Embedding Similarity Evaluator

Compares agent output against the expected output using cosine similarity of their embedding vectors:
```python
from agentprobe.eval.embedding import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator(
    model="text-embedding-3-small",
    provider="openai",
    threshold=0.8,
)

result = await evaluator.evaluate(test_case, trace)
# result.score is the cosine similarity (0.0 to 1.0)
```
Requires `expected_output` to be set on the `TestCase`. Uses caching to avoid redundant API calls.
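A hypothetical test case carrying a reference answer (the `TestCase` import path and all field names other than `expected_output` are assumptions for illustration):

```python
from agentprobe import TestCase  # import path assumed

test_case = TestCase(
    name="refund-policy",                 # hypothetical field
    input="What is your refund policy?",  # hypothetical field
    expected_output="Refunds are available within 30 days of purchase.",
)
```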
### LLM Judge Evaluator
Uses a language model to evaluate agent output against a rubric:
```python
from agentprobe.eval.llm_judge import LLMJudge

evaluator = LLMJudge(
    model="claude-sonnet-4-5-20250929",
    provider="anthropic",
    temperature=0.0,
    rubric="Evaluate whether the response is helpful, accurate, and concise.",
)

result = await evaluator.evaluate(test_case, trace)
```
The judge returns a structured verdict (`PASS`, `FAIL`, or `PARTIAL`) with a score and reasoning.
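All three are exposed on the returned `EvalResult` (fields documented in the table below):

```python
result = await evaluator.evaluate(test_case, trace)

print(result.verdict)  # PASS, FAIL, or PARTIAL
print(result.score)    # numeric score in [0.0, 1.0]
print(result.reason)   # the judge's explanation
```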
### Statistical Evaluator
Wraps another evaluator and runs it across multiple traces to compute aggregate statistics:
```python
from agentprobe import StatisticalEvaluator
from agentprobe.eval.rules import RuleBasedEvaluator

inner = RuleBasedEvaluator(rules=[...])
evaluator = StatisticalEvaluator(inner, pass_threshold=0.7)

summary = await evaluator.evaluate_multiple(test_case, traces)
# summary.mean, summary.std_dev, summary.median, summary.p5, summary.p95
```
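Continuing from the snippet above, the percentile spread gives a rough flakiness signal across runs (the 0.3 cutoff below is an arbitrary example, not a library default):

```python
print(f"mean={summary.mean:.2f} std_dev={summary.std_dev:.2f}")
if summary.p95 - summary.p5 > 0.3:  # arbitrary example threshold
    print("Scores vary widely across runs; the agent may be non-deterministic.")
```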
### Trace Comparison Evaluator
Compares a trace against a reference trace across multiple dimensions:
```python
from agentprobe import TraceComparisonEvaluator

evaluator = TraceComparisonEvaluator(
    reference_trace=baseline_trace,
    pass_threshold=0.7,
    weights={
        "tool_sequence": 0.3,
        "tool_parameters": 0.2,
        "output_similarity": 0.35,
        "cost_deviation": 0.15,
    },
)

result = await evaluator.evaluate(test_case, current_trace)
```
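The weights in this example sum to 1.0. Presumably the per-dimension scores are combined as a weighted sum and compared against `pass_threshold`; a sketch with hypothetical dimension scores (the combination rule itself is an assumption):

```python
# Hypothetical per-dimension scores; the combination rule is an assumption.
dim_scores = {
    "tool_sequence": 0.9,
    "tool_parameters": 1.0,
    "output_similarity": 0.8,
    "cost_deviation": 0.6,
}
weights = {
    "tool_sequence": 0.3,
    "tool_parameters": 0.2,
    "output_similarity": 0.35,
    "cost_deviation": 0.15,
}
overall = sum(weights[d] * dim_scores[d] for d in weights)
print(round(overall, 2))  # 0.84 -> PASS at pass_threshold=0.7
```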
## Configuring Default Evaluators

Set default evaluators in `agentprobe.yaml`:
```yaml
eval:
  default_evaluators:
    - rules
  judge:
    model: claude-sonnet-4-5-20250929
    provider: anthropic
    temperature: 0.0
    max_tokens: 1024
```
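To also run the judge on every test case by default, it can presumably be added to the list (assuming evaluator names in `default_evaluators` map to the config sections):

```yaml
eval:
  default_evaluators:
    - rules
    - judge
```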
## Evaluation Results

Every evaluator returns an `EvalResult`:
| Field | Type | Description |
|---|---|---|
| `evaluator_name` | `str` | Name of the evaluator |
| `verdict` | `EvalVerdict` | `PASS`, `FAIL`, `PARTIAL`, or `ERROR` |
| `score` | `float` | Numeric score (0.0 to 1.0) |
| `reason` | `str` | Human-readable explanation |
| `metadata` | `dict` | Additional evaluator-specific data |
Verdicts:

- `PASS`: Output meets all criteria
- `FAIL`: Output does not meet criteria
- `PARTIAL`: Output partially meets criteria
- `ERROR`: Evaluation itself failed
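A typical consumption pattern branches on the verdict; a sketch (the `EvalVerdict` import path is an assumption):

```python
from agentprobe import EvalVerdict  # import path assumed

result = await evaluator.evaluate(test_case, trace)

if result.verdict is EvalVerdict.ERROR:
    raise RuntimeError(f"Evaluation failed: {result.reason}")
if result.verdict is not EvalVerdict.PASS:
    print(f"Needs attention (score {result.score:.2f}): {result.reason}")
```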