Evaluation

Evaluators for assessing agent output quality.

Rules-Based Evaluator

agentprobe.eval.rules

Rule-based evaluator with configurable rules and weighted scoring. Provides a declarative evaluation approach using built-in rule handlers like `contains_any`, `not_contains`, `max_length`, `regex`, and `json_valid`.
RuleSpec

Bases: `BaseModel`

Specification for a single evaluation rule.

Attributes:

| Name | Type | Description |
|---|---|---|
| `rule_type` | `str` | The type of rule (e.g. `'contains_any'`, `'regex'`). |
| `params` | `dict[str, Any]` | Parameters for the rule handler. |
| `weight` | `float` | Relative weight of this rule in the overall score. |
| `description` | `str` | Human-readable description of what this rule checks. |

Source code in src/agentprobe/eval/rules.py
RuleBasedEvaluator

Bases: `BaseEvaluator`

Evaluator that applies a set of declarative rules with weighted scoring. Each rule is checked against the agent output; the final score is the weighted average of passing rules.

Attributes:

| Name | Type | Description |
|---|---|---|
| `rules` | | List of rule specifications to evaluate. |

Source code in src/agentprobe/eval/rules.py
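The weighted-average scoring described above can be sketched in plain Python. This is an illustrative sketch, not the library's source: the handler names mirror the documented built-ins, but the parameter keys (`values`, `limit`, `pattern`) are assumptions.

```python
import json
import re
from typing import Any, Callable

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Handlers mirroring the documented built-in rule types; parameter
# keys ("values", "limit", "pattern") are hypothetical.
HANDLERS: dict[str, Callable[[str, dict[str, Any]], bool]] = {
    "contains_any": lambda out, p: any(s in out for s in p["values"]),
    "not_contains": lambda out, p: all(s not in out for s in p["values"]),
    "max_length": lambda out, p: len(out) <= p["limit"],
    "regex": lambda out, p: re.search(p["pattern"], out) is not None,
    "json_valid": lambda out, p: _is_json(out),
}

def score(output: str, rules: list[dict[str, Any]]) -> float:
    """Weighted average: total weight of passing rules / total weight of all rules."""
    total = sum(r.get("weight", 1.0) for r in rules)
    passed = sum(
        r.get("weight", 1.0)
        for r in rules
        if HANDLERS[r["rule_type"]](output, r.get("params", {}))
    )
    return passed / total if total else 0.0

rules = [
    {"rule_type": "contains_any", "params": {"values": ["done"]}, "weight": 2.0},
    {"rule_type": "max_length", "params": {"limit": 100}, "weight": 1.0},
]
print(score("task done", rules))  # 1.0: both rules pass
```

With a long output, only the `contains_any` rule passes, so the score drops to its weight share (2/3) rather than to zero — this is the benefit of weighted scoring over all-or-nothing checks.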
Embedding Evaluator

agentprobe.eval.embedding

Embedding similarity evaluator using cosine similarity. Compares agent output embeddings against expected output embeddings to produce a similarity score.
EmbeddingSimilarityEvaluator

Bases: `BaseEvaluator`

Evaluator that compares embeddings via cosine similarity. Obtains embeddings for expected and actual outputs from an embedding API, then computes cosine similarity. A threshold determines pass/fail.

Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | | Embedding model identifier. |
| `provider` | | API provider (`'openai'`). |
| `threshold` | | Minimum similarity score to pass. |

Source code in src/agentprobe/eval/embedding.py
__init__(*, model='text-embedding-3-small', provider='openai', api_key=None, threshold=0.8, name='embedding-similarity')

Initialize the embedding similarity evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Embedding model name. | `'text-embedding-3-small'` |
| `provider` | `str` | API provider. | `'openai'` |
| `api_key` | `str \| None` | API key. Read from environment if `None`. | `None` |
| `threshold` | `float` | Minimum similarity to pass. | `0.8` |
| `name` | `str` | Evaluator name. | `'embedding-similarity'` |

Source code in src/agentprobe/eval/embedding.py
cosine_similarity(vec_a, vec_b)

Compute cosine similarity between two vectors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vec_a` | `list[float]` | First vector. | required |
| `vec_b` | `list[float]` | Second vector. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Cosine similarity score in [-1.0, 1.0]. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If vectors have different lengths or are empty. |

Source code in src/agentprobe/eval/embedding.py
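The documented contract can be sketched as a plain-Python implementation (a sketch matching the docstring above, not the library's actual source; handling of zero-norm vectors is an assumption):

```python
import math

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if not vec_a or not vec_b or len(vec_a) != len(vec_b):
        raise ValueError("vectors must be non-empty and the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Because embedding APIs typically return vectors of fixed, non-zero magnitude, the score in practice lands in [-1.0, 1.0] and is compared against `threshold` (default 0.8) to decide pass/fail.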
Judge Evaluator

agentprobe.eval.llm_judge

Judge evaluator that uses a language model to assess agent outputs. Sends the agent's output along with a rubric to a judge model and parses the structured JSON response into an EvalResult.
LLMJudge

Bases: `BaseEvaluator`

Evaluator that uses a language model as a judge. Calls an external model API (Anthropic or OpenAI) with the agent's output and a rubric, then parses the JSON verdict response.

Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | | The judge model identifier. |
| `provider` | | API provider (`'anthropic'` or `'openai'`). |
| `temperature` | | Sampling temperature for the judge. |
| `max_tokens` | | Maximum response tokens. |
| `rubric` | | Evaluation rubric/criteria text. |

Source code in src/agentprobe/eval/llm_judge.py
__init__(*, model='claude-sonnet-4-5-20250929', provider='anthropic', api_key=None, temperature=0.0, max_tokens=1024, rubric='', name='llm-judge')

Initialize the judge evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Judge model identifier. | `'claude-sonnet-4-5-20250929'` |
| `provider` | `str` | API provider name. | `'anthropic'` |
| `api_key` | `str \| None` | API key. Read from environment if `None`. | `None` |
| `temperature` | `float` | Sampling temperature. | `0.0` |
| `max_tokens` | `int` | Max response tokens. | `1024` |
| `rubric` | `str` | Evaluation criteria text. | `''` |
| `name` | `str` | Evaluator name. | `'llm-judge'` |

Source code in src/agentprobe/eval/llm_judge.py
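The "parses the JSON verdict response" step can be sketched as follows. This is a hypothetical illustration: the verdict field names (`score`, `verdict`, `reasoning`) and the fence-stripping behavior are assumptions, not the library's actual schema.

```python
import json
import re

def parse_verdict(response_text: str) -> dict:
    """Extract and validate a JSON verdict object from a judge model's reply.

    Judge models often wrap JSON in a ```json fence or surround it with
    prose, so we locate the outermost {...} span before parsing.
    """
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    verdict = json.loads(match.group(0))
    if not 0.0 <= verdict["score"] <= 1.0:
        raise ValueError("judge score must be in [0, 1]")
    return verdict

raw = '```json\n{"score": 0.9, "verdict": "pass", "reasoning": "Meets rubric."}\n```'
print(parse_verdict(raw)["verdict"])  # pass
```

Setting `temperature=0.0` (the default) keeps the judge's verdicts as deterministic as the provider allows, which matters when the same rubric is applied across many test cases.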
Statistical Evaluator

agentprobe.eval.statistical

Statistical evaluator for repeated evaluation with aggregated metrics. Wraps an inner evaluator and runs it multiple times across pre-collected traces, computing mean, standard deviation, percentiles, and confidence intervals from the score distribution.
StatisticalEvaluator

Bases: `BaseEvaluator`

Evaluator that runs an inner evaluator multiple times and aggregates stats. Wraps another evaluator and runs it against multiple traces for the same test case, computing distributional statistics on the resulting scores.

Attributes:

| Name | Type | Description |
|---|---|---|
| `inner` | `BaseEvaluator` | The wrapped evaluator instance. |
| `pass_threshold` | `float` | Minimum mean score to consider a pass. |

Source code in src/agentprobe/eval/statistical.py
inner property

Return the wrapped evaluator.
__init__(inner, *, name=None, pass_threshold=0.7)

Initialize the statistical evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inner` | `BaseEvaluator` | The evaluator to wrap and run repeatedly. | required |
| `name` | `str \| None` | Optional name override. Defaults to `'statistical-{inner.name}'`. | `None` |
| `pass_threshold` | `float` | Minimum mean score for a pass verdict. | `0.7` |

Source code in src/agentprobe/eval/statistical.py
evaluate_multiple(test_case, traces) async

Evaluate multiple traces and compute aggregate statistics. Runs the inner evaluator on each trace, collects scores, and computes mean, standard deviation, median, percentiles, and a 95% confidence interval.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `test_case` | `TestCase` | The test case specification. | required |
| `traces` | `Sequence[Trace]` | Pre-collected traces to evaluate. | required |

Returns:

| Type | Description |
|---|---|
| `StatisticalSummary` | A statistical summary of the score distribution. |

Source code in src/agentprobe/eval/statistical.py
summary_to_eval_result(summary)

Convert a statistical summary into a standard EvalResult.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `summary` | `StatisticalSummary` | The summary to convert. | required |

Returns:

| Type | Description |
|---|---|
| `EvalResult` | An EvalResult with the mean score and appropriate verdict. |

Source code in src/agentprobe/eval/statistical.py
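The aggregation that `evaluate_multiple` performs over per-trace scores can be sketched with the standard library. This is an illustrative sketch under assumptions: the exact summary fields and the use of a normal-approximation confidence interval (mean ± 1.96 × standard error) are not confirmed by the source.

```python
import statistics

def summarize(scores: list[float]) -> dict[str, float]:
    """Aggregate a score distribution: mean, stdev, median, and a 95% CI.

    The CI here uses the normal approximation 1.96 * (stdev / sqrt(n));
    this is an assumption about how the library computes it.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n > 1 else 0.0
    stderr = stdev / n**0.5 if n > 1 else 0.0
    return {
        "mean": mean,
        "stdev": stdev,
        "median": statistics.median(scores),
        "ci95_low": mean - 1.96 * stderr,
        "ci95_high": mean + 1.96 * stderr,
    }

summary = summarize([0.8, 0.9, 0.7, 0.85, 0.75])
print(round(summary["mean"], 2))  # 0.8
```

Under the default `pass_threshold=0.7`, this distribution would pass: the mean (0.8) clears the threshold even though one individual score (0.7) only ties it — which is exactly why repeated evaluation with a mean-based verdict is more robust than a single run.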
Trace Comparison Evaluator

agentprobe.eval.trace_compare

Trace comparison evaluator with weighted multi-dimension scoring. Compares two traces across tool sequences, tool parameters, output similarity, and cost deviation, producing a weighted composite score.
TraceComparisonEvaluator

Bases: `BaseEvaluator`

Evaluator that compares a trace against a reference trace. Computes similarity across multiple dimensions with configurable weights: tool sequence, tool parameters, output text, and cost.

Attributes:

| Name | Type | Description |
|---|---|---|
| `reference_trace` | | The reference trace to compare against. |
| `weights` | | Per-dimension weight configuration. |

Source code in src/agentprobe/eval/trace_compare.py
__init__(reference_trace, *, name='trace-compare', weights=None, pass_threshold=0.7)

Initialize the trace comparison evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `reference_trace` | `Trace` | The baseline trace to compare against. | required |
| `name` | `str` | Evaluator name. | `'trace-compare'` |
| `weights` | `dict[str, float] \| None` | Dimension weight overrides. | `None` |
| `pass_threshold` | `float` | Minimum score for a pass verdict. | `0.7` |

Source code in src/agentprobe/eval/trace_compare.py
Base Evaluator

agentprobe.eval.base

Abstract base evaluator with template-method pattern. Subclasses implement `_evaluate()` while the base class handles timing, error wrapping, and consistent result construction.
BaseEvaluator

Bases: `ABC`

Abstract base class for all evaluators. Provides a public `evaluate()` template method that delegates to the subclass-defined `_evaluate()`, adding timing and error handling.

Attributes:

| Name | Type | Description |
|---|---|---|
| `_name` | | The evaluator's name, used in results and logging. |

Source code in src/agentprobe/eval/base.py
name property

Return the evaluator name.
__init__(name)

Initialize the evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | A unique name identifying this evaluator instance. | required |
evaluate(test_case, trace) async

Evaluate an agent trace for a given test case. This template method times the evaluation, catches errors, and ensures a consistent EvalResult is always returned.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `test_case` | `TestCase` | The test case that was executed. | required |
| `trace` | `Trace` | The execution trace to evaluate. | required |

Returns:

| Type | Description |
|---|---|
| `EvalResult` | An evaluation result with score and verdict. |
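The template-method pattern the base class uses can be sketched as follows. This is a minimal illustration, not the library's source: the `EvalResult` fields and the `"error"` verdict are assumptions.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class EvalResult:
    evaluator: str
    score: float
    verdict: str
    duration_s: float

class BaseEvaluator(ABC):
    def __init__(self, name: str) -> None:
        self._name = name

    @property
    def name(self) -> str:
        return self._name

    async def evaluate(self, test_case, trace) -> EvalResult:
        """Template method: time the run and never let errors escape."""
        start = time.perf_counter()
        try:
            result = await self._evaluate(test_case, trace)
        except Exception:
            # Failures become a consistent zero-score result instead of
            # propagating and aborting the whole evaluation run.
            result = EvalResult(self._name, 0.0, "error", 0.0)
        result.duration_s = time.perf_counter() - start
        return result

    @abstractmethod
    async def _evaluate(self, test_case, trace) -> EvalResult:
        """Subclasses implement the actual scoring logic here."""

class AlwaysPass(BaseEvaluator):
    async def _evaluate(self, test_case, trace) -> EvalResult:
        return EvalResult(self._name, 1.0, "pass", 0.0)

result = asyncio.run(AlwaysPass("demo").evaluate(None, None))
print(result.verdict)  # pass
```

Because `evaluate()` owns timing and error handling, every concrete evaluator in the module (rules, embedding, judge, statistical, trace comparison) inherits the same guarantee: one call, one `EvalResult`, no unhandled exceptions.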