Metrics
Metric collection, aggregation, and trend analysis.
Collector
agentprobe.metrics.collector
Stateless metric collector that extracts measurements from traces and results.
Converts traces, test results, and agent runs into MetricValue instances for storage and analysis.
MetricCollector
Extracts metric values from traces, results, and runs.
Stateless: receives objects and returns lists of MetricValue. Does not store or persist anything.
Source code in src/agentprobe/metrics/collector.py
collect_from_trace(trace)
Extract metric values from a single trace.
Collects latency, tool call count, and response length metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `trace` | `Trace` | The execution trace to extract metrics from. | required |
Returns:
| Type | Description |
|---|---|
| `list[MetricValue]` | A list of metric values extracted from the trace. |
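
As a rough illustration of the stateless pattern described above, the sketch below turns one trace into a list of metric values without keeping any state. `SketchTrace`, `SketchMetricValue`, their fields, and the metric names are hypothetical stand-ins chosen for this example; the real `Trace` and `MetricValue` models in agentprobe may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchMetricValue:
    """Hypothetical stand-in for MetricValue (field names are assumptions)."""
    metric_name: str
    value: float


@dataclass
class SketchTrace:
    """Hypothetical stand-in for Trace (field names are assumptions)."""
    latency_ms: float
    tool_calls: list[str] = field(default_factory=list)
    output_text: str = ""


def collect_from_trace_sketch(trace: SketchTrace) -> list[SketchMetricValue]:
    """Return one value per metric; nothing is stored between calls."""
    return [
        SketchMetricValue("latency_ms", trace.latency_ms),
        SketchMetricValue("tool_call_count", float(len(trace.tool_calls))),
        SketchMetricValue("response_length", float(len(trace.output_text))),
    ]


trace = SketchTrace(latency_ms=120.0, tool_calls=["search", "fetch"], output_text="done")
trace_metrics = collect_from_trace_sketch(trace)
```

Because the function only reads its argument and returns a new list, calling it twice on the same trace yields equal results, which is the property the "stateless" description implies.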
collect_from_result(result)
Extract metric values from a test result.
Collects latency, eval score, and any trace-level metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `result` | `TestResult` | The test result to extract metrics from. | required |
Returns:
| Type | Description |
|---|---|
| `list[MetricValue]` | A list of metric values extracted from the result. |
collect_from_run(run)
Extract metric values from a complete agent run.
Collects pass rate plus per-result metrics for all results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `run` | `AgentRun` | The agent run to extract metrics from. | required |
Returns:
| Type | Description |
|---|---|
| `list[MetricValue]` | A list of metric values extracted from the run. |
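
To make "pass rate plus per-result metrics" concrete, here is a minimal sketch that assumes the pass rate is the fraction of results that passed. The dictionary keys (`passed`, `latency_ms`), the metric names, and the choice to emit no pass rate for an empty run are assumptions for illustration, not the library's documented behaviour.

```python
def collect_from_run_sketch(results: list[dict]) -> list[tuple[str, float]]:
    """Return (metric_name, value) pairs for a run.

    Each result is a dict with 'passed' (bool) and 'latency_ms' (float);
    both keys are hypothetical.
    """
    metrics: list[tuple[str, float]] = []
    if results:
        passed = sum(1 for r in results if r["passed"])
        # Run-level metric: share of results that passed.
        metrics.append(("pass_rate", passed / len(results)))
    for r in results:
        # Per-result metric, one entry per result in the run.
        metrics.append(("latency_ms", r["latency_ms"]))
    return metrics


sketch_run = [
    {"passed": True, "latency_ms": 110.0},
    {"passed": False, "latency_ms": 250.0},
    {"passed": True, "latency_ms": 95.0},
    {"passed": True, "latency_ms": 130.0},
]
run_metrics = collect_from_run_sketch(sketch_run)
```

With three of four results passing, the first entry is `("pass_rate", 0.75)`, followed by four latency entries.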
Aggregator
agentprobe.metrics.aggregator
Metric aggregation: computes statistical summaries from metric values.
Uses the stdlib statistics module for calculations, so there is no numpy dependency.
MetricAggregator
Computes statistical aggregations over collections of metric values.
Supports mean, median, min, max, p95, p99, and standard deviation.
All computations use the stdlib statistics module.
Source code in src/agentprobe/metrics/aggregator.py
aggregate(values)
Aggregate a list of metric values into summary statistics.
All values must share the same metric_name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[MetricValue]` | List of metric values to aggregate. | required |
Returns:
| Type | Description |
|---|---|
| `MetricAggregation` | A MetricAggregation with computed statistics. |
Raises:
| Type | Description |
|---|---|
| `MetricsError` | If values is empty or metric names are inconsistent. |
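
The listed statistics can all be computed with the stdlib `statistics` module. The sketch below is one possible way to do that on plain floats; the percentile method (`inclusive`), the use of the sample standard deviation, the single-value fallback, and raising `ValueError` instead of `MetricsError` are assumptions made for this example rather than details confirmed by the source.

```python
import statistics


def aggregate_sketch(samples: list[float]) -> dict[str, float]:
    """Summary statistics using only the stdlib statistics module."""
    if not samples:
        raise ValueError("cannot aggregate an empty list")
    if len(samples) == 1:
        # quantiles() and stdev() need at least two points on older Pythons,
        # so a single sample is handled explicitly.
        p95 = p99 = samples[0]
        std_dev = 0.0
    else:
        # n=100 yields 99 cut points; index 94 is the 95th percentile
        # and index 98 is the 99th.
        cuts = statistics.quantiles(samples, n=100, method="inclusive")
        p95, p99 = cuts[94], cuts[98]
        std_dev = statistics.stdev(samples)
    return {
        "count": float(len(samples)),
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "p95": p95,
        "p99": p99,
        "std_dev": std_dev,
    }


agg = aggregate_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the five-value input, the mean and median are both 3.0 and the inclusive 95th percentile interpolates between 4.0 and 5.0.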
aggregate_by_name(values)
Group metric values by name and aggregate each group.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[MetricValue]` | List of metric values (may contain multiple metric names). | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, MetricAggregation]` | A dictionary mapping metric names to their aggregations. |
Raises:
| Type | Description |
|---|---|
| `MetricsError` | If values is empty. |
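
Grouping by name before aggregating can be sketched with a `defaultdict`. The example works on plain `(name, value)` tuples and uses the mean as a stand-in for the full aggregation; the tuple representation and the metric names are assumptions for illustration only.

```python
import statistics
from collections import defaultdict


def group_by_name_sketch(pairs: list[tuple[str, float]]) -> dict[str, list[float]]:
    """Bucket values by metric name, preserving input order within each bucket."""
    groups: dict[str, list[float]] = defaultdict(list)
    for name, value in pairs:
        groups[name].append(value)
    return dict(groups)


mixed = [("latency_ms", 100.0), ("eval_score", 0.8), ("latency_ms", 140.0)]
grouped = group_by_name_sketch(mixed)
# Each group holds a single metric name, so it can be aggregated on its own.
means_by_name = {name: statistics.fmean(vals) for name, vals in grouped.items()}
```

Splitting first guarantees that every per-group aggregation sees values with one consistent metric name.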
Trend Analysis
agentprobe.metrics.trend
Metric trend analysis: detects improving, degrading, or stable trends.
Compares recent metric values against a historical window to determine whether performance is changing over time.
MetricTrend
Analyzes metric trends by comparing recent vs historical values.
Uses a split-window approach: divides a time-ordered series of values into a historical window and a recent window, then compares means.
Attributes:
| Name | Type | Description |
|---|---|---|
| `threshold` | `float` | Minimum relative change to flag as improving/degrading. |
Source code in src/agentprobe/metrics/trend.py
__init__(threshold=0.1)
Initialize the trend analyzer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Minimum relative change (fraction) to consider a trend as improving or degrading. Defaults to 0.1 (10%). | `0.1` |
analyze(values, lower_is_better=True)
Analyze the trend direction for a series of metric values.
Splits the values in half (by order) and compares means.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[MetricValue]` | Time-ordered list of metric values (oldest first). | required |
| `lower_is_better` | `bool` | Whether lower values indicate improvement. | `True` |
Returns:
| Type | Description |
|---|---|
| `TrendDirection` | The detected trend direction. |
Raises:
| Type | Description |
|---|---|
| `MetricsError` | If fewer than 2 values are provided. |
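
The split-window comparison can be sketched on a plain list of floats as follows. `SketchTrend` and its member names are hypothetical stand-ins for `TrendDirection`, and several details are assumptions: the extra element of an odd-length series goes to the recent window, the relative change is measured against the historical mean, a zero historical mean is treated as stable, and `ValueError` replaces `MetricsError`.

```python
from enum import Enum
from statistics import fmean


class SketchTrend(Enum):
    """Hypothetical stand-in for TrendDirection (member names are assumptions)."""
    IMPROVING = "improving"
    DEGRADING = "degrading"
    STABLE = "stable"


def split_window_trend_sketch(
    raw_values: list[float],
    lower_is_better: bool = True,
    threshold: float = 0.1,
) -> SketchTrend:
    """Compare the mean of the older half against the mean of the newer half."""
    if len(raw_values) < 2:
        raise ValueError("need at least 2 values")
    mid = len(raw_values) // 2
    historical_mean = fmean(raw_values[:mid])  # oldest values
    recent_mean = fmean(raw_values[mid:])      # newest values
    if historical_mean == 0:
        # Relative change is undefined here; reporting STABLE is an assumption.
        return SketchTrend.STABLE
    change = (recent_mean - historical_mean) / abs(historical_mean)
    if abs(change) < threshold:
        return SketchTrend.STABLE
    went_down = change < 0
    # A drop is an improvement only when lower values are better.
    return SketchTrend.IMPROVING if went_down == lower_is_better else SketchTrend.DEGRADING


latency_trend = split_window_trend_sketch([100.0, 100.0, 80.0, 80.0])
score_trend = split_window_trend_sketch([100.0, 100.0, 80.0, 80.0], lower_is_better=False)
flat_trend = split_window_trend_sketch([100.0, 102.0, 101.0, 103.0])
```

In the first call the recent mean is 20% below the historical mean, which exceeds the 10% threshold and counts as an improvement for a lower-is-better metric such as latency. The same series read as a score degrades, and the third series moves by about 1% and stays stable.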
analyze_series(raw_values, lower_is_better=True)
Analyze the trend from a raw numeric series.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_values` | `list[float]` | Time-ordered list of numeric values (oldest first). | required |
| `lower_is_better` | `bool` | Whether lower values indicate improvement. | `True` |
Returns:
| Type | Description |
|---|---|
| `TrendDirection` | The detected trend direction. |
Built-in Definitions
agentprobe.metrics.definitions
Built-in metric definitions for common agent performance measurements.
Provides a registry of standard metrics that can be collected automatically during test execution, covering latency, cost, token usage, and scores.
get_builtin_definitions()
Return all built-in metric definitions.
Returns:
| Type | Description |
|---|---|
| `dict[str, MetricDefinition]` | A dictionary mapping metric names to their definitions. |
get_definition(name)
Look up a built-in metric definition by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The metric name to look up. | required |
Returns:
| Type | Description |
|---|---|
| `MetricDefinition \| None` | The metric definition if found, otherwise None. |
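
The two lookup functions follow a familiar registry pattern: one returns the whole mapping, the other returns a single entry or `None`. The sketch below shows that pattern with a hypothetical `SketchDefinition` type and made-up registry entries; the real `MetricDefinition` fields and the built-in metric names are not shown in this reference and may differ. Returning a copy of the mapping is a design choice of this sketch, not a documented guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchDefinition:
    """Hypothetical stand-in for MetricDefinition (fields are assumptions)."""
    name: str
    unit: str
    lower_is_better: bool


# Hypothetical registry contents; the real built-in names may differ.
_SKETCH_REGISTRY: dict[str, SketchDefinition] = {
    "latency_ms": SketchDefinition("latency_ms", "ms", True),
    "eval_score": SketchDefinition("eval_score", "score", False),
}


def get_builtin_definitions_sketch() -> dict[str, SketchDefinition]:
    """Return a shallow copy so callers cannot mutate the shared registry."""
    return dict(_SKETCH_REGISTRY)


def get_definition_sketch(name: str) -> Optional[SketchDefinition]:
    """Return the definition for `name`, or None when it is not registered."""
    return _SKETCH_REGISTRY.get(name)


known = get_definition_sketch("latency_ms")
unknown = get_definition_sketch("does_not_exist")
```

Callers need to handle the `None` case explicitly, for example by skipping unknown metric names or falling back to a default direction when analysing trends.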