# Regression Testing
AgentProbe's regression testing system helps you detect behavioral changes between agent versions by comparing test results against saved baselines.
## Overview
The regression testing workflow:
1. Run your test suite and save the results as a baseline
2. Make changes to your agent
3. Run the test suite again and compare against the baseline
4. Review detected regressions and improvements
## Managing Baselines
### Save a Baseline
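A hypothetical command-line sketch (the `baseline save` subcommand name is an assumption, not a documented AgentProbe command; the `BaselineManager` API below is the documented interface):

```
agentprobe baseline save v1.0
```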
### List Baselines
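Again as a hypothetical sketch (the subcommand name is an assumption; `manager.list_baselines()` below is the documented equivalent):

```
agentprobe baseline list
```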
### Delete a Baseline
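A hypothetical sketch of removing a saved baseline (the subcommand name is an assumption; no delete method is shown in the Python API below):

```
agentprobe baseline delete v1.0
```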
### Using the Python API
```python
from agentprobe import BaselineManager

manager = BaselineManager(baseline_dir=".agentprobe/baselines")

# Save results as a baseline
path = manager.save("v1.0", test_results)

# Load a baseline
baseline_results = manager.load("v1.0")

# Check if a baseline exists
if manager.exists("v1.0"):
    print("Baseline found")

# List all baselines
baselines = manager.list_baselines()
```
## Detecting Regressions
The `RegressionDetector` compares current test results against a baseline and flags significant score changes:
```python
from agentprobe import RegressionDetector

detector = RegressionDetector(threshold=0.05)
report = detector.compare(
    baseline_name="v1.0",
    baseline_results=baseline_results,
    current_results=current_results,
)
```
The `threshold` parameter controls sensitivity: a score delta must exceed this value to be flagged. The default is 0.05 (5%).
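The thresholding rule can be written out in plain Python. This is an illustrative sketch of the comparison logic only; the report structure here is invented for the example, and the actual `RegressionDetector` output may differ:

```python
def flag_changes(baseline: dict[str, float],
                 current: dict[str, float],
                 threshold: float = 0.05) -> dict[str, list[str]]:
    """Split tests present in both runs into regressions and improvements.

    A test is flagged only when |current - baseline| exceeds the threshold.
    """
    report = {"regressions": [], "improvements": []}
    for name in baseline.keys() & current.keys():
        delta = current[name] - baseline[name]
        if delta < -threshold:
            report["regressions"].append(name)
        elif delta > threshold:
            report["improvements"].append(name)
    return report
```

Note that a delta of exactly `threshold` is not flagged, matching the "must exceed" rule above.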
## Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable regression detection |
| `baseline_dir` | string | `.agentprobe/baselines` | Directory for baseline files |
| `threshold` | float | `0.05` | Score delta threshold (0.0 to 1.0) |
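Assuming AgentProbe reads a YAML configuration file and these keys live under a `regression` section (both the filename and the nesting are assumptions), a sketch matching the table's keys, with detection switched on:

```yaml
# agentprobe.yaml (hypothetical filename and nesting)
regression:
  enabled: true                        # off (false) by default
  baseline_dir: .agentprobe/baselines
  threshold: 0.05                      # flag score deltas larger than 5%
```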
## Best Practices
- Save baselines at release points: name them after versions
- Set appropriate thresholds: too low causes noise, too high misses real regressions
- Integrate into CI/CD: compare against the latest stable baseline on every PR
- Review improvements too: unexpected score increases can indicate test issues