Multi-Agent Testing¶
AgentProbe supports testing multiple agents and comparing their behavior side-by-side.
Testing Multiple Agents¶
Use adapters to connect AgentProbe to different agent frameworks:
from agentprobe.adapters.langchain import LangChainAdapter
from agentprobe.core.config import load_config
from agentprobe.core.runner import TestRunner

config = load_config()

# Two adapters wrapping different agent builds (here both LangChain-based).
adapters = [
    LangChainAdapter(name="agent-v1"),
    LangChainAdapter(name="agent-v2"),
]

runner = TestRunner(config=config)

# Run the same test cases against each agent. Assumes an async context
# and a `test_cases` list defined elsewhere.
for adapter in adapters:
    run = await runner.run(test_cases, adapter)
    print(f"{adapter.name}: {run.passed}/{run.total_tests} passed")
Comparing Agent Versions¶
Use trace comparison to quantify differences between agent versions:
from agentprobe import TraceComparisonEvaluator

# Capture a trace from each agent version for the same input.
# Assumes adapter_v1 and adapter_v2 are already-configured adapters.
trace_v1 = await adapter_v1.invoke("What is the weather?")
trace_v2 = await adapter_v2.invoke("What is the weather?")

# Score trace_v2 against trace_v1 as the reference.
evaluator = TraceComparisonEvaluator(
    reference_trace=trace_v1,
    pass_threshold=0.7,
)

result = await evaluator.evaluate(test_case, trace_v2)
print(f"Similarity score: {result.score:.2f}")
Multi-Turn Conversations¶
Test agents across multi-turn dialogues:
from agentprobe import ConversationRunner

runner = ConversationRunner(
    adapter=my_adapter,
    evaluators=[my_evaluator],
)

# Each string is one user turn; the runner feeds them in order,
# carrying conversation state from turn to turn.
result = await runner.run(
    turns=[
        "Book a flight to London",
        "Make it for next Tuesday",
        "Economy class, please",
    ],
    agent_name="booking-agent",
)

print(f"Turns passed: {result.passed_turns}/{result.total_turns}")
print(f"Aggregate score: {result.aggregate_score:.2f}")
Statistical Comparison¶
Compare agents across multiple runs using the StatisticalEvaluator:
from agentprobe import StatisticalEvaluator
from agentprobe.eval.rules import RuleBasedEvaluator

# Wrap an inner evaluator; the statistical layer aggregates its scores.
inner = RuleBasedEvaluator(rules=[...])
evaluator = StatisticalEvaluator(inner, pass_threshold=0.7)

# Sample the same input several times to capture run-to-run variance.
traces = [await adapter.invoke(input_text) for _ in range(10)]

summary = await evaluator.evaluate_multiple(test_case, traces)
print(f"Mean score: {summary.mean:.3f} +/- {summary.std_dev:.3f}")
Best Practices¶
- Use the same test cases across agents for fair comparison
- Run multiple iterations to account for non-deterministic behavior
- Compare costs alongside quality scores
- Track metrics over time with the metrics system
- Save baselines per agent version for regression tracking (see the sketch after this list)
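As a concrete example of the last point, a baseline can be as simple as a JSON file keyed by agent version. This sketch uses only the standard library plus the run fields shown earlier; the file name and layout are assumptions, not an AgentProbe convention:

import json
from pathlib import Path

BASELINE_FILE = Path("baselines.json")  # hypothetical location

def save_baseline(agent_name: str, run) -> None:
    """Record pass counts for one agent version."""
    data = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    data[agent_name] = {"passed": run.passed, "total": run.total_tests}
    BASELINE_FILE.write_text(json.dumps(data, indent=2))

def check_regression(agent_name: str, run) -> bool:
    """Return True if this run passed at least as many tests as the saved baseline."""
    data = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    baseline = data.get(agent_name)
    return baseline is None or run.passed >= baseline["passed"]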