Evaluations
Agent Health evaluates AI agents by comparing their execution trajectories against expected outcomes using an LLM judge. This page explains the evaluation pipeline and core concepts.
How evaluations work
An evaluation follows this flow:
- A test case provides a prompt, context, and expected outcomes
- The agent executes the prompt and produces a trajectory (sequence of steps)
- An LLM judge compares the trajectory against expected outcomes
- The judge returns a score with pass/fail status, accuracy, reasoning, and improvement suggestions
graph LR
TC["Test Case"] -->|prompt + context| Agent
Agent -->|trajectory| Judge["LLM Judge"]
TC -->|expected outcomes| Judge
Judge -->|score| Results["Pass/Fail + Accuracy + Reasoning"]
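The flow in the diagram can be sketched in a few lines of Python. This is an illustration only, not Agent Health's implementation: the dictionary keys, `run_evaluation`, `stub_agent`, and `stub_judge` are hypothetical names, and the stub judge uses substring matching as a stand-in for a real LLM call.

```python
from typing import Callable

# Hypothetical shapes: a trajectory is a list of step descriptions, and a
# judge result is a plain dict carrying the four fields the judge returns.
Trajectory = list[str]
Agent = Callable[[str, str], Trajectory]
Judge = Callable[[Trajectory, list[str]], dict]


def run_evaluation(test_case: dict, agent: Agent, judge: Judge) -> dict:
    """Run one test case through the agent, then hand the trajectory to the judge."""
    trajectory = agent(test_case["prompt"], test_case["context"])
    return judge(trajectory, test_case["expected_outcomes"])


def stub_agent(prompt: str, context: str) -> Trajectory:
    # Stand-in for a real agent: always returns the same three steps.
    return [
        "call tool: search_logs",
        "reason: errors spike at 10:00",
        "respond: root cause is a bad deploy",
    ]


def stub_judge(trajectory: Trajectory, expected_outcomes: list[str]) -> dict:
    # Stand-in for the LLM judge: substring matching instead of a model call.
    text = " ".join(trajectory).lower()
    met = [o for o in expected_outcomes if o.lower() in text]
    missed = [o for o in expected_outcomes if o not in met]
    accuracy = 100.0 * len(met) / len(expected_outcomes) if expected_outcomes else 0.0
    return {
        "passed": not missed,
        "accuracy": accuracy,
        "reasoning": f"{len(met)} of {len(expected_outcomes)} expected outcomes found",
        "improvements": [f"cover: {o}" for o in missed],
    }


test_case = {
    "prompt": "Why did checkout fail?",
    "context": "service=checkout",
    "expected_outcomes": ["search_logs", "bad deploy"],
}
result = run_evaluation(test_case, stub_agent, stub_judge)
print(result["passed"], result["accuracy"])
```

Passing the agent and judge in as callables keeps the pipeline itself independent of which judge backend is used.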
Golden Path concept
A “Golden Path” is the expected trajectory an agent should follow to successfully complete a task. It defines:
- Which tools the agent should invoke
- What reasoning steps are expected
- What the final response should contain
The LLM judge doesn’t require an exact match — it evaluates whether the agent’s actual trajectory achieves the expected outcomes through reasonable steps, even if the specific path differs.
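One way to picture a Golden Path is as a small record holding those three kinds of expectation, flattened into outcome statements for the judge. The sketch below assumes that shape; `GoldenPath`, its field names, and `to_expected_outcomes` are hypothetical and are not taken from Agent Health's schema.

```python
from dataclasses import dataclass


@dataclass
class GoldenPath:
    """Hypothetical container for the three things a Golden Path defines."""
    expected_tools: list[str]
    expected_reasoning: list[str]
    response_must_contain: list[str]


def to_expected_outcomes(path: GoldenPath) -> list[str]:
    """Flatten a Golden Path into outcome statements a judge could be given."""
    outcomes = [f"Invokes the tool '{t}'" for t in path.expected_tools]
    outcomes += [f"Reasons that {r}" for r in path.expected_reasoning]
    outcomes += [f"Final response mentions {c}" for c in path.response_must_contain]
    return outcomes


path = GoldenPath(
    expected_tools=["search_logs"],
    expected_reasoning=["the error rate rose after the latest deploy"],
    response_must_contain=["a rollback recommendation"],
)
for outcome in to_expected_outcomes(path):
    print("-", outcome)
```

Phrasing the expectations as outcomes rather than as an ordered script fits the non-exact matching described above: the judge can accept a different route that still satisfies each statement.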
LLM Judge output
Each evaluation produces a judge result with:
| Field | Description |
|---|---|
| Pass/Fail | Whether the agent met the expected outcomes |
| Accuracy | Performance score (0-100%) |
| Reasoning | Detailed analysis of the agent’s trajectory |
| Improvements | Suggestions for better agent performance |
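If the judge result is consumed in code, the four fields map naturally onto a typed record. The following is a minimal sketch under the assumption that the result arrives as JSON with the keys `passed`, `accuracy`, `reasoning`, and `improvements`; those key names and the `JudgeResult` class are illustrative and may not match the actual payload.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResult:
    passed: bool             # Pass/Fail
    accuracy: float          # Accuracy, as a percentage from 0 to 100
    reasoning: str           # Reasoning
    improvements: list[str]  # Improvements


def parse_judge_result(raw: str) -> JudgeResult:
    """Parse a JSON judge result and reject an accuracy outside 0-100."""
    data = json.loads(raw)
    accuracy = float(data["accuracy"])
    if not 0.0 <= accuracy <= 100.0:
        raise ValueError(f"accuracy out of range: {accuracy}")
    return JudgeResult(
        passed=bool(data["passed"]),
        accuracy=accuracy,
        reasoning=str(data["reasoning"]),
        improvements=[str(s) for s in data.get("improvements", [])],
    )


# Hypothetical payload, written by hand for this example.
raw = (
    '{"passed": false, "accuracy": 62.5, '
    '"reasoning": "Skipped the log search.", '
    '"improvements": ["Query logs before answering"]}'
)
result = parse_judge_result(raw)
print(result.passed, result.accuracy)
```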
Demo Judge vs Production Judge
| | Demo Judge | Production Judge |
|---|---|---|
| Backend | In-memory mock | AWS Bedrock |
| Credentials | None required | AWS credentials |
| Scoring | Simulated scores | Real LLM evaluation |
| Use case | Testing and exploration | Production evaluation |
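The Credentials row suggests a simple pre-flight check before running against the production judge. The sketch below is an assumption about how such a check could look, not a description of how Agent Health picks a judge: `choose_judge` and `REQUIRED_VARS` are hypothetical, and the variable names are the AWS settings this page lists for the production judge.

```python
import os
from typing import Mapping

# The AWS settings the production judge needs, per this page.
REQUIRED_VARS = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def choose_judge(env: Mapping[str, str] = os.environ) -> str:
    """Return "production" only when every AWS variable is set and non-empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    return "demo" if missing else "production"


# No credentials at all: fall back to the demo judge.
print(choose_judge({}))

# All three variables present: the production judge can be used.
print(choose_judge({
    "AWS_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "your_key",
    "AWS_SECRET_ACCESS_KEY": "your_secret",
}))
```

Taking the environment as a parameter makes the check easy to exercise without touching real credentials.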
To use the production judge, configure AWS credentials in your .env file:
AWS_REGION=us-west-2
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
Next steps
- Test Cases: create and manage evaluation scenarios
- Experiments: run batch evaluations and compare results