Evaluations

Agent Health evaluates AI agents by comparing their execution trajectories against expected outcomes using an LLM judge. This page explains the evaluation pipeline and core concepts.

An evaluation follows this flow:

  1. A test case provides a prompt, context, and expected outcomes
  2. The agent executes the prompt and produces a trajectory (sequence of steps)
  3. An LLM judge compares the trajectory against expected outcomes
  4. The judge returns a score with pass/fail status, accuracy, reasoning, and improvement suggestions
```mermaid
graph LR
    TC["Test Case"] -->|prompt + context| Agent
    Agent -->|trajectory| Judge["LLM Judge"]
    TC -->|expected outcomes| Judge
    Judge -->|score| Results["Pass/Fail + Accuracy + Reasoning"]
```
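The flow above can be sketched in a few lines of Python. This is an illustrative mock, not the Agent Health API: `TestCase`, `mock_judge`, and the scoring rule are assumptions, and a real judge would use an LLM rather than tool-name overlap.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    context: str
    expected_tools: list  # expected outcomes, simplified to tool names

def mock_judge(trajectory: list, expected_tools: list) -> dict:
    """Toy stand-in for the LLM judge: scores overlap between the
    agent's tool calls and the expected tool calls."""
    hits = sum(1 for t in expected_tools if t in trajectory)
    accuracy = hits / len(expected_tools) if expected_tools else 1.0
    return {
        "passed": accuracy >= 0.8,
        "accuracy": round(accuracy * 100),
        "reasoning": f"{hits}/{len(expected_tools)} expected tools invoked",
    }

tc = TestCase(prompt="Refund order 42", context="orders API",
              expected_tools=["lookup_order", "issue_refund"])
trajectory = ["lookup_order", "issue_refund", "send_email"]  # agent's steps
result = mock_judge(trajectory, tc.expected_tools)
print(result["passed"], result["accuracy"])
```

The mock returns the same shape of score a real evaluation produces: pass/fail, an accuracy percentage, and a reasoning string.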

A “Golden Path” is the expected trajectory an agent should follow to successfully complete a task. It defines:

  • Which tools the agent should invoke
  • What reasoning steps are expected
  • What the final response should contain

The LLM judge doesn’t require an exact match — it evaluates whether the agent’s actual trajectory achieves the expected outcomes through reasonable steps, even if the specific path differs.
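This outcome-based (rather than exact-match) comparison can be illustrated as follows. The `golden_path` schema and the `achieves_outcomes` helper are hypothetical names invented for this sketch:

```python
# Illustrative golden-path definition (field names are assumptions,
# not the actual Agent Health schema).
golden_path = {
    "tools": ["search_docs", "summarize"],             # tools to invoke
    "final_response_contains": ["summary"],            # required in the answer
}

def achieves_outcomes(trajectory_tools, final_response, path) -> bool:
    """Outcome-based check: expected tools were used and the response is
    complete, regardless of extra steps or ordering."""
    tools_ok = set(path["tools"]) <= set(trajectory_tools)
    response_ok = all(s in final_response
                      for s in path["final_response_contains"])
    return tools_ok and response_ok

# A trajectory with an extra tool and a different order still passes:
ok = achieves_outcomes(["summarize", "search_docs", "log_event"],
                       "Here is a summary of the topic.", golden_path)
print(ok)
```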

Each evaluation produces a judge result with:

| Field | Description |
| --- | --- |
| Pass/Fail | Whether the agent met the expected outcomes |
| Accuracy | Performance score (0-100%) |
| Reasoning | Detailed analysis of the agent's trajectory |
| Improvements | Suggestions for better agent performance |
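A judge result could be modeled as a small record type. The `JudgeResult` name and field types below are assumptions for illustration; they simply mirror the four fields above:

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    passed: bool          # Pass/Fail
    accuracy: float       # performance score, 0-100
    reasoning: str        # analysis of the trajectory
    improvements: list    # suggestions for better agent performance

r = JudgeResult(passed=True, accuracy=92.0,
                reasoning="Agent called the expected tools in order.",
                improvements=["Cite the order ID in the final response"])
print(r.passed, r.accuracy)
```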
Two judge backends are available:

| | Demo Judge | Production Judge |
| --- | --- | --- |
| Backend | In-memory mock | AWS Bedrock |
| Credentials | None required | AWS credentials |
| Scoring | Simulated scores | Real LLM evaluation |
| Use case | Testing and exploration | Production evaluation |

To use the production judge, configure AWS credentials in your .env file:

```
AWS_REGION=us-west-2
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
```
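A backend-selection check based on these variables might look like the sketch below. The `select_judge_backend` function is a hypothetical example of the fallback behavior, not code from Agent Health:

```python
import os

def select_judge_backend(env=os.environ) -> str:
    """Use the production (Bedrock) judge only when all AWS
    credentials are present; otherwise fall back to the demo judge."""
    required = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return "production" if all(env.get(k) for k in required) else "demo"

print(select_judge_backend({}))  # no credentials -> "demo"
print(select_judge_backend({
    "AWS_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "your_key",
    "AWS_SECRET_ACCESS_KEY": "your_secret",
}))  # all three set -> "production"
```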
  • Test Cases — create and manage evaluation scenarios
  • Experiments — run batch evaluations and compare results