
Getting Started

This guide walks you through using Agent Health to evaluate AI agents. The application includes a Travel Planner multi-agent demo so you can explore all features without configuring external services.

Required:

  • Node.js v18.0.0 or higher
  • npm v8.0.0 or higher

Optional (for production use):

  • AWS credentials (for Bedrock LLM Judge)
  • OpenSearch cluster (for persistence and traces)

Check your installed versions:

node --version # Should be v18.0.0 or higher
npm --version # Should be v8.0.0 or higher
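
The two version checks can also be combined into a small preflight script. This is a minimal sketch, not part of Agent Health: the helper names `version_at_least` and `check_tool` are hypothetical, and it assumes each tool prints a version such as `v18.19.0` or `10.2.4` when called with `--version`.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not shipped with Agent Health.

# Succeed if version $1 (for example "v18.19.0") is at least version $2.
version_at_least() {
  va_have="${1#v}.0.0"   # pad so major, minor and patch fields always exist
  va_want="${2#v}.0.0"
  for va_field in 1 2 3; do
    va_a=$(printf '%s' "$va_have" | cut -d. -f"$va_field")
    va_b=$(printf '%s' "$va_want" | cut -d. -f"$va_field")
    va_a="${va_a%%[!0-9]*}"; va_a="${va_a:-0}"   # drop suffixes such as "-rc1"
    va_b="${va_b%%[!0-9]*}"; va_b="${va_b:-0}"
    if [ "$va_a" -gt "$va_b" ]; then return 0; fi
    if [ "$va_a" -lt "$va_b" ]; then return 1; fi
  done
  return 0
}

# Report whether tool $1 is installed and at least version $2.
check_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$1: not found"
    return 1
  fi
  ct_found=$("$1" --version)
  if version_at_least "$ct_found" "$2"; then
    echo "$1 $ct_found: ok"
  else
    echo "$1 $ct_found: need $2 or higher"
    return 1
  fi
}

# "|| true" keeps going so that both prerequisites are always reported.
check_tool node 18.0.0 || true
check_tool npm 8.0.0 || true
```

The comparison is done field by field as integers, so `v18.10.0` correctly ranks above `v18.9.0`, which a plain string comparison would get wrong.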

Run Agent Health with npx (no installation needed):

npx @opensearch-project/agent-health@latest

What happens:

  1. Downloads Agent Health (if first run)
  2. Starts the server on port 4001
  3. Opens your browser to http://localhost:4001
  4. Loads sample data automatically
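
When the server is started from a script rather than interactively, it can help to wait until port 4001 answers before continuing. The following is a rough sketch under stated assumptions: `wait_for_server` is a hypothetical helper, `curl` must be installed, and the server is assumed to answer a plain GET on its root URL with a success status, which this guide does not state explicitly.

```shell
#!/bin/sh
# Hypothetical readiness check; not part of Agent Health.

# wait_for_server URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
  ws_url="$1"
  ws_attempts="${2:-30}"
  ws_delay="${3:-1}"
  ws_try=1
  while :; do
    # --fail makes curl exit non-zero on HTTP error statuses as well.
    if curl --silent --fail --output /dev/null --max-time 2 "$ws_url"; then
      echo "ready after $ws_try attempt(s): $ws_url"
      return 0
    fi
    if [ "$ws_try" -ge "$ws_attempts" ]; then
      break
    fi
    ws_try=$((ws_try + 1))
    sleep "$ws_delay"
  done
  echo "not reachable after $ws_attempts attempt(s): $ws_url"
  return 1
}
```

For example, `wait_for_server http://localhost:4001` polls once per second for up to 30 attempts and returns a non-zero status if the server never responds.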

For frequent use, install globally:

npm install -g @opensearch-project/agent-health
agent-health

Agent Health includes a built-in Travel Planner multi-agent demo and a Demo Judge, so you can test without external services.

Demo Agent:

  • Simulates a multi-agent Travel Planner system with realistic trajectories
  • Agent types: Travel Coordinator, Weather Agent, Events Agent, Booking Agent, Budget Agent
  • No external endpoint required; select “Demo Agent” in the agent dropdown

Demo Judge:

  • Provides mock evaluation scores without AWS Bedrock
  • Automatically selected when using Demo Agent
  • No AWS credentials required

The Travel Planner demo includes pre-loaded sample data:

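| Data Type   | Count | Description                                     |
|-------------|-------|-------------------------------------------------|
| Test Cases  | 5     | Travel Planner multi-agent scenarios            |
| Experiments | 2     | Demo experiments with completed runs            |
| Runs        | 6     | Completed evaluation results across experiments |
| Traces      | 5     | OpenTelemetry trace trees for visualization     |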

Sample data IDs start with the demo- prefix, and sample data is read-only.
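
Scripts that work with test case or run IDs can use the prefix to tell sample data apart from your own. A minimal sketch, assuming only the demo- naming rule stated above; `is_demo_id` is a hypothetical helper, not an Agent Health command.

```shell
#!/bin/sh
# Hypothetical helper: succeed when an ID belongs to the read-only sample data.
is_demo_id() {
  case "$1" in
    demo-*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Example: skip sample data before attempting any change.
if is_demo_id "demo-otel-001"; then
  echo "demo-otel-001 is sample data (read-only)"
fi
```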

Agent Health Dashboard

The main dashboard displays:

  • Active experiments and their status
  • Recent evaluation runs
  • Quick statistics on pass/fail rates

To run your first evaluation in the UI:

  1. Click Evals in the sidebar
  2. Click New Evaluation
  3. Configure:
    • Agent: Select “Demo Agent”
    • Model: Select “Demo Model”
    • Test Case: Select any Travel Planner scenario
  4. Click Run Evaluation

The agent streams its execution in real time. You’ll see thinking steps, tool calls, and responses, followed by an LLM judge evaluation with a pass/fail status and an accuracy score.

To run an evaluation from the command line instead:

# List available test cases
npx @opensearch-project/agent-health list test-cases
# Run a specific test case
npx @opensearch-project/agent-health run -t demo-otel-001 -a demo
# View the results in the UI (open is macOS-specific; on Linux, use xdg-open)
open http://localhost:4001/runs
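
To run several test cases in one go, the `run` command can be wrapped in a loop. This is a sketch, not a documented feature: `demo_run_command`, `run_demo_cases` and the `DRY_RUN` variable are hypothetical names local to this script, the `-t` and `-a demo` flags are copied from the example above, and it assumes the CLI exits with a non-zero status when a run fails, which this guide does not confirm.

```shell
#!/bin/sh
# Hypothetical batch wrapper around the run command shown in this guide.

# Print the CLI invocation for one test case ID.
demo_run_command() {
  printf 'npx @opensearch-project/agent-health run -t %s -a demo\n' "$1"
}

# Run each given test case in turn and stop at the first failure.
# With DRY_RUN=1 the commands are only printed, not executed.
run_demo_cases() {
  for rd_case_id in "$@"; do
    echo "==> $(demo_run_command "$rd_case_id")"
    if [ "${DRY_RUN:-0}" != "1" ]; then
      npx @opensearch-project/agent-health run -t "$rd_case_id" -a demo || return 1
    fi
  done
}
```

For example, `run_demo_cases demo-otel-001` runs the test case used above. Take further IDs from the output of the `list test-cases` command.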

Click on any evaluation result to view the detailed trajectory:

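| Step Type   | Description                | Example                                                     |
|-------------|----------------------------|-------------------------------------------------------------|
| thinking    | Agent’s internal reasoning | “I need to check the weather forecast…”                     |
| action      | Tool invocation            | searchFlights({ destination: "Paris", dates: "Mar 15-18" }) |
| tool_result | Tool response              | { flights: [...], cheapest: "$450" }                        |
| response    | Final conclusion           | “Here’s your optimized 3-day Paris itinerary…”              |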

Each step shows its timestamp and duration, along with the tool arguments (for action steps), the full tool output (for tool_result steps), and the judge’s evaluation reasoning.
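
A quick way to get a feel for a trajectory is to count its steps by type. The sketch below assumes the step types have already been extracted to plain text, one type per line. That input format is a hypothetical choice made for this example, not an export format Agent Health is known to provide, and `count_steps` is likewise a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: count trajectory steps by type.
# Reads one step type per line on stdin; prints the four types from the
# table above in a fixed order, with 0 for types that never occur.
count_steps() {
  awk '
    { count[$1]++ }
    END {
      n = split("thinking action tool_result response", types, " ")
      for (i = 1; i <= n; i++) printf "%s=%d\n", types[i], count[types[i]]
    }
  '
}

# Example input: two think/act/observe rounds followed by the final answer.
printf '%s\n' thinking action tool_result thinking action tool_result response |
  count_steps
```

A trajectory with many more action steps than tool_result steps, or with no response step at all, is usually worth opening in the UI for a closer look.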

Experiment Detail