
Test Cases

A test case (displayed as “Use Case” in the UI) defines an evaluation scenario for your agent. Each test case includes a prompt, supporting context, and expected outcomes that the LLM judge uses for scoring.

| Field | Description |
| --- | --- |
| Name | Descriptive title for the scenario |
| Initial Prompt | The question or task sent to the agent |
| Context | Supporting data the agent needs (logs, metrics, architecture info) |
| Expected Outcomes | List of what the agent should discover or accomplish |
| Labels | Categorization tags (e.g., `category:RCA`, `difficulty:Medium`) |
To create a test case in the UI:

  1. Go to Settings > Use Cases
  2. Click New Use Case
  3. Fill in the form:
    • Name: A descriptive scenario title
    • Initial Prompt: The question for your agent
    • Context: Supporting data your agent needs
    • Expected Outcomes: What the agent should accomplish
    • Labels: Tags like category:MyCategory, difficulty:Medium
  4. Click Save


Test cases can be defined in a JSON file for import:

```json
[
  {
    "name": "My Test Case",
    "category": "RCA",
    "difficulty": "Medium",
    "initialPrompt": "Investigate the latency spike...",
    "expectedOutcomes": ["Identifies database as root cause"],
    "context": [
      { "description": "Error logs", "value": "..." }
    ]
  }
]
```
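For larger suites, the import file can be generated with a short script instead of written by hand. A minimal Python sketch, assuming the field names from the example above (the scenario content here is hypothetical):

```python
import json

# Hypothetical scenarios; field names follow the import format shown above.
scenarios = [
    {
        "name": "DB latency spike",
        "category": "RCA",
        "difficulty": "Medium",
        "initialPrompt": "Investigate the latency spike in the checkout service.",
        "expectedOutcomes": ["Identifies database as root cause"],
        "context": [{"description": "Error logs", "value": "..."}],
    },
]

# Write the file that `benchmark -f` will consume.
with open("test-cases.json", "w") as f:
    json.dump(scenarios, f, indent=2)
```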

Import test cases from a file and run a benchmark in a single command:

```sh
# Import and benchmark
npx @opensearch-project/agent-health benchmark -f ./test-cases.json -a my-agent

# Export from an existing benchmark
npx @opensearch-project/agent-health export -b "My Benchmark" -o test-cases.json
```

The export format is compatible with import, so you can round-trip test cases between benchmarks:

```sh
# Export from one benchmark
npx @opensearch-project/agent-health export -b my-benchmark -o test-cases.json

# Import into a new benchmark run
npx @opensearch-project/agent-health benchmark -f test-cases.json -a another-agent
```
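Because export and import share the same schema, you can also edit the file between the two steps, for example to retag cases before the next run. A hedged Python sketch (the inline sample data and the `difficulty` values stand in for whatever your export actually contains):

```python
import json

# Sample of what an exported file might contain (same schema as the import
# example above; real exports will have more fields).
cases = [
    {"name": "DB latency spike", "difficulty": "Medium",
     "initialPrompt": "...", "expectedOutcomes": ["..."], "context": []},
]

# Example edit: promote every Medium case to Hard before re-importing.
for case in cases:
    if case.get("difficulty") == "Medium":
        case["difficulty"] = "Hard"

# Write a new file to feed back into `benchmark -f`.
with open("test-cases-hard.json", "w") as f:
    json.dump(cases, f, indent=2)
```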
When writing test cases, keep these guidelines in mind:

  • Make prompts specific and unambiguous — avoid vague instructions
  • Include all necessary context data — the agent shouldn’t need to guess
  • Define clear, measurable expected outcomes — the judge needs concrete criteria
  • Start with simple cases, add complexity gradually — build confidence before testing edge cases
  • Use labels for organization — filter and group test cases by category, difficulty, or domain
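The label-based grouping suggested above can also be applied to the JSON file itself. A sketch assuming labels are expressed as the `category`/`difficulty` fields from the import example (the sample cases are hypothetical):

```python
import json

# Sample cases using the category/difficulty fields from the import format above.
cases = [
    {"name": "DB latency spike", "category": "RCA", "difficulty": "Medium"},
    {"name": "Disk alert triage", "category": "Alerting", "difficulty": "Easy"},
    {"name": "Cascading failure", "category": "RCA", "difficulty": "Hard"},
]

# Group case names by category so related scenarios can be benchmarked together.
by_category = {}
for case in cases:
    by_category.setdefault(case["category"], []).append(case["name"])

print(json.dumps(by_category, indent=2))
```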