Skip to content

AI Observability

The Observability Stack provides end-to-end tooling for AI agent observability - from instrumenting your code to viewing traces, scoring quality, and running evaluations.

flowchart LR
    subgraph instrument["1   Instrument"]
        direction TB
        A1["<b>GenAI SDK</b><br/>@observe, enrich(),<br/>auto-instrument LLMs"]
    end
    subgraph normalize["2 &nbsp; Normalize"]
        direction TB
        B1["<b>OTEL Collector</b><br/>Standardize spans,<br/>semantic conventions"]
    end
    subgraph local["3 &nbsp; Local Tooling"]
        direction TB
        C1["<b>Agent Health</b><br/>Debugging, scoring,<br/>evaluations"]
    end
    subgraph process["4 &nbsp; Process"]
        direction TB
        D1["<b>Data Prepper</b><br/>Service maps, trace<br/>correlation, aggregation"]
    end
    subgraph analyze["5 &nbsp; Analyze"]
        direction TB
        E1["<b>OpenSearch Dashboards</b><br/>Agent traces, graphs,<br/>evals, APM"]
    end
    instrument --> normalize --> process --> analyze
    normalize --> local
    local --> process
  • Agent tracing - visualize LLM agent execution as trace trees, DAG graphs, and timelines
  • GenAI semantic conventions - standard gen_ai.* attributes for model, tokens, tools, and sessions
  • Evaluation & scoring - attach quality scores to traces, run experiments against datasets
  • Trace retrieval - query stored traces from OpenSearch for evaluation pipelines
  • Auto-instrumentation - OpenAI, Anthropic, Bedrock, LangChain, and 20+ libraries traced automatically
  • MCP server - query OpenSearch from AI agents via the built-in Model Context Protocol server

Start here for a hands-on walkthrough from pip install to seeing traces and scoring quality:

  • Getting Started - instrument an agent, view traces, score quality in 5 minutes

Send agent trace data to the observability stack:

Explore traces in OpenSearch Dashboards:

Score agent quality and run experiments:

  • Evaluation & Scoring - score(), evaluate(), Benchmark, trace retrieval
  • Agent Health - Golden Path trajectory comparison, LLM judge scoring, batch experiments via UI and CLI
  • MCP Server - query OpenSearch from AI agents via MCP