
Getting Started

This guide walks you through instrumenting an AI agent with the Python SDK, viewing traces in OpenSearch Dashboards, and scoring agent quality. By the end you’ll have a working observability pipeline for your AI application.

From code to insight, the platform covers the full AI observability lifecycle:

flowchart LR
    subgraph instrument["1 &nbsp; Instrument"]
        direction TB
        A1["<b>GenAI SDK</b><br/>In-code instrumentation,<br/>agents, tools, one-line OTEL,<br/>SigV4 auto-detect"]
    end
    subgraph normalize["2 &nbsp; Normalize"]
        direction TB
        B1["<b>OTEL Collector</b><br/>Standardize spans &<br/>attributes, semantic<br/>conventions, enrichment"]
    end
    subgraph local["3 &nbsp; Local Tooling"]
        direction TB
        C1["<b>Agent Health & Evals</b><br/>Local debugging, scoring,<br/>evaluations, trace inspection"]
    end
    subgraph process["4 &nbsp; Process"]
        direction TB
        D1["<b>OTEL Middleware</b><br/>Metrics, topology map,<br/>service maps, trace<br/>correlation & aggregation"]
    end
    subgraph analyze["5 &nbsp; Analyze"]
        direction TB
        E1["<b>OpenSearch Analytics<br/>& Dashboards</b><br/>Logs, metrics, traces,<br/>agent traces, evals,<br/>APM, correlations"]
    end
    instrument --> normalize --> process --> analyze
    normalize --> local
    local --> process

The GenAI Observability SDK handles the full lifecycle of agent observability:

flowchart LR
    subgraph instrument["Instrument"]
        A1["register() + @observe<br/>One-line setup, trace<br/>agents, tools, LLM calls"]
    end
    subgraph enrich_block["Enrich"]
        B1["enrich()<br/>Model, tokens, provider,<br/>session, tool definitions"]
    end
    subgraph evaluate["Evaluate"]
        C1["score() + evaluate()<br/>Attach quality scores,<br/>run experiments"]
    end
    subgraph analyze["Analyze"]
        D1["OpenSearch Dashboards<br/>Agent traces, graphs,<br/>timelines, PPL queries"]
    end
    instrument --> enrich_block --> evaluate --> analyze
| Capability | Function | What it does |
| --- | --- | --- |
| Pipeline setup | register() | Configures OTEL tracer, exporter, and auto-instrumentation in one call |
| Trace agents & tools | @observe | Decorator/context manager that creates spans with GenAI semantic attributes |
| Enrich spans | enrich() | Sets model, tokens, provider, session ID, and other GenAI attributes on the active span |
| Auto-instrument LLMs | register(auto_instrument=True) | OpenAI, Anthropic, Bedrock, LangChain, and 20+ libraries traced automatically |
| Score traces | score() | Attaches evaluation scores to traces through the OTLP pipeline |
| Run experiments | evaluate() | Runs a task against a dataset with scorer functions, records everything as OTel spans |
| Upload results | Experiment | Uploads pre-computed eval results from RAGAS, DeepEval, pytest, or custom frameworks |
| Query traces | OpenSearchTraceRetriever | Retrieves stored traces from OpenSearch for evaluation pipelines |
| AWS production | AWSSigV4OTLPExporter | SigV4-signed exports to OpenSearch Ingestion or OpenSearch Service |
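The @observe pattern in the table — wrapping a function so each call opens a span, captures input/output, and nests under its caller's span — can be sketched in plain Python with contextvars. This is a toy illustration of the mechanics, not the SDK's actual implementation; `observe_sketch` and `captured_spans` are invented names for demonstration.

```python
import contextvars
import functools

# Toy stand-in for @observe: records one "span" dict per call and
# nests it under the caller's span via a context variable.
_current_span = contextvars.ContextVar("current_span", default=None)
captured_spans = []

def observe_sketch(op):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "op": op,
                "name": fn.__name__,
                "parent": parent["name"] if parent else None,
                "input": args,
            }
            token = _current_span.set(span)
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            finally:
                _current_span.reset(token)
                captured_spans.append(span)
        return wrapper
    return decorator

@observe_sketch("execute_tool")
def search_docs(query):
    return ["doc about " + query]

@observe_sketch("invoke_agent")
def my_agent(question):
    return search_docs(question)[0]

my_agent("OpenSearch")
for s in captured_spans:
    print(s["name"], "-> parent:", s["parent"])
# search_docs -> parent: my_agent
# my_agent -> parent: None
```

The real SDK does the same structurally, but emits OpenTelemetry spans with GenAI semantic attributes instead of dicts.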
Prerequisites

  • Python 3.10+
  • Docker (for the observability stack)
  • An AI agent application (or use the example below)
  1. Install the SDK

    Terminal window
    pip install opensearch-genai-observability-sdk-py

    For auto-instrumentation of your LLM provider:

    Terminal window
    pip install "opensearch-genai-observability-sdk-py[openai]" # or [anthropic], [bedrock], etc.
  2. Start the observability stack

    Terminal window
    git clone https://github.com/opensearch-project/observability-stack.git
    cd observability-stack
    docker compose up -d

    This starts OpenSearch, Data Prepper, OTel Collector, Prometheus, and OpenSearch Dashboards.

  3. Instrument your agent

    Add register() at startup and @observe on your functions:

    from opensearch_genai_observability_sdk_py import register, observe, Op, enrich

    # One-line setup - connects to the local OTel Collector
    register(
        endpoint="http://localhost:4318/v1/traces",
        service_name="my-agent",
    )

    @observe(op=Op.EXECUTE_TOOL)
    def search_docs(query: str) -> list[dict]:
        """Search the knowledge base."""
        return [{"title": "Result 1", "content": "OpenSearch is a search engine"}]

    @observe(op=Op.INVOKE_AGENT)
    def my_agent(question: str) -> str:
        enrich(model="gpt-4o", provider="openai", session_id="session-123")
        docs = search_docs(question)
        answer = f"Based on {len(docs)} docs: {docs[0]['content']}"
        enrich(input_tokens=150, output_tokens=50)
        return answer

    # Run the agent
    result = my_agent("What is OpenSearch?")
    print(result)

    This produces a trace with:

    • A root span invoke_agent my_agent with model, token, and session attributes
    • A child span execute_tool search_docs with tool name and arguments
    • Auto-captured input/output on both spans
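    The values passed to enrich() land as attributes on the active span. As a rough sketch, the root span's attributes might look like the dict below — the key names follow the OpenTelemetry GenAI semantic conventions, but the exact keys this SDK emits may differ by version, so treat them as an assumption:

```python
# Hypothetical root-span attributes after the enrich() calls above;
# key names follow OTel GenAI semantic conventions (assumed, not
# verified against this SDK's output).
root_span_attributes = {
    "gen_ai.operation.name": "invoke_agent",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.system": "openai",
    "gen_ai.usage.input_tokens": 150,
    "gen_ai.usage.output_tokens": 50,
    "session.id": "session-123",
}

total_tokens = (root_span_attributes["gen_ai.usage.input_tokens"]
                + root_span_attributes["gen_ai.usage.output_tokens"])
print(total_tokens)  # 200
```

    These are the attributes the Agent Traces view aggregates into token-usage and latency summaries.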
  4. View traces in OpenSearch Dashboards

    Open http://localhost:5601 and navigate to Observability > Agent Traces.

    You’ll see your agent trace with the full span tree, token usage, latency, and input/output content. Expand the trace to see the tool call nested under the agent invocation.

  5. Score agent quality

    After reviewing a trace, attach a quality score:

    from opensearch_genai_observability_sdk_py import score

    score(
        name="relevance",
        value=0.95,
        trace_id="<trace-id-from-dashboards>",
        label="relevant",
        explanation="Answer correctly references OpenSearch documentation",
    )

    Scores appear as evaluation spans in the same trace, queryable alongside the agent spans.

  6. Run an experiment (optional)

    Test your agent against a dataset with automated scoring:

    from opensearch_genai_observability_sdk_py import evaluate, EvalScore

    def relevance_scorer(input, output, expected) -> EvalScore:
        is_match = expected.lower() in output.lower()
        return EvalScore(name="relevance", value=1.0 if is_match else 0.0)

    result = evaluate(
        name="my_agent_v1",
        task=my_agent,
        data=[
            {"input": "What is OpenSearch?", "expected": "search engine"},
            {"input": "What is OTEL?", "expected": "opentelemetry"},
        ],
        scores=[relevance_scorer],
    )
    print(result.summary)  # avg, min, max per metric
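    The summary is a per-metric aggregate over the per-row scores. Conceptually the reduction looks like the sketch below — an illustration of what avg/min/max per metric means, not the SDK's own code:

```python
# Illustrative aggregation of per-row eval scores into a summary,
# mirroring the avg/min/max that result.summary reports.
scores = [
    {"name": "relevance", "value": 1.0},  # "What is OpenSearch?"
    {"name": "relevance", "value": 0.0},  # "What is OTEL?"
]

by_metric = {}
for s in scores:
    by_metric.setdefault(s["name"], []).append(s["value"])

summary = {
    name: {"avg": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}
    for name, vals in by_metric.items()
}
print(summary)  # {'relevance': {'avg': 0.5, 'min': 0.0, 'max': 1.0}}
```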
Next steps

  • Python SDK reference - full API documentation for register, observe, enrich, and AWS auth
  • Evaluation & Scoring - score(), evaluate(), Experiment, and OpenSearchTraceRetriever in depth
  • Agent Tracing UI - explore traces, graphs, and timelines in OpenSearch Dashboards
  • Agent Health - evaluate agents with Golden Path comparison, LLM judges, and batch experiments