# Evaluation & Scoring
The Python SDK provides three evaluation capabilities that all emit data through the standard OTLP pipeline:
- `score()` - attach quality scores to individual traces or spans
- `evaluate()` - run an agent against a dataset with automated scorer functions
- `Experiment` - upload pre-computed results from any evaluation framework
All evaluation data lands in the same OpenSearch index as your traces, so you can query scores alongside agent spans.
## score() - attach scores to traces

Submits an evaluation score as an OTel span linked to the trace being scored.
```python
from opensearch_genai_observability_sdk_py import score

# Score an entire trace
score(
    name="relevance",
    value=0.92,
    trace_id="abc123def456...",
    explanation="Response addresses the user's query",
)

# Score a specific span
score(
    name="accuracy",
    value=0.95,
    trace_id="abc123def456...",
    span_id="789abc...",
    label="pass",
)

# Standalone score (no trace link)
score(name="baseline", value=0.75, label="acceptable")
```

### Parameters
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Metric name, e.g. `"relevance"`, `"factuality"`. |
| `value` | `float` | Numeric score. |
| `trace_id` | `str` | Hex trace ID of the trace being scored. |
| `span_id` | `str` | Hex span ID for span-level scoring. Omit to score the whole trace. |
| `label` | `str` | Human-readable label, e.g. `"pass"`, `"relevant"`. |
| `explanation` | `str` | Evaluator rationale. Truncated to 500 characters. |
| `response_id` | `str` | LLM completion ID for correlation. |
| `attributes` | `dict` | Additional span attributes. |
### Getting the trace ID

```python
from opentelemetry import trace
from opensearch_genai_observability_sdk_py import observe, Op

@observe(op=Op.INVOKE_AGENT, name="my_agent")
def run(query: str) -> str:
    ctx = trace.get_current_span().get_span_context()
    trace_id = format(ctx.trace_id, "032x")
    # ... agent logic ...
    return result
```

## evaluate() - run experiments
Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel experiment spans.
```python
from opensearch_genai_observability_sdk_py import register, observe, evaluate, Op, EvalScore

register(service_name="my-eval")

@observe(op=Op.INVOKE_AGENT, name="my_agent")
def my_agent(input: str) -> str:
    return f"Answer to: {input}"

def relevance_scorer(input, output, expected) -> EvalScore:
    is_relevant = expected.lower() in output.lower()
    return EvalScore(
        name="relevance",
        value=1.0 if is_relevant else 0.0,
        label="relevant" if is_relevant else "irrelevant",
    )

result = evaluate(
    name="my_agent_v1_eval",
    task=my_agent,
    data=[
        {"input": "What is Python?", "expected": "programming language"},
        {"input": "What is OpenSearch?", "expected": "search engine"},
    ],
    scores=[relevance_scorer],
    metadata={"agent_version": "v1"},
)

print(result.summary)
```

### Parameters
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Experiment name. |
| `task` | `Callable` | Function that takes `input` and returns output. Use `@observe` for full tracing. |
| `data` | `list[dict]` | Test cases: `"input"` (required), `"expected"`, `"case_id"`, `"case_name"` (optional). |
| `scores` | `list[Callable]` | Scorer functions; each receives `(input, output, expected)`. |
| `metadata` | `dict` | Attached to the root experiment span. |
| `record_io` | `bool` | Record input/output/expected as span attributes. Default `False`. |
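Conceptually, `evaluate()` is a loop: call the task on each dataset item, apply each scorer to `(input, output, expected)`, and collect the results. The sketch below models that loop in plain Python to show the data flow; `run_eval`, `Score`, and `exact_match` are illustrative stand-ins, not the SDK's implementation.

```python
from dataclasses import dataclass

@dataclass
class Score:
    """Stand-in for the SDK's EvalScore (name + numeric value only)."""
    name: str
    value: float

def run_eval(task, data, scorers):
    """Minimal model of the evaluate() loop: run the task per case, then score it."""
    results = []
    for case in data:
        output = task(case["input"])
        case_scores = [fn(case["input"], output, case.get("expected")) for fn in scorers]
        results.append({"input": case["input"], "output": output, "scores": case_scores})
    return results

def exact_match(input, output, expected):
    # Pass if the expected substring appears in the output
    return Score(name="exact_match", value=1.0 if expected and expected in output else 0.0)

cases = run_eval(lambda q: f"echo: {q}", [{"input": "hi", "expected": "hi"}], [exact_match])
```

The real `evaluate()` additionally wraps each case in `test_case` spans and aggregates per-metric summaries, as described below.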
### Scorer functions

Scorers return one of:
```python
# Full EvalScore object
def accuracy(input, output, expected) -> EvalScore:
    return EvalScore(name="accuracy", value=0.95, label="pass", explanation="Correct")

# Multiple scores
def multi(input, output, expected) -> list[EvalScore]:
    return [EvalScore(name="relevance", value=0.9), EvalScore(name="coherence", value=0.85)]

# Simple float (function name becomes metric name)
def brevity(input, output, expected) -> float:
    return min(1.0, 100 / max(len(output), 1))
```

### EvalScore dataclass

```python
@dataclass
class EvalScore:
    name: str                       # Metric name
    value: float                    # Numeric score
    label: str | None = None        # Human-readable label
    explanation: str | None = None  # Rationale
    metadata: dict | None = None    # Extra metadata
```

### OTel span structure
Section titled “OTel span structure”flowchart TD
A["test_suite_run - experiment root"] --> B["test_case - case 1"]
A --> C["test_case - case 2"]
B --> D["invoke_agent my_agent"]
B --> E["evaluation result events"]
D --> F["execute_tool ..."]
Agent traces produced by the task become children of the `test_case` spans, giving a full waterfall from the experiment root down to individual LLM calls.
### Result types

```python
result = evaluate(...)
result.summary         # ExperimentSummary
result.summary.scores  # dict[str, ScoreSummary] - avg, min, max, count per metric
result.cases           # list[CaseResult] - per-case input, output, scores, status
```

## Experiment - upload pre-computed results
Use `Experiment` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.
```python
from opensearch_genai_observability_sdk_py import register, Experiment

register(service_name="eval-upload")

with Experiment("ragas_eval_v2", metadata={"framework": "ragas"}) as exp:
    exp.log(
        input="What is OpenSearch?",
        output="OpenSearch is an open-source search engine.",
        expected="search and analytics engine",
        scores={"faithfulness": 0.92, "relevance": 0.88},
        case_name="opensearch_definition",
    )
    exp.log(
        input="How does RAG work?",
        output="RAG retrieves documents then generates answers.",
        scores={"faithfulness": 0.95, "relevance": 0.91},
        case_name="rag_explanation",
    )
# summary printed on close
```

### log() parameters
| Parameter | Type | Description |
|---|---|---|
| `input` | any | Test case input. |
| `output` | any | Agent output. |
| `expected` | any | Ground truth. |
| `scores` | `dict[str, float]` | Pre-computed scores. |
| `metadata` | `dict` | Per-case metadata. |
| `error` | `str` | Error message (sets status to `"fail"`). |
| `case_id` | `str` | Explicit ID. Defaults to SHA-256 of input. |
| `case_name` | `str` | Human-readable name. |
| `trace_id` | `str` | Creates an OTel span link to an agent trace. |
| `span_id` | `str` | Span-level linking (with `trace_id`). |
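The `case_id` default (a SHA-256 hash of the input) can be sketched as below. The exact serialization the SDK hashes is an assumption here, and `default_case_id` is a hypothetical helper, not an SDK export:

```python
import hashlib

def default_case_id(input_value) -> str:
    # Assumed behavior: hash the stringified input so the same case
    # gets a stable ID across runs (the exact serialization is a guess)
    return hashlib.sha256(str(input_value).encode("utf-8")).hexdigest()
```

The practical consequence: logging the same input twice without an explicit `case_id` yields the same case identity across runs.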
## A/B comparison

Run the same dataset against different agent versions:
```python
result_a = evaluate(name="comparison", task=agent_v1, data=cases, scores=[accuracy], metadata={"version": "v1"})
result_b = evaluate(name="comparison", task=agent_v2, data=cases, scores=[accuracy], metadata={"version": "v2"})

print(f"V1: {result_a.summary.scores['accuracy'].avg:.2f}")
print(f"V2: {result_b.summary.scores['accuracy'].avg:.2f}")
```

## OpenSearchTraceRetriever - query stored traces
Retrieves traces from OpenSearch for building evaluation pipelines. Requires the `[opensearch]` extra:
```shell
pip install "opensearch-genai-observability-sdk-py[opensearch]"
```

```python
from opensearch_genai_observability_sdk_py import OpenSearchTraceRetriever

retriever = OpenSearchTraceRetriever(
    host="https://localhost:9200",
    auth=("admin", "admin"),
    verify_certs=False,
)
```

### Constructor parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `host` | `str` | `"https://localhost:9200"` | OpenSearch endpoint URL. |
| `index` | `str` | `"otel-v1-apm-span-*"` | Index pattern for span data. |
| `auth` | `tuple` or `RequestsAWSV4SignerAuth` | - | `("user", "pass")` for basic auth, or a SigV4 signer for AWS. |
| `verify_certs` | `bool` | `True` | Whether to verify TLS certificates. |
### Methods

`list_root_spans()` - find recent agent traces:
```python
roots = retriever.list_root_spans(services=["my-agent"], max_results=20)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `services` | `list[str]` | `None` | Filter by service names. |
| `since` | `datetime` | 15 min ago | Only traces started after this time. |
| `max_results` | `int` | `50` | Maximum root spans to return. |
Returns `list[SpanRecord]`, one record per trace (the root span with no parent).
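To make the root-span definition concrete, here is a sketch over plain dicts (the SDK returns `SpanRecord` objects with attribute access): a root span is the one whose `parent_span_id` is empty.

```python
# Illustrative span documents; the SDK returns SpanRecord objects instead
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": "", "name": "invoke_agent my_agent"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "chat"},
]
roots = [s for s in spans if not s["parent_span_id"]]  # one root per trace
```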
`get_traces()` - fetch a full trace with all of its spans:
```python
session = retriever.get_traces(roots[0].trace_id)
for trace in session.traces:
    for span in trace.spans:
        print(f"  {span.name} | {span.operation_name} | tokens: {span.input_tokens}")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `identifier` | `str` | - | A conversation/session ID or a trace ID. Queries by `gen_ai.conversation.id` first, then falls back to `traceId`. |
| `max_spans` | `int` | `10000` | Maximum spans to fetch. |
Returns a `SessionRecord`.
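Because a `SessionRecord` nests traces and spans, session-level aggregates are two nested loops. A sketch with plain-dict stand-ins (the real objects use attribute access, e.g. `span.input_tokens`):

```python
# Dict stand-ins for SessionRecord -> TraceRecord -> SpanRecord
session = {
    "traces": [
        {"spans": [{"input_tokens": 120, "output_tokens": 40},
                   {"input_tokens": 80, "output_tokens": 25}]},
    ]
}
total_in = sum(s["input_tokens"] for t in session["traces"] for s in t["spans"])
total_out = sum(s["output_tokens"] for t in session["traces"] for s in t["spans"])
```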
`find_evaluated_trace_ids()` - filter out already-scored traces:

```python
evaluated = retriever.find_evaluated_trace_ids([s.trace_id for s in roots])
to_score = [s for s in roots if s.trace_id not in evaluated]
```

Returns `set[str]` - the trace IDs that already have an evaluation span.
### Return types

`SpanRecord` - normalised view of one span:
```python
@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_span_id: str
    name: str                            # Span name, e.g. "invoke_agent my_agent"
    start_time: str
    end_time: str
    operation_name: str                  # "invoke_agent" | "execute_tool" | "chat"
    agent_name: str = ""                 # gen_ai.agent.name
    model: str = ""                      # gen_ai.request.model
    input_messages: list[Message] = []   # Parsed gen_ai.input.messages
    output_messages: list[Message] = []  # Parsed gen_ai.output.messages
    tool_name: str = ""                  # gen_ai.tool.name
    tool_call_arguments: str = ""        # gen_ai.tool.call.arguments
    tool_call_result: str = ""           # gen_ai.tool.call.result
    input_tokens: int = 0                # gen_ai.usage.input_tokens
    output_tokens: int = 0               # gen_ai.usage.output_tokens
    raw: dict = {}                       # Original OpenSearch document
```

`Message` - a single user or assistant message:

```python
@dataclass
class Message:
    role: str  # "user", "assistant", etc.
    content: str
```

`TraceRecord` - all spans sharing a single trace ID:

```python
@dataclass
class TraceRecord:
    trace_id: str
    spans: list[SpanRecord] = []
```

`SessionRecord` - all traces for a session/conversation:

```python
@dataclass
class SessionRecord:
    session_id: str
    traces: list[TraceRecord] = []
    truncated: bool = False  # True if max_spans was reached
```

## Complete evaluation pipeline
```python
from opensearch_genai_observability_sdk_py import register, score, OpenSearchTraceRetriever

register(service_name="eval-pipeline")
retriever = OpenSearchTraceRetriever(host="https://localhost:9200", auth=("admin", "admin"), verify_certs=False)

# Find un-evaluated traces → score them
roots = retriever.list_root_spans(services=["my-agent"])
evaluated = retriever.find_evaluated_trace_ids([s.trace_id for s in roots])

for root in roots:
    if root.trace_id not in evaluated:
        session = retriever.get_traces(root.trace_id)
        relevance = compute_relevance(session.traces[0].spans)  # compute_relevance is your own scoring function
        score(name="relevance", value=relevance, trace_id=root.trace_id)
```

## AWS authentication
Section titled “AWS authentication”from opensearchpy import RequestsAWSV4SignerAuthimport boto3
auth = RequestsAWSV4SignerAuth(boto3.Session().get_credentials(), "us-east-1", "es")retriever = OpenSearchTraceRetriever( host="https://search-my-domain.us-east-1.es.amazonaws.com", auth=auth,)Related links
- Evaluation Integrations - use DeepEval, RAGAS, MLflow, pytest with the observability stack
- Python SDK reference - `register`, `observe`, `enrich` documentation
- Agent Tracing UI - explore traces in OpenSearch Dashboards
- Agent Health - Experiments - UI and CLI-based experiment workflows