
# Evaluation & Scoring

The Python SDK provides three evaluation capabilities that all emit data through the standard OTLP pipeline:

- `score()` - attach quality scores to individual traces or spans
- `evaluate()` - run an agent against a dataset with automated scorer functions
- `Experiment` - upload pre-computed results from any evaluation framework

All evaluation data lands in the same OpenSearch index as your traces, so you can query scores alongside agent spans.
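
For example, you can pull the spans for a trace - agent spans and evaluation spans alike - directly from that index. A rough sketch using the plain `opensearch-py` client, assuming the default `otel-v1-apm-span-*` pattern and the `traceId` field that the retriever described below also queries; the trace ID value is a placeholder:

```python
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200",
                    http_auth=("admin", "admin"), verify_certs=False)

# Fetch every span stored for one trace, evaluation spans included
resp = client.search(index="otel-v1-apm-span-*", body={
    "query": {"term": {"traceId": "abc123def456..."}},
    "size": 100,
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("name"))
```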


## score()

Submits an evaluation score as an OTel span linked to the trace being scored.

```python
from opensearch_genai_observability_sdk_py import score

# Score an entire trace
score(
    name="relevance",
    value=0.92,
    trace_id="abc123def456...",
    explanation="Response addresses the user's query",
)

# Score a specific span
score(
    name="accuracy",
    value=0.95,
    trace_id="abc123def456...",
    span_id="789abc...",
    label="pass",
)

# Standalone score (no trace link)
score(name="baseline", value=0.75, label="acceptable")
```

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | Metric name, e.g. `"relevance"`, `"factuality"`. |
| `value` | `float` | Numeric score. |
| `trace_id` | `str` | Hex trace ID of the trace being scored. |
| `span_id` | `str` | Hex span ID for span-level scoring. Omit to score the whole trace. |
| `label` | `str` | Human-readable label, e.g. `"pass"`, `"relevant"`. |
| `explanation` | `str` | Evaluator rationale. Truncated to 500 characters. |
| `response_id` | `str` | LLM completion ID for correlation. |
| `attributes` | `dict` | Additional span attributes. |

To score a live trace, capture its ID from inside the `@observe`-decorated function:

```python
from opentelemetry import trace
from opensearch_genai_observability_sdk_py import observe, Op

@observe(op=Op.INVOKE_AGENT, name="my_agent")
def run(query: str) -> str:
    ctx = trace.get_current_span().get_span_context()
    trace_id = format(ctx.trace_id, "032x")  # hex trace ID for score()
    # ... agent logic producing `result` ...
    return result
```
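
One way to tie the two together - a minimal sketch in which `run()` is modified to hand back the captured trace ID along with its result (the tuple return is an illustration, not an SDK convention):

```python
from opentelemetry import trace
from opensearch_genai_observability_sdk_py import observe, score, Op

@observe(op=Op.INVOKE_AGENT, name="my_agent")
def run(query: str) -> tuple[str, str]:
    ctx = trace.get_current_span().get_span_context()
    trace_id = format(ctx.trace_id, "032x")
    result = f"Answer to: {query}"  # placeholder agent logic
    return result, trace_id

answer, trace_id = run("What is OpenSearch?")
# Score the trace that run() just produced
score(name="relevance", value=0.9, trace_id=trace_id,
      explanation="Answer mentions the queried product")
```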

## evaluate()

Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel experiment spans.

```python
from opensearch_genai_observability_sdk_py import register, observe, evaluate, Op, EvalScore

register(service_name="my-eval")

@observe(op=Op.INVOKE_AGENT, name="my_agent")
def my_agent(input: str) -> str:
    return f"Answer to: {input}"

def relevance_scorer(input, output, expected) -> EvalScore:
    is_relevant = expected.lower() in output.lower()
    return EvalScore(
        name="relevance",
        value=1.0 if is_relevant else 0.0,
        label="relevant" if is_relevant else "irrelevant",
    )

result = evaluate(
    name="my_agent_v1_eval",
    task=my_agent,
    data=[
        {"input": "What is Python?", "expected": "programming language"},
        {"input": "What is OpenSearch?", "expected": "search engine"},
    ],
    scores=[relevance_scorer],
    metadata={"agent_version": "v1"},
)
print(result.summary)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | Experiment name. |
| `task` | `Callable` | Function that takes an input and returns an output. Use `@observe` for full tracing. |
| `data` | `list[dict]` | Test cases: `"input"` (required); `"expected"`, `"case_id"`, `"case_name"` (optional). |
| `scores` | `list[Callable]` | Scorer functions; each receives `(input, output, expected)`. |
| `metadata` | `dict` | Attached to the root experiment span. |
| `record_io` | `bool` | Record input/output/expected as span attributes. Default `False`. |

Scorers return one of:

```python
# Full EvalScore object
def accuracy(input, output, expected) -> EvalScore:
    return EvalScore(name="accuracy", value=0.95, label="pass", explanation="Correct")

# Multiple scores
def multi(input, output, expected) -> list[EvalScore]:
    return [EvalScore(name="relevance", value=0.9), EvalScore(name="coherence", value=0.85)]

# Simple float (the function name becomes the metric name)
def brevity(input, output, expected) -> float:
    return min(1.0, 100 / max(len(output), 1))
```

`EvalScore` is a plain dataclass:

```python
@dataclass
class EvalScore:
    name: str                        # Metric name
    value: float                     # Numeric score
    label: str | None = None         # Human-readable label
    explanation: str | None = None   # Rationale
    metadata: dict | None = None     # Extra metadata
```

Each `evaluate()` run produces this span hierarchy:

```mermaid
flowchart TD
    A["test_suite_run - experiment root"] --> B["test_case - case 1"]
    A --> C["test_case - case 2"]
    B --> D["invoke_agent my_agent"]
    B --> E["evaluation result events"]
    D --> F["execute_tool ..."]
```

Agent traces from the task become children of the `test_case` spans, so you get a full waterfall from the experiment root down to individual LLM calls.

```python
result = evaluate(...)
result.summary          # ExperimentSummary
result.summary.scores   # dict[str, ScoreSummary] - avg, min, max, count per metric
result.cases            # list[CaseResult] - per-case input, output, scores, status
```
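
A quick report over the result object, using only the fields listed above:

```python
# Per-metric aggregates from the summary
for metric, s in result.summary.scores.items():
    print(f"{metric}: avg={s.avg:.2f} min={s.min:.2f} max={s.max:.2f} (n={s.count})")

# Dump each case's scores for inspection
for case in result.cases:
    print(case.input, "->", case.output, case.scores)
```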

## Experiment

Use `Experiment` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.

```python
from opensearch_genai_observability_sdk_py import register, Experiment

register(service_name="eval-upload")

with Experiment("ragas_eval_v2", metadata={"framework": "ragas"}) as exp:
    exp.log(
        input="What is OpenSearch?",
        output="OpenSearch is an open-source search engine.",
        expected="search and analytics engine",
        scores={"faithfulness": 0.92, "relevance": 0.88},
        case_name="opensearch_definition",
    )
    exp.log(
        input="How does RAG work?",
        output="RAG retrieves documents then generates answers.",
        scores={"faithfulness": 0.95, "relevance": 0.91},
        case_name="rag_explanation",
    )
# summary printed on close
```

`exp.log()` accepts:

| Parameter | Type | Description |
| --- | --- | --- |
| `input` | `any` | Test case input. |
| `output` | `any` | Agent output. |
| `expected` | `any` | Ground truth. |
| `scores` | `dict[str, float]` | Pre-computed scores. |
| `metadata` | `dict` | Per-case metadata. |
| `error` | `str` | Error message (sets status to `"fail"`). |
| `case_id` | `str` | Explicit ID. Defaults to a SHA-256 of the input. |
| `case_name` | `str` | Human-readable name. |
| `trace_id` | `str` | Creates an OTel span link to an agent trace. |
| `span_id` | `str` | Span-level linking (with `trace_id`). |
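
Two of these parameters are easy to miss: `error` marks a case as failed, and `trace_id` links a case back to the agent trace it evaluates. A short sketch (the trace ID is a placeholder):

```python
from opensearch_genai_observability_sdk_py import Experiment

with Experiment("ragas_eval_v3") as exp:
    # A case whose evaluation errored - status becomes "fail"
    exp.log(
        input="What is OpenSearch?",
        error="judge model timed out",
        case_name="timeout_case",
    )
    # A case linked to the agent trace it evaluates
    exp.log(
        input="How does RAG work?",
        output="RAG retrieves documents then generates answers.",
        scores={"faithfulness": 0.95},
        trace_id="abc123def456...",  # placeholder hex trace ID
        case_name="rag_explanation",
    )
```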

### Comparing agent versions

Run the same dataset against different agent versions:

```python
result_a = evaluate(name="comparison", task=agent_v1, data=cases, scores=[accuracy],
                    metadata={"version": "v1"})
result_b = evaluate(name="comparison", task=agent_v2, data=cases, scores=[accuracy],
                    metadata={"version": "v2"})

print(f"V1: {result_a.summary.scores['accuracy'].avg:.2f}")
print(f"V2: {result_b.summary.scores['accuracy'].avg:.2f}")
```

## OpenSearchTraceRetriever - query stored traces


Retrieves traces from OpenSearch for building evaluation pipelines. Requires the `[opensearch]` extra:

```bash
pip install "opensearch-genai-observability-sdk-py[opensearch]"
```

```python
from opensearch_genai_observability_sdk_py import OpenSearchTraceRetriever

retriever = OpenSearchTraceRetriever(
    host="https://localhost:9200",
    auth=("admin", "admin"),
    verify_certs=False,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `host` | `str` | `"https://localhost:9200"` | OpenSearch endpoint URL. |
| `index` | `str` | `"otel-v1-apm-span-*"` | Index pattern for span data. |
| `auth` | `tuple \| RequestsAWSV4SignerAuth` | | `("user", "pass")` for basic auth, or a SigV4 signer for AWS. |
| `verify_certs` | `bool` | `True` | Whether to verify TLS certificates. |

### list_root_spans() - find recent agent traces

```python
roots = retriever.list_root_spans(services=["my-agent"], max_results=20)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `services` | `list[str]` | `None` | Filter by service names. |
| `since` | `datetime` | 15 min ago | Only traces started after this time. |
| `max_results` | `int` | `50` | Maximum root spans to return. |

Returns `list[SpanRecord]` - one per trace (the root span with no parent).
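
To look further back than the default 15-minute window, pass an explicit `since` cutoff; this sketch assumes timezone-aware datetimes are accepted:

```python
from datetime import datetime, timedelta, timezone

# Look back one hour instead of the default 15 minutes
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
roots = retriever.list_root_spans(services=["my-agent"], since=cutoff)
```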

### get_traces() - fetch a full trace with all its spans

```python
session = retriever.get_traces(roots[0].trace_id)
for trace in session.traces:
    for span in trace.spans:
        print(f"  {span.name} | {span.operation_name} | tokens: {span.input_tokens}")
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `identifier` | `str` | (required) | A conversation/session ID or trace ID. Queries by `gen_ai.conversation.id` first, falls back to `traceId`. |
| `max_spans` | `int` | `10000` | Maximum spans to fetch. |

Returns a `SessionRecord`.
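
Long sessions can hit the span cap, so it is worth checking the `truncated` flag on the returned `SessionRecord` (defined below) and, assuming `max_spans` is accepted as a keyword argument, retrying with more headroom:

```python
session = retriever.get_traces(roots[0].trace_id)
if session.truncated:
    # The default cap of 10000 spans was hit; retry with a higher limit
    session = retriever.get_traces(roots[0].trace_id, max_spans=50000)
```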

### find_evaluated_trace_ids() - filter out already-scored traces

```python
evaluated = retriever.find_evaluated_trace_ids([s.trace_id for s in roots])
to_score = [s for s in roots if s.trace_id not in evaluated]
```

Returns `set[str]` - trace IDs that already have an evaluation span.

`SpanRecord` - a normalised view of one span:

```python
from dataclasses import dataclass, field

@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_span_id: str
    name: str                      # Span name, e.g. "invoke_agent my_agent"
    start_time: str
    end_time: str
    operation_name: str            # "invoke_agent" | "execute_tool" | "chat"
    agent_name: str = ""           # gen_ai.agent.name
    model: str = ""                # gen_ai.request.model
    input_messages: list[Message] = field(default_factory=list)   # Parsed gen_ai.input.messages
    output_messages: list[Message] = field(default_factory=list)  # Parsed gen_ai.output.messages
    tool_name: str = ""            # gen_ai.tool.name
    tool_call_arguments: str = ""  # gen_ai.tool.call.arguments
    tool_call_result: str = ""     # gen_ai.tool.call.result
    input_tokens: int = 0          # gen_ai.usage.input_tokens
    output_tokens: int = 0         # gen_ai.usage.output_tokens
    raw: dict = field(default_factory=dict)  # Original OpenSearch document
```

`Message` - a single user or assistant message:

```python
@dataclass
class Message:
    role: str     # "user", "assistant", etc.
    content: str
```
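
Together, `SpanRecord` and `Message` make it easy to reconstruct a conversation for an offline judge. A sketch that flattens one trace into a transcript, using only the fields defined above (`flatten_transcript` is a hypothetical helper, not part of the SDK):

```python
def flatten_transcript(spans: list[SpanRecord]) -> str:
    """Collect user/assistant messages from chat spans into one transcript."""
    lines: list[str] = []
    for span in spans:
        if span.operation_name != "chat":
            continue
        for msg in span.input_messages + span.output_messages:
            lines.append(f"{msg.role}: {msg.content}")
    return "\n".join(lines)

session = retriever.get_traces(roots[0].trace_id)
transcript = flatten_transcript(session.traces[0].spans)
```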

`TraceRecord` - all spans sharing a single trace ID:

```python
@dataclass
class TraceRecord:
    trace_id: str
    spans: list[SpanRecord] = field(default_factory=list)
```

`SessionRecord` - all traces for a session/conversation:

```python
@dataclass
class SessionRecord:
    session_id: str
    traces: list[TraceRecord] = field(default_factory=list)
    truncated: bool = False   # True if max_spans was reached
```

Putting the pieces together - find un-evaluated traces and score them:

```python
from opensearch_genai_observability_sdk_py import register, score, OpenSearchTraceRetriever

register(service_name="eval-pipeline")
retriever = OpenSearchTraceRetriever(host="https://localhost:9200",
                                     auth=("admin", "admin"), verify_certs=False)

# Find un-evaluated traces → score them
roots = retriever.list_root_spans(services=["my-agent"])
evaluated = retriever.find_evaluated_trace_ids([s.trace_id for s in roots])

for root in roots:
    if root.trace_id not in evaluated:
        session = retriever.get_traces(root.trace_id)
        relevance = compute_relevance(session.traces[0].spans)  # your own scoring logic
        score(name="relevance", value=relevance, trace_id=root.trace_id)
```

For AWS-managed OpenSearch, pass a SigV4 signer instead of basic-auth credentials:

```python
import boto3
from opensearchpy import RequestsAWSV4SignerAuth

auth = RequestsAWSV4SignerAuth(boto3.Session().get_credentials(), "us-east-1", "es")
retriever = OpenSearchTraceRetriever(
    host="https://search-my-domain.us-east-1.es.amazonaws.com", auth=auth,
)
```