
# PPL Observability Examples

import { Tabs, TabItem, Aside } from '@astrojs/starlight/components';

These examples use real OpenTelemetry data from the Observability Stack. Each query runs against the live [playground](<https://observability.playground.opensearch.org/w/19jD-R/app/explore/logs/#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-6h,to:now))&_q=(dataset:(id:d1f424b0-2655-11f1-8baa-d5b726b04d73,timeFieldName:time,title:'logs-otel-v1*',type:INDEX_PATTERN),language:PPL,query:'')&_a=(legacy:(columns:!(body,severityText,resource.attributes.service.name),interval:auto,isDirty:!f,sort:!()),tab:(logs:(),patterns:(usingRegexPatterns:!f)),ui:(activeTabId:logs,showHistogram:!t))>) - click "Try in playground" to run any query instantly.

The Observability Stack uses these OpenTelemetry index patterns:

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Logs | `logs-otel-v1*` | time, body, severityText, severityNumber, traceId, spanId, resource.attributes.service.name |
| Traces | `otel-v1-apm-span-*` | traceId, spanId, parentSpanId, serviceName, name, durationInNanos, startTime, endTime, status.code |
| Service Map | `otel-v2-apm-service-map-*` | serviceName, destination.domain, destination.resource, traceGroupName |
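Beyond the playground UI, any PPL query can also be sent to a self-hosted cluster through the PPL plugin's `POST /_plugins/_ppl` REST endpoint. A minimal sketch in Python - the endpoint URL and the absence of auth are assumptions for illustration:

```python
import json

# Hypothetical self-hosted cluster endpoint -- the PPL plugin exposes
# POST /_plugins/_ppl; host, port, TLS, and auth are up to your setup.
PPL_ENDPOINT = "https://localhost:9200/_plugins/_ppl"

def build_ppl_request(query: str):
    """Return (url, headers, body) for a PPL query against the REST API."""
    body = json.dumps({"query": query}).encode()
    return PPL_ENDPOINT, {"Content-Type": "application/json"}, body

url, headers, body = build_ppl_request(
    "source = logs-otel-v1* | stats count() as log_count "
    "by `resource.attributes.service.name` | sort - log_count"
)
print(url)
```

Send the resulting request with any HTTP client; the response is JSON with a `schema` and `datarows` section.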

See which services are generating the most logs.

```
| stats count() as log_count by `resource.attributes.service.name`
| sort - log_count
```

Try in playground →

Filter for high-severity logs across all services.

```
| where severityText = 'ERROR' or severityText = 'FATAL'
| sort - time
```

Try in playground →

Calculate the error percentage for each service.

```
| stats count() as total,
    sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors
  by `resource.attributes.service.name`
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

Try in playground →
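The `sum(case(..., 1 else 0)) / count()` idiom above is just a conditional count divided by a total. A local replica in Python, using invented severity values:

```python
# Toy severity values (invented) mirroring the PPL pipeline:
# sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0))
rows = ["INFO", "ERROR", "INFO", "FATAL", "INFO", "INFO", "ERROR", "INFO"]
total = len(rows)
errors = sum(1 if s in ("ERROR", "FATAL") else 0 for s in rows)
error_rate = round(errors * 100.0 / total, 2)  # eval error_rate = ...
print(error_rate)
```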

Time-bucketed log volume - great for spotting traffic spikes.

```
| stats count() as volume by span(time, 5m) as time_bucket
```

Try in playground →

Distribution of log levels per service.

```
| stats count() as cnt by `resource.attributes.service.name`, severityText
| sort `resource.attributes.service.name`, - cnt
```

Try in playground →

Quick view of the noisiest services.

```
| top 10 `resource.attributes.service.name`
```

Try in playground →

Automatically cluster similar log messages - no regex required.

```
| patterns body
```

Try in playground →

Get one representative log per service.

```
| dedup `resource.attributes.service.name`
```

Try in playground →


Find the operations with the highest latency.

```
source = otel-v1-apm-span-*
| eval duration_ms = durationInNanos / 1000000
| sort - duration_ms
| head 20
```

Try in Playground

Find all spans with error status.

```
source = otel-v1-apm-span-*
| where status.code = 2
| sort - startTime
| head 20
```

Try in Playground

P50, P95, P99 latency for each service.

```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns,
    percentile(durationInNanos, 50) as p50_ns,
    percentile(durationInNanos, 95) as p95_ns,
    percentile(durationInNanos, 99) as p99_ns,
    count() as span_count
  by serviceName
| eval p50_ms = round(p50_ns / 1000000, 1),
    p95_ms = round(p95_ns / 1000000, 1),
    p99_ms = round(p99_ns / 1000000, 1)
| sort - p99_ms
```

Try in Playground
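Two things worth internalizing from this query: what a percentile is, and the nanoseconds-to-milliseconds conversion. A minimal sketch using the nearest-rank convention (PPL's exact interpolation may differ slightly) and invented durations:

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy -- one common
    convention; PPL's exact interpolation may differ slightly."""
    s = sorted(values)
    k = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[k]

# Invented span durations in nanoseconds; one slow outlier.
durations_ns = [1e6, 2e6, 3e6, 4e6, 100e6]
p95_ms = round(percentile(durations_ns, 95) / 1_000_000, 1)
print(p95_ms)
```

Note how the average would hide the outlier while p95/p99 surface it - which is why the query sorts by `p99_ms`.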

Error rate calculated from span status codes.

```
source = otel-v1-apm-span-*
| stats count() as total,
    sum(case(status.code = 2, 1 else 0)) as errors
  by serviceName
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

Try in Playground

How many spans does each trace produce? High fan-out can indicate N+1 queries or excessive tool calls.

```
source = otel-v1-apm-span-*
| stats count() as span_count by traceId
| sort - span_count
| head 20
```

Try in Playground

What operations does each service perform?

```
source = otel-v1-apm-span-*
| stats count() as invocations, avg(durationInNanos) as avg_latency by serviceName, name
| sort serviceName, - invocations
```

Try in Playground


These queries leverage the OpenTelemetry GenAI Semantic Conventions attributes that the Observability Stack captures for AI agent telemetry.

See what types of AI operations are occurring.

```
| stats count() as operations by `resource.attributes.service.name`, `attributes.gen_ai.operation.name`
```

Try in playground →

Track LLM token consumption across agents.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens,
    count() as calls
  by serviceName
| eval total_tokens = input_tokens + output_tokens
| sort - total_tokens
```

Try in Playground

Monitor token consumption trends.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens
  by span(startTime, 5m) as time_bucket
```

Try in Playground

Which AI systems are being used and how often?

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.system`)
| stats count() as requests,
    sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens
  by `attributes.gen_ai.system`
| sort - requests
```

Try in Playground

See which tools agents are calling and their performance.

```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'execute_tool'
| stats count() as executions,
    avg(durationInNanos) as avg_latency,
    max(durationInNanos) as max_latency
  by `attributes.gen_ai.tool.name`, serviceName
| eval avg_ms = round(avg_latency / 1000000, 1)
| sort - executions
```

Try in Playground

End-to-end latency for agent invocations.

```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'invoke_agent'
| eval duration_ms = durationInNanos / 1000000
| stats avg(duration_ms) as avg_ms,
    percentile(duration_ms, 95) as p95_ms,
    count() as invocations
  by serviceName, `attributes.gen_ai.agent.name`
| sort - p95_ms
```

Try in Playground

Find agent operations that resulted in errors.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`) and status.code = 2
| sort - startTime
| head 20
```

Try in Playground


Track the overall error rate trend - spot the moment things started going wrong.

```
| stats count() as total,
    sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors
  by span(time, 5m) as time_bucket
| eval error_pct = round(errors * 100.0 / total, 2)
| sort time_bucket
```

Try in playground →

Find when each service first started erroring - pinpoint the origin of an incident.

```
| where severityText = 'ERROR'
| stats earliest(time) as first_seen, count() as total_errors by `resource.attributes.service.name`
| sort first_seen
```

Try in playground →

Visualize error spikes per service over time - the Splunk-style timechart equivalent.

```
| where severityText = 'ERROR'
| timechart span=5m count() by `resource.attributes.service.name`
```

Try in playground →

Track latency degradation over time - the core SRE golden signal.

```
source = otel-v1-apm-span-*
| stats percentile(durationInNanos, 95) as p95_ns by span(startTime, 5m) as time_bucket, serviceName
| eval p95_ms = round(p95_ns / 1000000, 1)
| sort time_bucket
```

Try in Playground

Find the most expensive operations to target for optimization.

```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns, percentile(durationInNanos, 95) as p95_ns, count() as calls by serviceName, name
| eval avg_ms = round(avg_ns / 1000000, 1), p95_ms = round(p95_ns / 1000000, 1)
| sort - p95_ms
| head 20
```

Try in Playground


Jump from a trace to its associated logs using the traceId.

```
source = logs-otel-v1*
| where traceId = '<your-trace-id>'
| sort time
```

### Services with both high error logs and slow traces

Combine log and trace signals to find the most problematic services.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| stats count() as error_logs by `resource.attributes.service.name`
| where error_logs > 10
| sort - error_logs
```

Try in playground →

Then investigate trace latency for those services:

```
source = otel-v1-apm-span-*
| where serviceName = '<service-from-above>'
| stats percentile(durationInNanos, 95) as p95, count() as spans by name
| eval p95_ms = round(p95 / 1000000, 1)
| sort - p95_ms
```

These queries produce results well-suited for dashboard visualizations.

```
source = otel-v1-apm-span-*
| stats count() as total_spans,
    sum(case(status.code = 2, 1 else 0)) as error_spans,
    avg(durationInNanos) as avg_latency_ns
  by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2),
    avg_latency_ms = round(avg_latency_ns / 1000000, 1)
| sort - error_rate
```

Try in Playground

Hourly log volume per service.

```
| eval hour = hour(time)
| stats count() as volume by `resource.attributes.service.name`, hour
| sort `resource.attributes.service.name`, hour
```

Try in playground →

Most frequent error messages.

```
| where severityText = 'ERROR'
| top 20 body
```

Try in playground →


Use eventstats to compute per-group aggregates without collapsing rows, then flag outliers that deviate significantly from their service’s baseline.

```
source = otel-v1-apm-span-*
| eventstats avg(durationInNanos) as svc_avg by serviceName
| eval deviation = durationInNanos - svc_avg
| where deviation > svc_avg * 2
| sort - deviation
| head 20
```

Try in Playground

Find spans that take more than 3x the service average — surface hidden performance outliers that percentile queries miss.
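The key `eventstats` behavior - a per-group aggregate attached to every row rather than collapsing the group - can be sketched locally in Python (the spans and their durations are invented):

```python
from collections import defaultdict

# Toy (serviceName, durationInNanos) rows; values are invented.
spans = [("cart", 100), ("cart", 100), ("cart", 100), ("cart", 100),
         ("cart", 1000), ("checkout", 50), ("checkout", 60)]

# eventstats avg(...) by serviceName: compute the per-group average,
# then attach it to every row instead of collapsing rows.
sums = defaultdict(lambda: [0, 0])  # service -> [sum, count]
for svc, dur in spans:
    sums[svc][0] += dur
    sums[svc][1] += 1
svc_avg = {svc: s / n for svc, (s, n) in sums.items()}

# eval deviation = dur - svc_avg | where deviation > svc_avg * 2
outliers = [(svc, dur) for svc, dur in spans
            if dur - svc_avg[svc] > svc_avg[svc] * 2]
print(outliers)
```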

Use streamstats to compute sliding-window aggregates over ordered events, ideal for detecting real-time latency regressions.

```
source = otel-v1-apm-span-*
| sort startTime
| streamstats window=20 avg(durationInNanos) as rolling_avg by serviceName
| eval current_ms = durationInNanos / 1000000, avg_ms = rolling_avg / 1000000
| where durationInNanos > rolling_avg * 3
| head 20
```

Try in Playground

Flag spans that exceed 3x the rolling 20-span average per service — catch latency spikes as they happen.
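The sliding-window idea behind `streamstats` can be replicated with a bounded deque. In this sketch the stream and window size (3 instead of 20) are invented, and each span is compared against the average of the *preceding* window, a slight simplification of `streamstats`:

```python
from collections import defaultdict, deque

# Toy span stream already sorted by startTime; durations invented.
stream = [("api", 100), ("api", 110), ("api", 90), ("api", 105), ("api", 500)]

# streamstats window=N avg(...) by serviceName: a per-service sliding
# average over the most recent events (window=3 here for brevity).
windows = defaultdict(lambda: deque(maxlen=3))
flagged = []
for svc, dur in stream:
    w = windows[svc]
    if w and dur > (sum(w) / len(w)) * 3:
        flagged.append((svc, dur))  # where durationInNanos > rolling_avg * 3
    w.append(dur)
print(flagged)
```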

Use trendline to compute simple moving averages over sorted data, making it easy to spot sustained performance shifts versus momentary noise.

```
source = otel-v1-apm-span-*
| trendline sort startTime sma(5, durationInNanos) as short_trend sma(20, durationInNanos) as long_trend
| eval short_ms = short_trend / 1000000, long_ms = long_trend / 1000000
| eval trend = if(short_ms > long_ms, 'degrading', 'improving')
| head 50
```

Try in Playground

Compare short-term (5-span) versus long-term (20-span) moving averages to classify whether latency is degrading or improving.
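The moving-average crossover logic is simple enough to verify by hand. A sketch with invented latencies and smaller windows (2 and 4 standing in for 5 and 20):

```python
def sma(values, n):
    """Simple moving average; None until n points have been seen."""
    return [None if i + 1 < n else sum(values[i + 1 - n:i + 1]) / n
            for i in range(len(values))]

# Invented latency series: stable, then a sustained jump.
latency = [100, 102, 98, 101, 300, 310, 305, 320]
short, long_ = sma(latency, 2), sma(latency, 4)

# Short SMA above long SMA = recent values pulling the trend up.
trend = ["degrading" if s > l else "improving"
         for s, l in zip(short, long_) if s is not None and l is not None]
print(trend)
```

Because the long window dilutes the jump, the short average crosses above it as soon as latency degrades and stays above while the regression persists.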


These multi-command pipelines combine several PPL features to solve real observability problems in a single query.

A complete service health dashboard in one query — error rates, latency percentiles, and automated health classification.

```
source = otel-v1-apm-span-*
| stats count() as total_spans,
    sum(case(status.code = 2, 1 else 0)) as error_spans,
    avg(durationInNanos) as avg_latency_ns,
    percentile(durationInNanos, 95) as p95_ns,
    percentile(durationInNanos, 99) as p99_ns
  by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2),
    avg_ms = round(avg_latency_ns / 1000000, 1),
    p95_ms = round(p95_ns / 1000000, 1),
    p99_ms = round(p99_ns / 1000000, 1),
    health = case(
      error_rate > 5, 'CRITICAL',
      error_rate > 1, 'DEGRADED',
      p99_ms > 5000, 'SLOW'
      else 'HEALTHY')
| sort - error_rate
```

Try in Playground

Combines stats, eval, and case to produce a single-query health scorecard across all services. Use this as a starting point for service-level dashboards.
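The `case()` classification is order-sensitive: branches are checked top to bottom and the first match wins, so a service with both a high error rate and high p99 reports CRITICAL, not SLOW. A local replica (the sample inputs are invented):

```python
def health(error_rate, p99_ms):
    # Mirrors the PPL case(): branches checked in order,
    # first match wins, else the default applies.
    if error_rate > 5:
        return "CRITICAL"
    if error_rate > 1:
        return "DEGRADED"
    if p99_ms > 5000:
        return "SLOW"
    return "HEALTHY"

print(health(7.2, 120), health(0.4, 8000), health(0.1, 45))
```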

Complete GenAI observability: latency, token usage, failure rate, and per-operation breakdown across all AI agents.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`)
| eval duration_ms = durationInNanos / 1000000,
    input_tokens = `attributes.gen_ai.usage.input_tokens`,
    output_tokens = `attributes.gen_ai.usage.output_tokens`,
    total_tokens = input_tokens + output_tokens
| stats count() as operations,
    avg(duration_ms) as avg_latency_ms,
    percentile(duration_ms, 95) as p95_ms,
    sum(total_tokens) as total_tokens,
    sum(case(status.code = 2, 1 else 0)) as failures
  by serviceName, `attributes.gen_ai.operation.name`, `attributes.gen_ai.system`
| eval failure_rate = round(failures * 100.0 / operations, 2),
    tokens_per_op = round(total_tokens / operations, 0)
| sort - total_tokens
```

Try in Playground

Breaks down every GenAI operation by service, operation type, and AI system. Use this to track cost drivers, identify high-failure operations, and compare AI provider performance.

Parse raw Envoy access logs into an API traffic dashboard — method, path, and status class breakdown.

```
source = logs-otel-v1*
| where `resource.attributes.service.name` = 'frontend-proxy'
| grok body '\[%{GREEDYDATA:timestamp}\] "%{WORD:method} %{URIPATH:path} HTTP/%{NUMBER}" %{POSINT:status}'
| where isnotnull(method)
| eval status_class = case(
    cast(status as int) < 200, '1xx',
    cast(status as int) < 300, '2xx',
    cast(status as int) < 400, '3xx',
    cast(status as int) < 500, '4xx'
    else '5xx')
| stats count() as requests by method, path, status_class
| sort - requests
| head 30
```

Try in Playground

Uses grok to extract structured fields from unstructured proxy logs, then aggregates into an API traffic summary. Adapt the grok pattern for other log formats.
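Grok patterns compile down to named-group regexes. A rough Python equivalent of the pattern above (URIPATH is simplified, and the sample Envoy-style log line is invented) shows what fields the query extracts:

```python
import re

# Rough regex equivalent of the grok pattern; URIPATH is simplified.
LINE_RE = re.compile(
    r'\[(?P<timestamp>.*)\] '
    r'"(?P<method>\w+) (?P<path>/[^\s"]*) HTTP/\d+(?:\.\d+)?" '
    r'(?P<status>[1-9]\d*)'
)

# Invented sample line in the shape the pattern expects.
line = '[2024-05-01T12:00:00Z] "GET /api/cart HTTP/1.1" 503'
fields = LINE_RE.search(line).groupdict()
fields["status_class"] = fields["status"][0] + "xx"  # e.g. '5xx'
print(fields)
```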

Cluster error messages into patterns per service with zero regex — PPL’s killer feature for incident triage.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| patterns body method=brain mode=aggregation by `resource.attributes.service.name`
| sort - pattern_count
| head 20
```

Try in Playground

The patterns command with method=brain uses ML-based clustering to group similar error messages. During an incident, run this first to see the shape of the problem without writing a single regex.

### Cross-signal correlation: logs meet traces

Correlate error logs with error spans across indices — find which trace operations cause which log errors.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| where traceId != ''
| left join left=l right=r on l.traceId = r.traceId [
    source = otel-v1-apm-span-*
    | where status.code = 2
    | eval span_duration_ms = durationInNanos / 1000000
    | sort - span_duration_ms
    | head 1000
  ]
| where isnotnull(r.serviceName)
| stats count() as correlated_errors by l.`resource.attributes.service.name`, r.serviceName, r.name
| sort - correlated_errors
| head 20
```

Try in Playground

Joins error logs with error spans using traceId to reveal which span operations are responsible for which log errors. This cross-index join is one of PPL’s most powerful capabilities for root cause analysis.
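The mechanics of this correlation - a left join on traceId, a null filter, then a count per triple - can be replicated in a few lines of Python with invented logs and spans:

```python
from collections import Counter

# Toy error logs and error spans sharing traceId values (invented).
error_logs = [{"traceId": "t1", "service": "frontend"},
              {"traceId": "t2", "service": "frontend"},
              {"traceId": "t1", "service": "cart"}]
error_spans = {"t1": ("checkout", "POST /charge")}  # traceId -> (svc, name)

# Left join on traceId, keep only rows with a matching span
# (where isnotnull(r.serviceName)), then count each triple.
correlated = Counter()
for log in error_logs:
    span = error_spans.get(log["traceId"])  # None = unmatched left row
    if span is not None:
        correlated[(log["service"],) + span] += 1
print(correlated.most_common())
```

The log for trace `t2` survives the join as an unmatched left row but is dropped by the null filter, exactly as in the PPL pipeline.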


OpenTelemetry attributes contain dots. Wrap them in backticks:

```
| where isnotnull(`resource.attributes.service.name`)
```

### Combine stats with eval for computed metrics
```
| stats count() as total, sum(case(severityText = 'ERROR', 1 else 0)) as errors by service
| eval error_pct = round(errors * 100.0 / total, 2)
```

Use `span()` for time-based bucketing:

```
| stats count() by span(time, 1m) as minute
```

Always add `| head` while exploring to avoid scanning all data:

```
| where severityText = 'ERROR'
| head 50
| sort - durationInNanos
```