
# PPL Observability Examples

import { Tabs, TabItem, Aside } from '@astrojs/starlight/components';

These examples use real OpenTelemetry data from the Observability Stack. Each query runs against the live [playground](<https://observability.playground.opensearch.org/w/19jD-R/app/explore/logs/#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-6h,to:now))&_q=(dataset:(id:d1f424b0-2655-11f1-8baa-d5b726b04d73,timeFieldName:time,title:'logs-otel-v1*',type:INDEX_PATTERN),language:PPL,query:'')&_a=(legacy:(columns:!(body,severityText,resource.attributes.service.name),interval:auto,isDirty:!f,sort:!()),tab:(logs:(),patterns:(usingRegexPatterns:!f)),ui:(activeTabId:logs,showHistogram:!t))>) - click "Try in playground" to run any query instantly.

The Observability Stack uses these OpenTelemetry index patterns:

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Logs | `logs-otel-v1*` | time, body, severityText, severityNumber, traceId, spanId, resource.attributes.service.name |
| Traces | `otel-v1-apm-span-*` | traceId, spanId, parentSpanId, serviceName, name, durationInNanos, startTime, endTime, status.code |
| Service Map | `otel-v2-apm-service-map-*` | serviceName, destination.domain, destination.resource, traceGroupName |
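Beyond the playground UI, any PPL query can also be sent to a self-hosted cluster through the PPL plugin's `POST /_plugins/_ppl` REST endpoint. A minimal sketch in Python - the endpoint URL and the absence of auth are assumptions for illustration:

```python
import json

# Hypothetical self-hosted cluster endpoint -- the PPL plugin exposes
# POST /_plugins/_ppl; host, port, TLS, and auth are up to your setup.
PPL_ENDPOINT = "https://localhost:9200/_plugins/_ppl"

def build_ppl_request(query: str):
    """Return (url, headers, body) for a PPL query against the REST API."""
    body = json.dumps({"query": query}).encode()
    return PPL_ENDPOINT, {"Content-Type": "application/json"}, body

url, headers, body = build_ppl_request(
    "source = logs-otel-v1* | stats count() as log_count "
    "by `resource.attributes.service.name` | sort - log_count"
)
print(url)
```

Send the resulting request with any HTTP client; the response is JSON with a `schema` and `datarows` section.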

See which services are generating the most logs.

```
| stats count() as log_count by `resource.attributes.service.name`
| sort - log_count
```

Try in playground →

Filter for high-severity logs across all services.

```
| where severityText = 'ERROR' or severityText = 'FATAL'
| sort - time
```

Try in playground →

Calculate the error percentage for each service.

```
| stats count() as total,
    sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors
  by `resource.attributes.service.name`
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

Try in playground →
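The `sum(case(..., 1 else 0)) / count()` idiom above is just a conditional count divided by a total. A local replica in Python, using invented severity values:

```python
# Toy severity values (invented) mirroring the PPL pipeline:
# sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0))
rows = ["INFO", "ERROR", "INFO", "FATAL", "INFO", "INFO", "ERROR", "INFO"]
total = len(rows)
errors = sum(1 if s in ("ERROR", "FATAL") else 0 for s in rows)
error_rate = round(errors * 100.0 / total, 2)  # eval error_rate = ...
print(error_rate)
```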

Time-bucketed log volume - great for spotting traffic spikes.

```
| stats count() as volume by span(time, 5m) as time_bucket
```

Try in playground →

Distribution of log levels per service.

```
| stats count() as cnt by `resource.attributes.service.name`, severityText
| sort `resource.attributes.service.name`, - cnt
```

Try in playground →

Quick view of the noisiest services.

```
| top 10 `resource.attributes.service.name`
```

Try in playground →

Automatically cluster similar log messages - no regex required.

```
| patterns body
```

Try in playground →

Get one representative log per service.

```
| dedup `resource.attributes.service.name`
```

Try in playground →


Find the operations with the highest latency.

```
source = otel-v1-apm-span-*
| eval duration_ms = durationInNanos / 1000000
| sort - duration_ms
| head 20
```

Try in Playground

Find all spans with error status.

```
source = otel-v1-apm-span-*
| where status.code = 2
| sort - startTime
| head 20
```

Try in Playground

P50, P95, P99 latency for each service.

```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns,
    percentile(durationInNanos, 50) as p50_ns,
    percentile(durationInNanos, 95) as p95_ns,
    percentile(durationInNanos, 99) as p99_ns,
    count() as span_count
  by serviceName
| eval p50_ms = round(p50_ns / 1000000, 1),
    p95_ms = round(p95_ns / 1000000, 1),
    p99_ms = round(p99_ns / 1000000, 1)
| sort - p99_ms
```

Try in Playground
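Two things worth internalizing from this query: what a percentile is, and the nanoseconds-to-milliseconds conversion. A minimal sketch using the nearest-rank convention (PPL's exact interpolation may differ slightly) and invented durations:

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy -- one common
    convention; PPL's exact interpolation may differ slightly."""
    s = sorted(values)
    k = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[k]

# Invented span durations in nanoseconds; one slow outlier.
durations_ns = [1e6, 2e6, 3e6, 4e6, 100e6]
p95_ms = round(percentile(durations_ns, 95) / 1_000_000, 1)
print(p95_ms)
```

Note how the average would hide the outlier while p95/p99 surface it - which is why the query sorts by `p99_ms`.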

Error rate calculated from span status codes.

```
source = otel-v1-apm-span-*
| stats count() as total,
    sum(case(status.code = 2, 1 else 0)) as errors
  by serviceName
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

Try in Playground

How many spans does each trace produce? High fan-out can indicate N+1 queries or excessive tool calls.

```
source = otel-v1-apm-span-*
| stats count() as span_count by traceId
| sort - span_count
| head 20
```

Try in Playground

What operations does each service perform?

```
source = otel-v1-apm-span-*
| stats count() as invocations, avg(durationInNanos) as avg_latency by serviceName, name
| sort serviceName, - invocations
```

Try in Playground


These queries leverage the OpenTelemetry GenAI Semantic Conventions attributes that the Observability Stack captures for AI agent telemetry.

See what types of AI operations are occurring.

```
| stats count() as operations by `resource.attributes.service.name`, `attributes.gen_ai.operation.name`
```

Try in playground →

Track LLM token consumption across agents.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens,
    count() as calls
  by serviceName
| eval total_tokens = input_tokens + output_tokens
| sort - total_tokens
```

Try in Playground

Monitor token consumption trends.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens
  by span(startTime, 5m) as time_bucket
```

Try in Playground

Which AI systems are being used and how often?

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.system`)
| stats count() as requests,
    sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens,
    sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens
  by `attributes.gen_ai.system`
| sort - requests
```

Try in Playground

See which tools agents are calling and their performance.

```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'execute_tool'
| stats count() as executions,
    avg(durationInNanos) as avg_latency,
    max(durationInNanos) as max_latency
  by `attributes.gen_ai.tool.name`, serviceName
| eval avg_ms = round(avg_latency / 1000000, 1)
| sort - executions
```

Try in Playground

End-to-end latency for agent invocations.

```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'invoke_agent'
| eval duration_ms = durationInNanos / 1000000
| stats avg(duration_ms) as avg_ms,
    percentile(duration_ms, 95) as p95_ms,
    count() as invocations
  by serviceName, `attributes.gen_ai.agent.name`
| sort - p95_ms
```

Try in Playground

Find agent operations that resulted in errors.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`) and status.code = 2
| sort - startTime
| head 20
```

Try in Playground


Track the overall error rate trend - spot the moment things started going wrong.

```
| stats count() as total,
    sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors
  by span(time, 5m) as time_bucket
| eval error_pct = round(errors * 100.0 / total, 2)
| sort time_bucket
```

Try in playground →

Find when each service first started erroring - pinpoint the origin of an incident.

```
| where severityText = 'ERROR'
| stats earliest(time) as first_seen, count() as total_errors by `resource.attributes.service.name`
| sort first_seen
```

Try in playground →

Visualize error spikes per service over time - the Splunk-style timechart equivalent.

```
| where severityText = 'ERROR'
| timechart span=5m count() by `resource.attributes.service.name`
```

Try in playground →

Track latency degradation over time - the core SRE golden signal.

```
source = otel-v1-apm-span-*
| stats percentile(durationInNanos, 95) as p95_ns by span(startTime, 5m) as time_bucket, serviceName
| eval p95_ms = round(p95_ns / 1000000, 1)
| sort time_bucket
```

Try in Playground

Find the most expensive operations to target for optimization.

```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns, percentile(durationInNanos, 95) as p95_ns, count() as calls by serviceName, name
| eval avg_ms = round(avg_ns / 1000000, 1), p95_ms = round(p95_ns / 1000000, 1)
| sort - p95_ms
| head 20
```

Try in Playground


Jump from a trace to its associated logs using the traceId.

```
source = logs-otel-v1*
| where traceId = '<your-trace-id>'
| sort time
```

### Services with both high error logs and slow traces

Combine log and trace signals to find the most problematic services.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| stats count() as error_logs by `resource.attributes.service.name`
| where error_logs > 10
| sort - error_logs
```

Try in playground →

Then investigate trace latency for those services:

```
source = otel-v1-apm-span-*
| where serviceName = '<service-from-above>'
| stats percentile(durationInNanos, 95) as p95, count() as spans by name
| eval p95_ms = round(p95 / 1000000, 1)
| sort - p95_ms
```

These queries produce results well-suited for dashboard visualizations.

```
source = otel-v1-apm-span-*
| stats count() as total_spans,
    sum(case(status.code = 2, 1 else 0)) as error_spans,
    avg(durationInNanos) as avg_latency_ns
  by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2),
    avg_latency_ms = round(avg_latency_ns / 1000000, 1)
| sort - error_rate
```

Try in Playground

Hourly log volume per service.

```
| eval hour = hour(time)
| stats count() as volume by `resource.attributes.service.name`, hour
| sort `resource.attributes.service.name`, hour
```

Try in playground →

Most frequent error messages.

```
| where severityText = 'ERROR'
| top 20 body
```

Try in playground →


Use eventstats to compute per-group aggregates without collapsing rows, then flag outliers that deviate significantly from their service’s baseline.

```
source = otel-v1-apm-span-*
| eventstats avg(durationInNanos) as svc_avg by serviceName
| eval deviation = durationInNanos - svc_avg
| where deviation > svc_avg * 2
| sort - deviation
| head 20
```

Try in Playground

Find spans that take more than 3x the service average — surface hidden performance outliers that percentile queries miss.
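The key `eventstats` behavior - a per-group aggregate attached to every row rather than collapsing the group - can be sketched locally in Python (the spans and their durations are invented):

```python
from collections import defaultdict

# Toy (serviceName, durationInNanos) rows; values are invented.
spans = [("cart", 100), ("cart", 100), ("cart", 100), ("cart", 100),
         ("cart", 1000), ("checkout", 50), ("checkout", 60)]

# eventstats avg(...) by serviceName: compute the per-group average,
# then attach it to every row instead of collapsing rows.
sums = defaultdict(lambda: [0, 0])  # service -> [sum, count]
for svc, dur in spans:
    sums[svc][0] += dur
    sums[svc][1] += 1
svc_avg = {svc: s / n for svc, (s, n) in sums.items()}

# eval deviation = dur - svc_avg | where deviation > svc_avg * 2
outliers = [(svc, dur) for svc, dur in spans
            if dur - svc_avg[svc] > svc_avg[svc] * 2]
print(outliers)
```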

Use streamstats to compute sliding-window aggregates over ordered events, ideal for detecting real-time latency regressions.

```
source = otel-v1-apm-span-*
| sort startTime
| streamstats window=20 avg(durationInNanos) as rolling_avg by serviceName
| eval current_ms = durationInNanos / 1000000, avg_ms = rolling_avg / 1000000
| where durationInNanos > rolling_avg * 3
| head 20
```

Try in Playground

Flag spans that exceed 3x the rolling 20-span average per service — catch latency spikes as they happen.
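The sliding-window idea behind `streamstats` can be replicated with a bounded deque. In this sketch the stream and window size (3 instead of 20) are invented, and each span is compared against the average of the *preceding* window, a slight simplification of `streamstats`:

```python
from collections import defaultdict, deque

# Toy span stream already sorted by startTime; durations invented.
stream = [("api", 100), ("api", 110), ("api", 90), ("api", 105), ("api", 500)]

# streamstats window=N avg(...) by serviceName: a per-service sliding
# average over the most recent events (window=3 here for brevity).
windows = defaultdict(lambda: deque(maxlen=3))
flagged = []
for svc, dur in stream:
    w = windows[svc]
    if w and dur > (sum(w) / len(w)) * 3:
        flagged.append((svc, dur))  # where durationInNanos > rolling_avg * 3
    w.append(dur)
print(flagged)
```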

Use trendline to compute simple moving averages over sorted data, making it easy to spot sustained performance shifts versus momentary noise.

```
source = otel-v1-apm-span-*
| trendline sort startTime sma(5, durationInNanos) as short_trend sma(20, durationInNanos) as long_trend
| eval short_ms = short_trend / 1000000, long_ms = long_trend / 1000000
| eval trend = if(short_ms > long_ms, 'degrading', 'improving')
| head 50
```

Try in Playground

Compare short-term (5-span) versus long-term (20-span) moving averages to classify whether latency is degrading or improving.
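The moving-average crossover logic is simple enough to verify by hand. A sketch with invented latencies and smaller windows (2 and 4 standing in for 5 and 20):

```python
def sma(values, n):
    """Simple moving average; None until n points have been seen."""
    return [None if i + 1 < n else sum(values[i + 1 - n:i + 1]) / n
            for i in range(len(values))]

# Invented latency series: stable, then a sustained jump.
latency = [100, 102, 98, 101, 300, 310, 305, 320]
short, long_ = sma(latency, 2), sma(latency, 4)

# Short SMA above long SMA = recent values pulling the trend up.
trend = ["degrading" if s > l else "improving"
         for s, l in zip(short, long_) if s is not None and l is not None]
print(trend)
```

Because the long window dilutes the jump, the short average crosses above it as soon as latency degrades and stays above while the regression persists.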


These multi-command pipelines combine several PPL features to solve real observability problems in a single query.

A complete service health dashboard in one query — error rates, latency percentiles, and automated health classification.

```
source = otel-v1-apm-span-*
| stats count() as total_spans,
    sum(case(status.code = 2, 1 else 0)) as error_spans,
    avg(durationInNanos) as avg_latency_ns,
    percentile(durationInNanos, 95) as p95_ns,
    percentile(durationInNanos, 99) as p99_ns
  by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2),
    avg_ms = round(avg_latency_ns / 1000000, 1),
    p95_ms = round(p95_ns / 1000000, 1),
    p99_ms = round(p99_ns / 1000000, 1),
    health = case(
      error_rate > 5, 'CRITICAL',
      error_rate > 1, 'DEGRADED',
      p99_ms > 5000, 'SLOW'
      else 'HEALTHY')
| sort - error_rate
```

Try in Playground

Combines stats, eval, and case to produce a single-query health scorecard across all services. Use this as a starting point for service-level dashboards.
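The `case()` classification is order-sensitive: branches are checked top to bottom and the first match wins, so a service with both a high error rate and high p99 reports CRITICAL, not SLOW. A local replica (the sample inputs are invented):

```python
def health(error_rate, p99_ms):
    # Mirrors the PPL case(): branches checked in order,
    # first match wins, else the default applies.
    if error_rate > 5:
        return "CRITICAL"
    if error_rate > 1:
        return "DEGRADED"
    if p99_ms > 5000:
        return "SLOW"
    return "HEALTHY"

print(health(7.2, 120), health(0.4, 8000), health(0.1, 45))
```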

Complete GenAI observability: latency, token usage, failure rate, and per-operation breakdown across all AI agents.

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`)
| eval duration_ms = durationInNanos / 1000000,
    input_tokens = `attributes.gen_ai.usage.input_tokens`,
    output_tokens = `attributes.gen_ai.usage.output_tokens`,
    total_tokens = input_tokens + output_tokens
| stats count() as operations,
    avg(duration_ms) as avg_latency_ms,
    percentile(duration_ms, 95) as p95_ms,
    sum(total_tokens) as total_tokens,
    sum(case(status.code = 2, 1 else 0)) as failures
  by serviceName, `attributes.gen_ai.operation.name`, `attributes.gen_ai.system`
| eval failure_rate = round(failures * 100.0 / operations, 2),
    tokens_per_op = round(total_tokens / operations, 0)
| sort - total_tokens
```

Try in Playground

Breaks down every GenAI operation by service, operation type, and AI system. Use this to track cost drivers, identify high-failure operations, and compare AI provider performance.

Parse raw Envoy access logs into an API traffic dashboard — method, path, and status class breakdown.

```
source = logs-otel-v1*
| where `resource.attributes.service.name` = 'frontend-proxy'
| grok body '\[%{GREEDYDATA:timestamp}\] "%{WORD:method} %{URIPATH:path} HTTP/%{NUMBER}" %{POSINT:status}'
| where isnotnull(method)
| eval status_class = case(
    cast(status as int) < 200, '1xx',
    cast(status as int) < 300, '2xx',
    cast(status as int) < 400, '3xx',
    cast(status as int) < 500, '4xx'
    else '5xx')
| stats count() as requests by method, path, status_class
| sort - requests
| head 30
```

Try in Playground

Uses grok to extract structured fields from unstructured proxy logs, then aggregates into an API traffic summary. Adapt the grok pattern for other log formats.
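Grok patterns compile down to named-group regexes. A rough Python equivalent of the pattern above (URIPATH is simplified, and the sample Envoy-style log line is invented) shows what fields the query extracts:

```python
import re

# Rough regex equivalent of the grok pattern; URIPATH is simplified.
LINE_RE = re.compile(
    r'\[(?P<timestamp>.*)\] '
    r'"(?P<method>\w+) (?P<path>/[^\s"]*) HTTP/\d+(?:\.\d+)?" '
    r'(?P<status>[1-9]\d*)'
)

# Invented sample line in the shape the pattern expects.
line = '[2024-05-01T12:00:00Z] "GET /api/cart HTTP/1.1" 503'
fields = LINE_RE.search(line).groupdict()
fields["status_class"] = fields["status"][0] + "xx"  # e.g. '5xx'
print(fields)
```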

Cluster error messages into patterns per service with zero regex — PPL’s killer feature for incident triage.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| patterns body method=brain mode=aggregation by `resource.attributes.service.name`
| sort - pattern_count
| head 20
```

Try in Playground

The patterns command with method=brain uses ML-based clustering to group similar error messages. During an incident, run this first to see the shape of the problem without writing a single regex.

### Cross-signal correlation: logs meet traces

Correlate error logs with error spans across indices — find which trace operations cause which log errors.

```
source = logs-otel-v1*
| where severityText = 'ERROR'
| where traceId != ''
| left join left=l right=r on l.traceId = r.traceId [
    source = otel-v1-apm-span-*
    | where status.code = 2
    | eval span_duration_ms = durationInNanos / 1000000
    | sort - span_duration_ms
    | head 1000
  ]
| where isnotnull(r.serviceName)
| stats count() as correlated_errors by l.`resource.attributes.service.name`, r.serviceName, r.name
| sort - correlated_errors
| head 20
```

Try in Playground

Joins error logs with error spans using traceId to reveal which span operations are responsible for which log errors. This cross-index join is one of PPL’s most powerful capabilities for root cause analysis.
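The mechanics of this correlation - a left join on traceId, a null filter, then a count per triple - can be replicated in a few lines of Python with invented logs and spans:

```python
from collections import Counter

# Toy error logs and error spans sharing traceId values (invented).
error_logs = [{"traceId": "t1", "service": "frontend"},
              {"traceId": "t2", "service": "frontend"},
              {"traceId": "t1", "service": "cart"}]
error_spans = {"t1": ("checkout", "POST /charge")}  # traceId -> (svc, name)

# Left join on traceId, keep only rows with a matching span
# (where isnotnull(r.serviceName)), then count each triple.
correlated = Counter()
for log in error_logs:
    span = error_spans.get(log["traceId"])  # None = unmatched left row
    if span is not None:
        correlated[(log["service"],) + span] += 1
print(correlated.most_common())
```

The log for trace `t2` survives the join as an unmatched left row but is dropped by the null filter, exactly as in the PPL pipeline.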


OpenTelemetry attributes contain dots. Wrap them in backticks:

```
| where isnotnull(`resource.attributes.service.name`)
```

### Combine stats with eval for computed metrics
```
| stats count() as total, sum(case(severityText = 'ERROR', 1 else 0)) as errors by service
| eval error_pct = round(errors * 100.0 / total, 2)
```

Use `span()` for time-based bucketing:

```
| stats count() by span(time, 1m) as minute
```

Always add `| head` while exploring to avoid scanning all data:

```
| where severityText = 'ERROR'
| head 50
| sort - durationInNanos
```