# PPL Observability Examples
import { Tabs, TabItem, Aside } from '@astrojs/starlight/components';
These examples use real OpenTelemetry data from the Observability Stack. Each query runs against the live [playground](<https://observability.playground.opensearch.org/w/19jD-R/app/explore/logs/#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-6h,to:now))&_q=(dataset:(id:d1f424b0-2655-11f1-8baa-d5b726b04d73,timeFieldName:time,title:'logs-otel-v1*',type:INDEX_PATTERN),language:PPL,query:'')&_a=(legacy:(columns:!(body,severityText,resource.attributes.service.name),interval:auto,isDirty:!f,sort:!()),tab:(logs:(),patterns:(usingRegexPatterns:!f)),ui:(activeTabId:logs,showHistogram:!t))>) - click "Try in playground" to run any query instantly.
## Index patterns

The Observability Stack uses these OpenTelemetry index patterns:
| Signal | Index Pattern | Key Fields |
|---|---|---|
| Logs | `logs-otel-v1*` | `time`, `body`, `severityText`, `severityNumber`, `traceId`, `spanId`, `resource.attributes.service.name` |
| Traces | `otel-v1-apm-span-*` | `traceId`, `spanId`, `parentSpanId`, `serviceName`, `name`, `durationInNanos`, `startTime`, `endTime`, `status.code` |
| Service Map | `otel-v2-apm-service-map-*` | `serviceName`, `destination.domain`, `destination.resource`, `traceGroupName` |
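The queries below can also be submitted programmatically: OpenSearch exposes a `_plugins/_ppl` REST endpoint that accepts a JSON body with a single `query` field. A minimal sketch in Python (stdlib only; the `localhost:9200` host is a placeholder for your own cluster, and auth/TLS are omitted):

```python
import json

# The PPL plugin endpoint accepts {"query": "<ppl>"} as a POST body.
PPL_ENDPOINT = "/_plugins/_ppl"

def ppl_payload(query: str) -> str:
    """Serialize a PPL query into the JSON body the endpoint expects."""
    return json.dumps({"query": query})

body = ppl_payload(
    "source = logs-otel-v1* "
    "| stats count() as log_count by `resource.attributes.service.name` "
    "| sort - log_count"
)

# To actually send it (requires a reachable cluster -- placeholder host):
#   curl -XPOST http://localhost:9200/_plugins/_ppl \
#        -H 'Content-Type: application/json' -d "$body"
print(body)
```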
## Log investigation

### Count logs by service

See which services are generating the most logs.
```
| stats count() as log_count by `resource.attributes.service.name`
| sort - log_count
```

### Find error and fatal logs

Filter for high-severity logs across all services.
```
| where severityText = 'ERROR' or severityText = 'FATAL'
| sort - time
```

### Error rate by service

Calculate the error percentage for each service.
```
| stats count() as total, sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors by `resource.attributes.service.name`
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

### Log volume over time

Time-bucketed log volume - great for spotting traffic spikes.
```
| stats count() as volume by span(time, 5m) as time_bucket
```

### Severity breakdown by service

Distribution of log levels per service.
```
| stats count() as cnt by `resource.attributes.service.name`, severityText
| sort `resource.attributes.service.name`, - cnt
```

### Top log-producing services

Quick view of the noisiest services.
```
| top 10 `resource.attributes.service.name`
```

### Discover log patterns

Automatically cluster similar log messages - no regex required.
```
| patterns body
```

### Deduplicate logs by service

Get one representative log per service.
```
| dedup `resource.attributes.service.name`
```

## Trace analysis

### Slowest traces

Find the operations with the highest latency.
```
source = otel-v1-apm-span-*
| eval duration_ms = durationInNanos / 1000000
| sort - duration_ms
| head 20
```

### Error spans

Find all spans with error status.
```
source = otel-v1-apm-span-*
| where status.code = 2
| sort - startTime
| head 20
```

### Latency percentiles by service

P50, P95, and P99 latency for each service.
```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns, percentile(durationInNanos, 50) as p50_ns, percentile(durationInNanos, 95) as p95_ns, percentile(durationInNanos, 99) as p99_ns, count() as span_count by serviceName
| eval p50_ms = round(p50_ns / 1000000, 1), p95_ms = round(p95_ns / 1000000, 1), p99_ms = round(p99_ns / 1000000, 1)
| sort - p99_ms
```

### Service error rates

Error rate calculated from span status codes.
```
source = otel-v1-apm-span-*
| stats count() as total, sum(case(status.code = 2, 1 else 0)) as errors by serviceName
| eval error_rate = round(errors * 100.0 / total, 2)
| sort - error_rate
```

### Trace fan-out analysis

How many spans does each trace produce? High fan-out can indicate N+1 queries or excessive tool calls.
```
source = otel-v1-apm-span-*
| stats count() as span_count by traceId
| sort - span_count
| head 20
```

### Operations by service

What operations does each service perform?
```
source = otel-v1-apm-span-*
| stats count() as invocations, avg(durationInNanos) as avg_latency by serviceName, name
| sort serviceName, - invocations
```

## AI agent observability

These queries use the OpenTelemetry GenAI Semantic Conventions attributes that the Observability Stack captures for AI agent telemetry.

### GenAI operations breakdown

See what types of AI operations are occurring.
```
| stats count() as operations by `resource.attributes.service.name`, `attributes.gen_ai.operation.name`
```

### Token usage by agent

Track LLM token consumption across agents.
```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens, sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens, count() as calls by serviceName
| eval total_tokens = input_tokens + output_tokens
| sort - total_tokens
```

### Token usage over time

Monitor token consumption trends.
```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.usage.input_tokens`)
| stats sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens, sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens by span(startTime, 5m) as time_bucket
```

### AI system usage breakdown

Which AI systems are being used, and how often?
```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.system`)
| stats count() as requests, sum(`attributes.gen_ai.usage.input_tokens`) as input_tokens, sum(`attributes.gen_ai.usage.output_tokens`) as output_tokens by `attributes.gen_ai.system`
| sort - requests
```

### Tool execution analysis

See which tools agents are calling and how they perform.
```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'execute_tool'
| stats count() as executions, avg(durationInNanos) as avg_latency, max(durationInNanos) as max_latency by `attributes.gen_ai.tool.name`, serviceName
| eval avg_ms = round(avg_latency / 1000000, 1)
| sort - executions
```

### Agent invocation latency

End-to-end latency for agent invocations.
```
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'invoke_agent'
| eval duration_ms = durationInNanos / 1000000
| stats avg(duration_ms) as avg_ms, percentile(duration_ms, 95) as p95_ms, count() as invocations by serviceName, `attributes.gen_ai.agent.name`
| sort - p95_ms
```

### Failed agent operations

Find agent operations that resulted in errors.
```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`) and status.code = 2
| sort - startTime
| head 20
```

## SRE incident response

### Error rate percentage over time

Track the overall error rate trend - spot the moment things started going wrong.
```
| stats count() as total, sum(case(severityText = 'ERROR' or severityText = 'FATAL', 1 else 0)) as errors by span(time, 5m) as time_bucket
| eval error_pct = round(errors * 100.0 / total, 2)
| sort time_bucket
```

### First error occurrence per service

Find when each service first started erroring - pinpoint the origin of an incident.
```
| where severityText = 'ERROR'
| stats earliest(time) as first_seen, count() as total_errors by `resource.attributes.service.name`
| sort first_seen
```

### Error spike by service (timechart)

Visualize error spikes per service over time - the Splunk-style timechart equivalent.
```
| where severityText = 'ERROR'
| timechart span=5m count() by `resource.attributes.service.name`
```

### P95 latency timeseries by service

Track latency degradation over time - a core SRE golden signal.
```
source = otel-v1-apm-span-*
| stats percentile(durationInNanos, 95) as p95_ns by span(startTime, 5m) as time_bucket, serviceName
| eval p95_ms = round(p95_ns / 1000000, 1)
| sort time_bucket
```

### Slowest operations by service

Find the most expensive operations to target for optimization.
```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_ns, percentile(durationInNanos, 95) as p95_ns, count() as calls by serviceName, name
| eval avg_ms = round(avg_ns / 1000000, 1), p95_ms = round(p95_ns / 1000000, 1)
| sort - p95_ms
| head 20
```

## Cross-signal correlation

### Logs for a specific trace

Jump from a trace to its associated logs using the `traceId`.
```
source = logs-otel-v1*
| where traceId = '<your-trace-id>'
| sort time
```

### Services with both high error logs and slow traces

Combine log and trace signals to find the most problematic services.
```
source = logs-otel-v1*
| where severityText = 'ERROR'
| stats count() as error_logs by `resource.attributes.service.name`
| where error_logs > 10
| sort - error_logs
```

Then investigate trace latency for those services:
```
source = otel-v1-apm-span-*
| where serviceName = '<service-from-above>'
| stats percentile(durationInNanos, 95) as p95, count() as spans by name
| eval p95_ms = round(p95 / 1000000, 1)
| sort - p95_ms
```

## Dashboard-ready queries

These queries produce results well-suited for dashboard visualizations.
### Service health summary (data table)

```
source = otel-v1-apm-span-*
| stats count() as total_spans, sum(case(status.code = 2, 1 else 0)) as error_spans, avg(durationInNanos) as avg_latency_ns by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2), avg_latency_ms = round(avg_latency_ns / 1000000, 1)
| sort - error_rate
```

### Log volume heatmap (by service and hour)
```
| eval hour = hour(time)
| stats count() as volume by `resource.attributes.service.name`, hour
| sort `resource.attributes.service.name`, hour
```

### Top error messages
```
| where severityText = 'ERROR'
| top 20 body
```

## Advanced analytics
### Outlier detection with eventstats

Use `eventstats` to compute per-group aggregates without collapsing rows, then flag outliers that deviate significantly from their service's baseline.
```
source = otel-v1-apm-span-*
| eventstats avg(durationInNanos) as svc_avg by serviceName
| eval deviation = durationInNanos - svc_avg
| where deviation > svc_avg * 2
| sort - deviation
| head 20
```

This finds spans that take more than 3x their service's average, surfacing hidden performance outliers that percentile queries miss.
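The eventstats idea - attach each row's group aggregate to the row, then filter on the deviation - can be sketched in plain Python with hypothetical span data:

```python
from statistics import mean

# Spans as (service, duration_ns) pairs -- hypothetical sample data.
spans = [
    ("checkout", 100), ("checkout", 100), ("checkout", 100),
    ("checkout", 100), ("checkout", 100), ("checkout", 1000),
    ("payment", 50), ("payment", 55), ("payment", 60),
]

# eventstats: compute the per-service average without collapsing rows.
groups = {}
for service, duration in spans:
    groups.setdefault(service, []).append(duration)
svc_avg = {service: mean(durations) for service, durations in groups.items()}

# eval + where: keep rows whose deviation exceeds 2x the group average,
# i.e. rows more than 3x their service's mean duration.
outliers = [
    (service, duration)
    for service, duration in spans
    if duration - svc_avg[service] > svc_avg[service] * 2
]
print(outliers)
```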
### Rolling window analysis with streamstats

Use `streamstats` to compute sliding-window aggregates over ordered events - ideal for detecting latency regressions in near real time.
```
source = otel-v1-apm-span-*
| sort startTime
| streamstats window=20 avg(durationInNanos) as rolling_avg by serviceName
| eval current_ms = durationInNanos / 1000000, avg_ms = rolling_avg / 1000000
| where durationInNanos > rolling_avg * 3
| head 20
```

This flags spans that exceed 3x the rolling 20-span average for their service, catching latency spikes as they happen.
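A sliding-window average like `streamstats window=20` can be mimicked with a bounded deque. This sketch uses a window of 5 and made-up durations, and assumes the window includes the current event:

```python
from collections import deque

# Durations for one service, ordered by start time -- hypothetical sample.
durations = [100, 110, 105, 95, 100, 900, 100, 105]

WINDOW = 5                    # the query above uses window=20
window = deque(maxlen=WINDOW)
flagged = []

for d in durations:
    window.append(d)          # window includes the current span
    rolling_avg = sum(window) / len(window)
    if d > rolling_avg * 3:   # same 3x threshold as the query
        flagged.append(d)

print(flagged)
```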
### Smoothed latency trends with trendline

Use `trendline` to compute simple moving averages over sorted data, making it easy to distinguish sustained performance shifts from momentary noise.
```
source = otel-v1-apm-span-*
| trendline sort startTime sma(5, durationInNanos) as short_trend sma(20, durationInNanos) as long_trend
| eval short_ms = short_trend / 1000000, long_ms = long_trend / 1000000
| eval trend = if(short_ms > long_ms, 'degrading', 'improving')
| head 50
```

Comparing the short-term (5-span) and long-term (20-span) moving averages classifies whether latency is degrading or improving.
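The degrading/improving classification is just a moving-average crossover. A small Python sketch with a hypothetical latency series (shorter windows than the query, for brevity):

```python
def sma(values, window):
    """Trailing simple moving average; None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

# Latency series trending upward -- hypothetical sample.
latencies = [100, 100, 100, 100, 100, 100, 150, 200, 250, 300]

short = sma(latencies, 2)   # the query uses sma(5, ...)
long_ = sma(latencies, 5)   # ... and sma(20, ...)

# Short SMA above long SMA means recent latency is worse than baseline.
trend = [
    "degrading" if s > l else "improving"
    for s, l in zip(short, long_)
    if s is not None and l is not None
]
print(trend[-1])
```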
## Masterclass pipelines

These multi-command pipelines combine several PPL features to solve real observability problems in a single query.

### Service health scorecard

A complete service health dashboard in one query - error rates, latency percentiles, and automated health classification.
```
source = otel-v1-apm-span-*
| stats count() as total_spans, sum(case(status.code = 2, 1 else 0)) as error_spans, avg(durationInNanos) as avg_latency_ns, percentile(durationInNanos, 95) as p95_ns, percentile(durationInNanos, 99) as p99_ns by serviceName
| eval error_rate = round(error_spans * 100.0 / total_spans, 2), avg_ms = round(avg_latency_ns / 1000000, 1), p95_ms = round(p95_ns / 1000000, 1), p99_ms = round(p99_ns / 1000000, 1), health = case(error_rate > 5, 'CRITICAL', error_rate > 1, 'DEGRADED', p99_ms > 5000, 'SLOW' else 'HEALTHY')
| sort - error_rate
```

Combines `stats`, `eval`, and `case` to produce a single-query health scorecard across all services. Use this as a starting point for service-level dashboards.
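The health classification relies on `case()` returning the first matching branch. A Python mirror of that ordering, with the thresholds copied from the query:

```python
def classify(error_rate: float, p99_ms: float) -> str:
    """First matching branch wins, mirroring the case() expression."""
    if error_rate > 5:
        return "CRITICAL"
    if error_rate > 1:
        return "DEGRADED"
    if p99_ms > 5000:
        return "SLOW"
    return "HEALTHY"

# A service with low errors but terrible tail latency is SLOW, not HEALTHY.
print(classify(0.4, 6200))
```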
### GenAI agent cost and performance analysis

Complete GenAI observability in one pipeline: latency, token usage, failure rate, and a per-operation breakdown across all AI agents.
```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.operation.name`)
| eval duration_ms = durationInNanos / 1000000, input_tokens = `attributes.gen_ai.usage.input_tokens`, output_tokens = `attributes.gen_ai.usage.output_tokens`, total_tokens = input_tokens + output_tokens
| stats count() as operations, avg(duration_ms) as avg_latency_ms, percentile(duration_ms, 95) as p95_ms, sum(total_tokens) as total_tokens, sum(case(status.code = 2, 1 else 0)) as failures by serviceName, `attributes.gen_ai.operation.name`, `attributes.gen_ai.system`
| eval failure_rate = round(failures * 100.0 / operations, 2), tokens_per_op = round(total_tokens / operations, 0)
| sort - total_tokens
```

Breaks down every GenAI operation by service, operation type, and AI system. Use this to track cost drivers, identify high-failure operations, and compare AI provider performance.
### Envoy access log analysis

Parse raw Envoy access logs into an API traffic dashboard - method, path, and status-class breakdown.
```
source = logs-otel-v1*
| where `resource.attributes.service.name` = 'frontend-proxy'
| grok body '\[%{GREEDYDATA:timestamp}\] "%{WORD:method} %{URIPATH:path} HTTP/%{NUMBER}" %{POSINT:status}'
| where isnotnull(method)
| eval status_class = case(cast(status as int) < 200, '1xx', cast(status as int) < 300, '2xx', cast(status as int) < 400, '3xx', cast(status as int) < 500, '4xx' else '5xx')
| stats count() as requests by method, path, status_class
| sort - requests
| head 30
```

Uses `grok` to extract structured fields from unstructured proxy logs, then aggregates them into an API traffic summary. Adapt the grok pattern for other log formats.
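Grok patterns compile down to regular expressions, so the same extraction can be prototyped with Python's `re` module. The sample log line below is hypothetical:

```python
import re

# Named groups mirroring the grok pattern: timestamp, method, path, status.
ENVOY_RE = re.compile(
    r'\[(?P<timestamp>.+?)\] '
    r'"(?P<method>\w+) (?P<path>/\S*) HTTP/[\d.]+" '
    r'(?P<status>\d+)'
)

line = '[2024-05-01T12:00:00Z] "GET /api/cart HTTP/1.1" 503'
fields = ENVOY_RE.match(line).groupdict()

# Same status-class bucketing as the case() expression in the query.
status = int(fields["status"])
status_class = f"{status // 100}xx"
print(fields["method"], fields["path"], status_class)
```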
### Automatic error pattern discovery

Cluster error messages into patterns per service with zero regex - PPL's killer feature for incident triage.
```
source = logs-otel-v1*
| where severityText = 'ERROR'
| patterns body method=brain mode=aggregation by `resource.attributes.service.name`
| sort - pattern_count
| head 20
```

The `patterns` command with `method=brain` uses ML-based clustering to group similar error messages. During an incident, run this first to see the shape of the problem without writing a single regex.
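The brain algorithm itself is ML-based clustering; a much cruder approximation - masking digit runs so variable tokens collapse into a shared template - conveys the idea in a few lines of Python (hypothetical messages, not the real algorithm):

```python
import re
from collections import Counter

# Hypothetical error messages with variable IDs and durations.
logs = [
    "timeout connecting to db-7 after 30s",
    "timeout connecting to db-2 after 45s",
    "user 1234 not found",
    "user 987 not found",
    "timeout connecting to db-9 after 30s",
]

def template(message: str) -> str:
    # Mask digit runs so variable tokens collapse into one pattern.
    return re.sub(r"\d+", "<*>", message)

patterns = Counter(template(m) for m in logs)
for pattern, count in patterns.most_common():
    print(count, pattern)
```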
### Cross-signal correlation: logs meet traces

Correlate error logs with error spans across indices - find which trace operations cause which log errors.
```
source = logs-otel-v1*
| where severityText = 'ERROR'
| where traceId != ''
| left join left=l right=r on l.traceId = r.traceId [ source = otel-v1-apm-span-* | where status.code = 2 | eval span_duration_ms = durationInNanos / 1000000 | sort - span_duration_ms | head 1000 ]
| where isnotnull(r.serviceName)
| stats count() as correlated_errors by l.`resource.attributes.service.name`, r.serviceName, r.name
| sort - correlated_errors
| head 20
```

Joins error logs with error spans on `traceId` to reveal which span operations are responsible for which log errors. This cross-index join is one of PPL's most powerful capabilities for root cause analysis.
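The join-then-filter shape - a left join on traceId followed by dropping unmatched rows - behaves like an inner join. A Python sketch with hypothetical rows:

```python
from collections import Counter

# Hypothetical error logs and error spans, keyed by traceId.
logs = [
    {"traceId": "t1", "service": "cart"},
    {"traceId": "t2", "service": "cart"},      # no matching span: dropped
    {"traceId": "t3", "service": "checkout"},
]
spans = {
    "t1": {"serviceName": "payment", "name": "charge"},
    "t3": {"serviceName": "payment", "name": "charge"},
}

# Join on traceId, drop rows with no matching span, then count by the
# (log service, span service, span operation) triple.
correlated = Counter(
    (log["service"],
     spans[log["traceId"]]["serviceName"],
     spans[log["traceId"]]["name"])
    for log in logs
    if log["traceId"] in spans
)
print(correlated.most_common())
```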
## Query tips

### Backtick field names with dots

OpenTelemetry attribute names contain dots. Wrap them in backticks:
```
| where isnotnull(`resource.attributes.service.name`)
```

### Combine stats with eval for computed metrics
```
| stats count() as total, sum(case(severityText = 'ERROR', 1 else 0)) as errors by service
| eval error_pct = round(errors * 100.0 / total, 2)
```

### Use span() for time bucketing
```
| stats count() by span(time, 1m) as minute
```

### Use head to limit during exploration
Always add `| head` while exploring to avoid scanning all data:
```
| where severityText = 'ERROR'
| head 50
```

### Sort with - for descending
```
| sort - durationInNanos
```

## Further reading
- PPL Language Overview - Why PPL and how it compares
- Command Reference - Full syntax for all commands
- Function Reference - 200+ built-in functions
- Discover Logs - Using PPL in the Logs UI
- Discover Traces - Using PPL in the Traces UI