
ml


The ml command applies machine learning algorithms from the ML Commons plugin directly in your PPL query pipeline. It supports anomaly detection using Random Cut Forest (RCF) and clustering using k-means, running train-and-predict operations in a single step.

Anomaly detection (time-series):

ml action='train' algorithm='rcf' time_field=<field> [<parameters>]

Anomaly detection (batch/non-time-series):

ml action='train' algorithm='rcf' [<parameters>]

K-means clustering:

ml action='train' algorithm='kmeans' [<parameters>]

Time-series RCF parameters:

| Argument | Required | Default | Description |
|---|---|---|---|
| time_field=<field> | Yes | | The timestamp field for time-series analysis. |
| number_of_trees=<int> | No | 30 | Number of trees in the forest. |
| shingle_size=<int> | No | 8 | Consecutive records in a shingle (sliding window). |
| sample_size=<int> | No | 256 | Sample size for stream samplers. |
| output_after=<int> | No | 32 | Minimum data points before results are produced. |
| time_decay=<float> | No | 0.0001 | Decay factor for stream samplers. |
| anomaly_rate=<float> | No | 0.005 | Expected anomaly rate (0.0 to 1.0). |
| date_format=<string> | No | yyyy-MM-dd HH:mm:ss | Format of the time field. |
| time_zone=<string> | No | UTC | Time zone of the time field. |
| category_field=<field> | No | | Group input by category; prediction runs independently per group. |

Batch RCF parameters:

| Argument | Required | Default | Description |
|---|---|---|---|
| number_of_trees=<int> | No | 30 | Number of trees in the forest. |
| sample_size=<int> | No | 256 | Random samples per tree from training data. |
| output_after=<int> | No | 32 | Minimum data points before results are produced. |
| training_data_size=<int> | No | Full dataset | Size of the training dataset. |
| anomaly_score_threshold=<float> | No | 1.0 | Score threshold above which a point is anomalous. |
| category_field=<field> | No | | Group input by category; prediction runs independently per group. |

K-means parameters:

| Argument | Required | Default | Description |
|---|---|---|---|
| centroids=<int> | No | 2 | Number of clusters. |
| iterations=<int> | No | 10 | Maximum iterations for convergence. |
| distance_type=<type> | No | EUCLIDEAN | Distance metric: COSINE, L1, or EUCLIDEAN. |
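
When queries are generated by a script, the syntax above can be assembled from a parameter mapping. The following is a minimal sketch only: `build_ml_command` is a hypothetical helper, not part of any OpenSearch client, and it assumes that string values are single-quoted and numeric values are written bare, as in the examples on this page.

```python
# Algorithms documented on this page.
SUPPORTED_ALGORITHMS = {"rcf", "kmeans"}


def build_ml_command(algorithm: str, **params) -> str:
    """Return an ml pipeline stage as a string (hypothetical helper).

    Assumption: string values are single-quoted and numbers are left bare,
    mirroring the query examples on this page.
    """
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    parts = ["ml", "action='train'", f"algorithm='{algorithm}'"]
    for name, value in params.items():
        if isinstance(value, str):
            parts.append(f"{name}='{value}'")
        else:
            parts.append(f"{name}={value}")
    return " ".join(parts)


# Build a time-series RCF stage with two tuned parameters.
stage = build_ml_command(
    "rcf", time_field="minute", anomaly_rate=0.001, shingle_size=16
)
print(stage)
```

The helper does not validate parameter names or ranges against the tables above, and it always quotes strings, whereas this page writes distance_type unquoted; a real script would need to handle both.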

The command appends these fields to each row:

Time-series RCF:

| Field | Description |
|---|---|
| score | Anomaly score (higher = more anomalous). |
| anomaly_grade | Anomaly grade (0.0 = normal, higher = more anomalous). |

Batch RCF:

| Field | Description |
|---|---|
| score | Anomaly score. |
| anomalous | Boolean indicating whether the point is anomalous (True/False). |

K-means:

| Field | Description |
|---|---|
| ClusterID | The cluster assignment (integer starting from 0). |

  • For time-series RCF, ensure data is ordered by the time field before passing to ml. The algorithm expects sequential data.
  • The output_after parameter controls the warm-up period. The first output_after data points (32 by default) have a score of 0 while the model learns normal patterns.
  • Batch RCF treats each data point independently, making it suitable for detecting outliers in non-sequential data.
  • K-means works best when numeric fields are on similar scales. Consider normalizing with eval before clustering.
  • Use category_field to run independent models per category (e.g., per service), avoiding cross-contamination between different baseline behaviors.
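
To illustrate the normalization advice for k-means, here is a minimal Python sketch of min-max scaling, one common way to put fields on similar scales. It is an illustration of the arithmetic only; the function name and sample values are made up, and the page does not prescribe a particular normalization method.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant field carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up per-service aggregates on very different scales.
avg_duration_ms = [2.0, 4.0, 6.0]
error_rate_pct = [0.0, 25.0, 50.0]

print(min_max_scale(avg_duration_ms))  # [0.0, 0.5, 1.0]
print(min_max_scale(error_rate_pct))   # [0.0, 0.5, 1.0]
```

After scaling, both fields span the same range, so neither dominates a distance calculation simply because of its units. The same arithmetic can be expressed with eval once the minimum and maximum of each field are known.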

Detect anomalous latency in time-series data


Aggregate span duration into 1-minute buckets and detect anomalies:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_latency by span(startTime, 1m) as minute
| ml action='train' algorithm='rcf' time_field='minute'
| where anomaly_grade > 0
| sort - anomaly_grade

Run independent anomaly detection per service:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_latency by span(startTime, 1m) as minute, serviceName
| ml action='train' algorithm='rcf' time_field='minute' category_field='serviceName'
| where anomaly_grade > 0
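
To run queries like these from a script, one option is the PPL REST endpoint. The sketch below uses only the Python standard library. It assumes an unauthenticated cluster at a base URL you supply, the `_plugins/_ppl` endpoint, and a response with `schema` and `datarows` keys; verify all of these against your deployment. The helper names are hypothetical and the sample response is made up.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_ppl_request(base_url: str, query: str) -> urllib.request.Request:
    """Build a POST request carrying a PPL query (assumed endpoint: /_plugins/_ppl)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_plugins/_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_as_dicts(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each row with the column names from the (assumed) schema/datarows response."""
    names = [column["name"] for column in response["schema"]]
    return [dict(zip(names, row)) for row in response["datarows"]]


# Made-up response, shaped like the assumed format, for illustration only.
sample_response = {
    "schema": [{"name": "avg_latency"}, {"name": "anomaly_grade"}],
    "datarows": [[1200000.0, 0.0], [9800000.0, 1.7]],
}
anomalies = [r for r in rows_as_dicts(sample_response) if r["anomaly_grade"] > 0]
print(anomalies)

# To send a real query (requires a reachable cluster):
# request = build_ppl_request("http://localhost:9200", "source = my-index | head 5")
# with urllib.request.urlopen(request) as reply:
#     rows = rows_as_dicts(json.load(reply))
```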

Batch outlier detection on request durations


Detect unusually slow spans without considering time ordering:

source = otel-v1-apm-span-*
| ml action='train' algorithm='rcf'
| where anomalous = 'True'

Cluster services by error and latency behavior


Use k-means to group services into behavioral clusters based on error rate and average latency:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_duration, count() as total, sum(case(status.code = 2, 1 else 0)) as errors by serviceName
| eval error_rate = errors * 100.0 / total
| ml action='train' algorithm='kmeans' centroids=3

To make time-series detection stricter and give the model more context, lower anomaly_rate and increase shingle_size:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_latency by span(startTime, 1m) as minute
| ml action='train' algorithm='rcf' time_field='minute' anomaly_rate=0.001 shingle_size=16
| where anomaly_grade > 0
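
The parameter table describes a shingle as a sliding window of consecutive records. The Python sketch below shows only what such windows look like for a short series. It is an illustration of the windowing idea under that description, not the plugin's implementation, and it says nothing about how RCF scores each window.

```python
from typing import List, Sequence, Tuple


def shingles(values: Sequence[float], shingle_size: int) -> List[Tuple[float, ...]]:
    """Return every run of shingle_size consecutive values, in order."""
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    last_start = len(values) - shingle_size
    return [tuple(values[i:i + shingle_size]) for i in range(last_start + 1)]


# Made-up per-minute latency averages.
series = [10.0, 11.0, 10.5, 30.0, 10.2]
for window in shingles(series, 3):
    print(window)
```

With a larger window, each evaluated point carries more of the preceding series with it, which is the "more context" that a higher shingle_size provides. A series shorter than the window yields no shingles at all.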

End-to-end latency anomaly investigation (OTel)


Detect anomalous latency spikes and then find the specific traces responsible:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_latency, max(durationInNanos) as max_latency by span(startTime, 5m) as window, serviceName
| ml action='train' algorithm='rcf' time_field='window' category_field='serviceName'
| where anomaly_grade > 0
| sort - anomaly_grade
| head 10

After identifying the anomalous time windows, inspect the slowest spans of an affected service. This example assumes the checkout service was flagged:

source = otel-v1-apm-span-*
| where serviceName = 'checkout'
| sort - durationInNanos
| head 20

Cluster OTel services by operational profile


Group services by their latency and throughput characteristics to identify operational tiers:

source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avg_duration, count() as throughput by serviceName
| eval avg_duration_ms = avg_duration / 1000000
| ml action='train' algorithm='kmeans' centroids=3 distance_type=EUCLIDEAN
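
The choice of distance_type can change which cluster a point joins. The Python sketch below uses the standard textbook definitions of the three metrics and assigns a point to its nearest centroid. It assumes the plugin's metrics follow these definitions, which this page does not state, and the centroids and point are made-up values.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


DISTANCES: Dict[str, Callable[[Vector, Vector], float]] = {
    "EUCLIDEAN": euclidean,
    "L1": l1,
    "COSINE": cosine,
}


def assign_cluster(point: Vector, centroids: Sequence[Vector],
                   distance_type: str = "EUCLIDEAN") -> int:
    """Return the index of the nearest centroid under the chosen metric."""
    distance = DISTANCES[distance_type]
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


centroids = [(1.0, 0.0), (2.0, 2.0)]
point = (1.0, 1.0)
for name in DISTANCES:
    print(name, assign_cluster(point, centroids, name))
```

In this example the point is nearest the first centroid under EUCLIDEAN and L1, but COSINE picks the second because it compares direction rather than magnitude. That difference is worth keeping in mind when the input fields differ widely in scale.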

Related commands:

  • stats: aggregate data before feeding to ML algorithms
  • eventstats: append aggregation results alongside original events
  • trendline: simple and weighted moving averages
  • eval: normalize fields before clustering