Unsupervised real time anomaly detection
May 18, 2020 • 20 min read
 1. What is an anomaly?
 2. Metric streams as input
 3. The realtime anomaly detection pipeline
 4. Model selection
 5. Is multivariate deep learning our way?
 6. From deep learning to gradient boosting
 7. From single time series anomaly detection to multivariate metrics alerts
 8. From gradient boosting to neocortex models
 Conclusion
 References
Most modern application systems consist of multiple middleware components. This includes databases, queues, search engines, storage, caches, and inmemory data grids, identity services, etc. They also include multiple stateful or stateless microservices and mobile application proxies — all connected by data and processing flows.
Each component is a set of physical or virtual machines and hosts middleware or service applications. They all produce system and application logs and usually collect system and application metrics through metric collection agents. These logs and metrics are aggregated by logging and monitoring services and serve as data sources for future analysis and anomaly detection.
In this article we will present a solution for realtime anomaly detection through component metrics aggregated in time series. The solution consists of the following components:
 History metrics time series analysis and preprocessing
 ML models design, training, and evaluation
 ML platform and data science pipeline implementation with the cloud platform
 Realtime data ingestion and preprocessing
 Model deployment and serving
 Anomaly detection algorithms with realtime inferences
 New metrics onboarding
 Alerts
 Anomaly detection visualizations
 Anomaly labeling tools to determine the user experience in anomaly detection
Although in this paper we refer to anomaly detection as occurring in realtime, it is actually a quasi realtime approach. In reality, metric values are aggregated in a certain time window – so there is a slight delay between the actual anomaly time and its availability in the anomaly highlights. However, this delay is usually no longer than one to five minutes, so it is of minor significance.
1. What is an anomaly?
The term “anomaly” can have various meanings, both formal and mathematical. In this particular case (time series statistics concept) it refers to an outlier – a value that is outside of certain thresholds. Additionally, the term anomaly can be applied to specific time points or longer time periods where a continuously abnormal behaviour of time series values is observed.
Currently, the solution defines an anomaly as any behaviour of a time series value that is different from the typical signal pattern within a set time frame. The detection system includes a model that is trained to be able to classify, define, and predict a signal pattern and the measured difference to a threshold boundary. The size of this difference is called the anomaly score and its nature influences the calculation and modeling of the threshold boundary. In a nutshell, anomalies can be observed both as signal spikes or sharp drops, as well as longer, more continuous growth spans or short term oscillations.
An abnormal state depends on the situational context, an analyst/specialist’s interpretation, as well as domain knowledge. The situational context refers to taking into account other connected metrics and related components. All anomalies should be monitored by specialists such as production support engineers who have the power to control and validate them.
1.1 Anomaly examples
 MySQL DB throughput (requests per second):
2. MySQL cluster management node write bytes count
3. Application memory use in MB
4. Correlated metrics anomalies:
a. Application API throughput (transactions per second) with status code 200 is continuously falling:
b. At the same time application API latency is showing abnormal growth:
c. Finally, an application API showing a spike in the number of errors with status code 502:
2. Metric streams as input
Realtime anomaly detection requires instantaneous ingestion of metric values however, this process is dependent on the characteristics of the data sources.
Data source aspects we face are:
 API interaction type: push or pull.
 API protocol.
 Data aggregation functions, aggregation window, and time steps between these windows.
 API limit quotas: number of requests per minute, response size, request execution time.
The solution provides design and implementation of Spark Streaming [Streaming Programming Guide] and custom implementation of HTTP receivers to pull data via the datasource API. For example, StackDriver Monitoring uses the REST API to fetch metrics time series.
We implemented a custom receiver to read data from the Google Pub/Sub topic, where messages are assorted by StackDriver Logging Export sinks.
The streaming processes are measured with “batch duration”, which is the time between readfromstream iterations. Meanwhile, the HTTP receivers are required to be set for the task period. By tasks we understand the implementation of runnable interfaces scheduled by the Scheduled Executor Service. The task fetches (pulls) data from the datasource API and applies the processing logic, which includes actions such as metric construction, augmentation, time alignment, and persistence.
2.1 Metric construction
Metric is a fourvector involving: metric name, time, value, and source. Its name string consists of: component type, metric type, and metric unit if required.
In practice it appears as follows: [mysql][queryruntime][sec]. The time is represented in ISO 8601 format and UTC time zone: 20200201T10:00:00Z and is stored as a string or a timestamp. When it comes to value, it has a double floating point type. The metric datasource is described as a string of the “source”.
2.2 Metric augmentation
Another important concept is metric time series data spaces (or missing values). Data space is a metric time point, where datasource does not have any value.
The following three approaches are used in responding to the missed data:
1. Leave spaces in their place – it is applied for metrics time series with a high degree of sparsit, for example when metrics have many data spaces along with historical time periods and/or metrics are rarely collected. In this case it is possible to fill (augment) data spaces with zeros, but only if zero is a minimal value.
2. Linear interpolation is used for historical data sets and consumed for model training.
3. Missed data replacement with values predicted by the trained regression models.
2.3 The alignment of the time of a metric to the fetch time
First of all, we need to define a time scale for the fetch procedure and anomaly detection. It depends on the metric value aggregation window size, fetch time, and anomaly detection duration. It is required to select a maximum window size or duration. In our case the best time scale is one minute.
Ingesting jobs logic involves datasource API query time range construction. It would be easy to use the current time as the left boundary and through shifting it by some value as the query time range. The shifting value is selected by using the last data point availability property, which is guaranteed by datasource.
Unfortunately, the current time is not a good core time point. You need to keep in mind that the number of metrics is increased by new, continuously onboarding metrics and domains, so job tasks should be scheduled in parallel. However, the number of threads depends on the number of CPU cores, which is a limited resource. Therefore, the tasks are distributed to parallel batches and each batch is executed sequentially. Unfortunately, if the datasource ingesting queries have spikes in latency, it leads to some delays in the starting points of the job.
Luckily, there is a better approach: the pair (job creation time, job period) can be moved step by step over the time scale, fetching required time ranges of data. In this case, the job period is the time period between successive executions. For example:
Initially <fetch time point> := <job creation time>. And for each iteration <fetch time point> : = <fetch time point> + <job period>.
2.4 Metric persistence
The fetched data can be persisted for the following reasons:
 Time series preprocessing may require several steps
 To meet the requirement of collecting historical data for model training (datasource can only store so much, so it is essential to avoid a situation where it provides only last month’s time series values)
 Persisted time series are used for distribution and abnormal behaviour visualizations for both the current time as well as times in the past
Our preferred persisting storage is Elasticsearch cluster and on this platform each metric time series has two replicas. Moreover, the index contains index_name field and timestamp field together with [metric_name, time, value, source].
3. The realtime anomaly detection pipeline
The anomaly detection pipeline is implemented as a workflow of streaming jobs in iterations. We can distinguish three types of streaming jobs:
 Data ingestion – collects and processes input datasource data and persists it to Elasticsearch. It can also temporarily store and aggregate data from DB queue in the case of the pushdata source
 Anomaly detection – fetches metrics data chunks from Elasticsearch and applies the anomaly detection inference logic. The anomaly detection algorithm makes an online prediction by trained model. It is also able to persist abnormal points and predictions to Elasticsearch and visualize them in aggregation through the anomalies dashboard
 Alerting – fetches abnormal points from Elasticsearch and filters them by configured business rules, which are composed as crossmetrics aggregations and compared with already configured or learned thresholds. Some business rules require online predictions provided by classification models. If rules are met, the alerts are sent to consumers as email notifications
The anomaly detection workflow and its components are driven and configured by the configuration database, which stores the following:
 Metrics metadata: metrics aliases, metrics tiers, and metrics relations
 Ingestion configuration: time ranges, datasource types, and endpoints
 Anomaly detection rules: time ranges, model references, and thresholds
 Alerting rule configurations
 Anomaly visualization configurations
4. Model selection
There are two types of consumption models:
 Anomaly detection models
 Alerting models
Anomaly detection models are used to predict either the metrics time series value or model structure states for analysed time points. Results of this model’s usage are utilized by anomaly detection algorithms along with anomaly detection streaming jobs. Such models are designed and trained for single or multivariate time series.
Alerts models are used for multivariate anomaly classification or for the clusterization that is needed to detect the abnormal points combination importance on an analysed time frame.
The following provides an example of the model selection process. The design of the solution and evaluation have passed through the model selection.
First, we processed the formal selection criteria:
 In the case of regression models, we verified the R2 score:
$$
\begin{aligned}
R^{2} = 1 \frac {\sum^N_{i=1}(y_iy^i)2}{\sum^N_{i=1}(y_iy)2},\text{where}\\
\end{aligned}
$$
$$
\begin{aligned}
& y_i \text{metric observed value at time point i;} \\
& ŷ_i \text{predicted value of metric at time point i;}\\
& ȳ_i = \frac 1 {N} \sum^N_{i=1} y_i \text{meaning of observed data.}\\
\end{aligned}
$$
 Adjusted R2 score for selection from regression models with various sets of features:
$$
\begin{aligned}
R^{2} = 1 – (1 – R^2) \frac {N – 1} {NP1} \text{,where}\\
\text {P – the total number of explanatory variables in the model.}\\
\end{aligned}
$$
 Classification models are evaluated with Recall, Precision, Fmeasure:
$$
\begin{aligned}
Precision = \frac {TP} {TP + FP}, Recall = \frac {TP} {TP + FN}, F = 2⋅\frac {Precision⋅Recall} {Precision+Recall}
\end{aligned}
$$
Even though the next portion of the criteria is much more informal, it is significantly more important.
Diagram presenting the importance of the features:
 Prediction curve:
 Production engineer feedback – Anomalies are delivered to production engineers via email alerts. Emails contain all the details for abnormal time points and references to the relevant dashboards. It allows the production engineers to monitor and provide feedback about anomaly detection quality.
The solution passed the evolution road with the following milestones:
5. Is multivariate deep learning our way?
After analyzing multiple papers on anomaly detection, we were impressed by the deep learning approach described in [A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time…]. It is based on Convolutional LSTM autoencoders training and inference. After spending a significant amount of time investigating various approaches for demand prediction with machine learning, collecting enough cases with recurrent neural networks and their ensembles implementations, we decided that would be our starting point.
Firstly, we fetched the history dataset, which was three months worth of data aggregated by the minute metrics and produced by the REST services (20). We focused on two types of metrics: latency, and traffic (status code 200 requests per second).
Mathematically we had N time series (one for each metric)
$$
\begin{aligned}
\{X_1, X_2, …, X_N\}, X_i =\{X_{i,1},X_{i,2},…,X_{i,T}\};
\end{aligned}
$$
where T is the number of time series time points. This data was processed by developed Spark jobs, which executed the following actions:
a. Created time series segments
$$
X_i^{t, w}=\{x_{i, t w}, x_{i, t – w + 1}, …, x_{i, t – 1}, x_{i,t}\},
$$
for each time point t and scale w, where number and values for scale w were selected by data analysis
b. Calculated a signature matrix for each time point t and scale w:
$$
\begin{aligned}
M^{t,w}=\{a _ {i,j} & ^{{t,w}\}}{N,N}_ {i=1,j=1}, a_{i,j} ^{t,w}=\frac {1}{w}=\sum^w_{\delta=0}x_{i,t – \delta}x_{j,t – \delta}
\end{aligned}
$$
Mathematically, you can see that we started with pairwise correlations between the metrics time series. We believe that anomaly detection by metrics relations can be an approach useful for future RCA.
Each matrix has size N x N and subsequently, signature matrices were united and stored for each time point and all scales w as tensor:
$$
\begin{aligned}
\Phi^t = {M^{{t,w_1},M}{t,w_2}, M^{t,w_3}}
\end{aligned}
$$
where w1, w2, w3 – are selected scales. Created tensors were split by consuming types: “train”, “validation”, and “test”.
We then implemented a convolutional LSTM autoencoder model consuming design described in the paper, where LSTM neural network units are more useful for time series prediction and convolution layers sequences are better for encoding.
We trained models with Google AI Platform resources tuning parameters with Bayesian optimization. The R2 score for the testing split was close to zero, which is why we got an unexpected regression measure and tried the evaluation on the next level: anomaly classification.
The anomaly detection task is about finding the time segments where anomalies were raised. So it is a classification task to define a class of time segment: normal or abnormal.
Our classification task was based on an anomaly score calculation algorithm for the residual matrices, where the residual matrix is the difference between input signature tensors $Phi^t$ and is predicted by the trained autoencoder tensor $hat{Phi}^t$:
$$
R_t,w = absolute(M^{t,w} – \widehat{{M}^{t,w}})
$$
where ${\widehat{{M}^{t,w}}}$ – predicted signature matrix;
With estimating parameters of normal distribution N(, ), we calculated the anomaly score as:$ A^{t,w} = \frac {1}{\sigma}(R_{t,w} – μ)^T(R_{t,w} – μ)$
Time segment is “abnormal” (has anomalies) if an anomaly score equals $A^{t,w}>τ$ where τ is the classification threshold. It is calculated by maximizing the betascore when the “abnormal” class is positive but the “normal” class is negative:
$$
F_\beta = (1 + \beta^2)\frac{precision ⋅ recall}{\beta^2precision+recall}, \beta < 1 .
$$
To achieve it, we used a test time series split, which we labeled with manual data analysis and defined normal and abnormal time segments. Maximizing the $F_beta$ score, we detected threshold $τ$ using minimal scale $w$. We then applied the found threshold to classify the analysed time segments.
Lastly, we applied the found anomaly score parameters for the second part of the test split and handled the significant abnormal points as presented below:
However, the number of correlated anomalies was small and the number of falsepositives and falsenegatives significantly outnumbered the truepositives.
Classification metrics were:
 Precision ~ 0.25
 Recall ~ 0.18
 Fmeasure ~ 0.21
Unfortunately, we weren’t satisfied with the results so we came to the following conclusions:
 The number of metrics’ correlations is low by Pearson because they were fetched for the highlevel services and not connected directly
 The metrics’ correlations were adjusted, which means an alignment delay in the abnormal time points between them
 Our dataset was not sufficient – it only consisted of the last three months, which was not enough to handle the required pattern
 The labeling quality was inadequate – there was no production engineering experience
Ultimately, we were dealing with data, model, and experience issues all at once. Our next step was to make a decomposition of our previous task to design a more appropriate model with much better interpretation.
6. From deep learning to gradient boosting
We decided to simplify our approach in the following directions:
 Make the domain simpler – focusing on analyzing and modeling the single metrics instead of their correlations
 Make the model more straightforward – use classical regression models instead of convolutional LSTM autoencoder
 Utilize the minimal anomaly detection experience for the anomaly score calculation. This time we used a threshold calculation based on the expected number of abnormal time points instead of labeled by experience time points, making the approach more unsupervised
We decided to save the core concept of the anomaly detection, which is based on the difference calculation between metrics observed and predicted values and then compare the residuals with the selected threshold.
We replaced the original model with LightGBM gradient boosting.
The analysed number of metrics was about 140 and we expected the same number of gradient boosting models to train and evaluate. However, after observations and analysis we decided to group metrics by metrics type and data sparsity.
Since not all the metrics have the same value every time (for example the TPS value with status code 404 is not mentioned every minute because in most cases it is abnormal) and some services are stopped and are being run only during a short time period. Sparsity is the ratio between defined and undefined values of metrics along with the analysed time periods.
So by applying this grouping the number of models decreased from 140 to 6:
 Latency with low sparsity
 Latency with high sparsity
 Traffic TPS with status code 200 and low sparsity
 Traffic TPS with status code 200 and high sparsity
 Traffic TPS with status code not 200 and low sparsity
 Traffic TPS with status code not 200 and high sparsity
Data samples were produced via metric values with the minute shift lagging. The target of the model is current time metric value. Moreover, we included “metric” as a categorical feature.
LightGBM models were trained with the Google AI Platform resources. The basic analysis showed that the significant number of dramatic spikes affected the metric time series distribution, so we decided to apply scaling without analytical outlier exclusion.
After the hyperparameters were tuned with the Bayesian optimization, we determined the importance of different features for various models. However, they were found to be similar with lagging in time shifts with 110 minutes of accuracy. Depending on the level of the sparsity of the metric time series, we got an R2 score ranging from 0.25 – 0.9.
Our next step was to try a nonlinear type of the data transformation. We applied the Yeo Johnson transformation (I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954959, (2000)) with scikitlearn implementation (PowerTransformer).
The new R2 score values range became 0.75 – 0.97.
As noted above, we changed the anomaly score calculation together with the model design. Our new approach was based on the anomaly score and threshold calculation without any existing classified experience in anomaly detection.
Its algorithm includes a moving Zscore calculation as the anomaly score for the residuals (difference between observed and predicted metric values at the analysed time point) and using a moving average and standard deviation for a selected window:
 Calculate residual matrix for specified time segment and scales:
$$
R_t=absolute(M^t \hat{M^t}),
$$
where $\hat{M^t}$ predicted metric value 
Calculate moving average:
$$
\overline{R}_ {t,w}=\frac{1}{w} \sum^{t1}_ {d=tw} R_d
$$ 
Calculate moving standard deviation
$$
St,w= \sqrt{\frac{1}{w}\sum^{t1}_ {d=tw}(\overline{R}_ dR_{t,w})^2}
$$ 
Calculate absolute moving Zscore
$$
Z_ {t,w} = \frac {R_ {t} \overline{R} _ {t,w}}{S_ {t,w}}
$$
Time segment is “abnormal” (has anomalies) if the anomaly score (moving Zscore) is $Z_ {t,w} > τ $, where $τ$ is the calculated threshold. Using the observations in the history dataset, we estimated the assumed number of abnormal time points for the analysed time period and consumed its percent as a parameter for the threshold calculation for each metric.
The threshold was calculated as a quantile by the Zscore distribution and specified parameter. Therefore, if the assumed number of anomalies is 0.01% then = quantile(0.99). Quantile was applied for three types of parameters related to: minor, major, and critical severities of anomalies.
This approach was integrated with the realtime data ingestion, preprocessing, and online prediction with AI Platform, where trained models were deployed, as well as realtime inference to calculate the anomaly scores and anomaly verification for each metric.
Anomalies were being detected with increased precision and recalled on the last week of observations:
 Precision ~ 0.6
 Recall ~ 0.85
 F1 score ~ 0.7
We still had many false positives, but measures were subjective and depended on the threshold and visual interpretation of single anomalies.
At the same time, this approach was successful and new metric types were onboarded:
 Applications and middleware components metric (including their throughput, latency)
 System metrics of virtual machines (CPU, Memory, Disk IO, JVM metrics, NodeJS metrics, etc)
 Application error log metrics (number of application errors and/or exceptions per minute
The total number of metrics increased from 140 to 880 and the number of LightGBM models grew from 6 to 13.
We gathered the following issues from the second approach:
 Occasionally, there are long term changes in metric value distributions. It occurs because the implementation or external conditions of services are changed between releases. In this case, our solution handles these changes as anomalies
 When distributions change, the anomaly score threshold becomes non actual
 New metric types or sparsity require new models

It is required to have an additional filter for the abnormal time points to decrease the noise (false positives)
Some models were retrained with the new historical sets without any changes in factor saving performance. However, we can observe a restriction in models retraining procedure automation because there was a portion of models that required new feature selection iteration.
Threshold recalculation is an easier process – using a new distribution for a recent time frame means that Zscores are recalculated by a scheduled job. Quantile parameters are selected manually or calculated as statistics applied to recent time frames.
While utilizing the appropriate anomaly detection models and parameters for a single metric, we still strived to work on improving their interpretation. So we decided to resolve it on the alerting phase of the anomaly detection pipeline and we hoped to decrease the noise in the same way.
7. From single time series anomaly detection to multivariate metrics alerts
7.1 Alert business rules
The initial approach provides mathematical rules for calculating statistics by the time series created from abnormal time points of various metrics. We analysed a number of abnormal time points of various severity (minor, major, and critical) and various metrics of the same domain on time frames where these numbers produced significant concentrations:
Using the results of the analyses we constructed the rules and aggregation of their thresholds. Here is an example of a business rule represented by the developed DSL:
{
"and": [
{
"ge": 2,
"lin": [
{
"agg": "minor_event_counter"
},
0.022,
{
"agg": "major_event_counter"
},
0.044,
{
"agg": "critical_event_counter>"
},
0.133
]
}
]
}
It corresponds to:
0.022 * minor_event_counter +
0.044 * major_event_counter +
0.133 * critical_event_counter >= 2
“Minor_”, “major_”, and “critical” event counters are a number of abnormal time points for all the metrics of the selected domain, specified by the severity and time frame.
Time frame is selected in relation to about 5 – 7 minutes of the current time and length and these rule calculations are covered by “alert” streaming jobs.
7.2 Unsupervised alerts based on clustering
Another approach was designed to replace manual business rule construction with machine learning methods.
Firstly, we applied the correlation analysis to all the domain metrics where the abnormal points were handled. Correlation by Pearson in the 0.7 threshold was used to select about 32 related metrics from a domain with the cardinality of about 150.
For these metrics we calculated the moving average aggregations by the window size equal to 5 minutes of applying weights for time points by the specified severity:
 If time point is not abnormal then weight equals 1
 If an abnormal time point has minor severity then weight equals 2
 If an abnormal time point has major severity then weight equals 3
 If an abnormal time point has critical severity then weight equals 4
We then proceeded to cut out all the samples prepared on the train along with the test splits and applied the Gaussian Mixture Model to clusterize the train split to two sets.
We assumed that the clusterization of latent criterion is essential and can impact the anomaly. We also assumed that smaller clusters are more important because most of the time the components work without any anomaly. Overall, we extracted “alert” and “no alert” clusters.
We gathered the following dataset statistics:
 Number of minutes in the dataset = 20160
 Number of anomalous windows = 6012
 Number of “alert predicted” windows = 162
 Percent of “alert” windows with respect to all minutes ≈ 0.80%
 Percent of “alert” windows with respect to anomalous windows ≈ 2.69%
The number of anomalies of each severity in every window of a cluster are:
No anomally  MINOR  MAJOR  CRITICAL  
NO ALERT  156.174701  1.039316  0.779145  2.491453 
ALERT  140.783951  0.561728  0.382716  18.555556 
Cluster distribution is represented as a PCA (Principal Components Analysis) graph with the variation ratio sum 0.51:
The bar chart below shows all anomaly average quantities for both classes:
These clustering results are converted into the anomaly compositions for alerts:
The trained GMM model is deployed to the ML Platform. The streaming job to create and send alerts covers this model and serves as a source of online predictions to verify and send an alert of anomaly composition.
8. From gradient boosting to neocortex models
As it was described above, the model retrain and selection of features are required for distribution changes or new metrics onboarding. Therefore, we started to search for a realtime training approach, keeping in mind the fact that the anomaly score threshold calculation is not flexible because it requires a manual selection of quantile parameters.
We settled on a method that is based on the Hierarchical Temporal Memory (HTM) model online training and inference [Unsupervised realtime anomaly detection for streaming data]. HTM is a framework that describes a biological machine learning neural network that models neocortex region structure and behaviour. HTM data processing is described in the following graph:
It is described as:
$x_t$ is HTM input, it is a metric value at time point $t$. It goes through Encoder – an algorithm to create SDR for $x_t$. SDR is a sparse distributed representation of data as a binary vector.
Spatial Pooling is the process of translating semantic information from one binary space to another. It translates semantic information projected across an input space into a space occupied by the minicolumns. A minicolumn is a group of pyramidal neurons sharing proximal input. The inhibitory neurons group pyramidal neurons into the minicolumns in the neocortex.
So $a(x_t)$ is still a binary vector and an output of the spatial pooler.
The sequence memory models temporal patterns is in $a(x_t)$ and provides a prediction $\pi(x_t)$for $a(x_ {t+1}). \pi (x_t)$ it is also a sparse vector.
The HTM model is trained in realtime, processing every new time point and metric value. However, it is possible to stop it temporarily to control the learned patterns and then resume the training again.
The anomaly detection with the HTM model consists of the prediction error and the anomaly likelihood calculations:
Prediction error is calculated with the following equation: $$
S_ t:=1 \frac{ a(x_t)\pi (x_ {t1})}{a(x_ t)},
$$
where vector $a(x_t)$ and its prediction vector $\pi(x_{t1})$ have the same lengths, $a(x_t)\pi(x_{t1})$ is scalar product, $a(x_t)$ – number of “one” bits. So the prediction error represents the ratio between the numbers of neurons in active and predictive states.
$S_t == 0\hspace{3mm}$if $\hspace{3mm}a(x_t)\hspace{3mm}$ is matched to its prediction and $\hspace{3mm}S_t == 1 \hspace{3mm}$ if $\hspace{3mm}a(x_t)\hspace{3mm}$and $\hspace{3mm}\pi(x_{t1})\hspace{3mm}$ do not have common bits. It means that St shows the quality of prediction. Prediction error is used for the instantaneous predictions and anomaly detection at the current time points (for example a temporal one minute spike). At the same time, if we use prediction errors as anomaly scores for noisy metric values, we can observe many false positives. The anomaly likelihood calculation and thresholding leads the anomaly detection on a continuous time frame and decreases the number of false positives.
The anomaly likelihood is estimated with probability for prediction errors distribution:
$$
L_t=1 – Q(\frac{\overline\mu_t \mu_t }{\sigma_t}), $$
where
$\mu_t=\frac{\sum^{w1}_ {i=o} S_ {t1}}{W}$ – sample mean for last Wprediction errors window,
$\overline\mu_t = \frac{\sum^{W’1}_ {i=0} S _ {t1}}{W’}$ short term moving average for window $W’ll W$,
$\sigma^2_ t = \frac{\sum^{W1}_ {i=0} (S_{ti}\mu_t)^2} {W1}$ – prediction errors variance for window $W$,
$Q(x)$ – Gaussian tail probability function.
Therefore, the anomaly detection threshold is based on a really small parameter $\varepsilon>0$. and is detected at the time point $t$ if anomaly likelihood $L_t\geq1 – \varepsilon$. We used the Nupic library [Numenta Platform for Intelligent Computing] with HTM framework implementation along with encoders and anomaly detection inference to integrate the likelihood calculation based on the anomaly detection algorithm of a related streaming job.
Our experiments showed that it is enough to tune $\varepsilon>0$ once. Every new HTM model is added in the runtime as an inmemory object of the REST API application deployed to a VM instance. The model is managed in runtime and It is possible to stop and resume the training. Moreover, the models are persisted to the file system by a configured schedule.
The HTM approach showed that false positives and false negatives decreased:
 Precision ~ 0.9
 Recall ~ 0.92
 F1 score ~ 0.909
Additionally, we observed an improvement of anomaly interpretation with visualization:
 Number of DB connections:
 CPU load:
 Number of error logs per minute:
Conclusion
In this article we demonstrated a solution for realtime anomaly detection through the component metrics aggregated in time series.
It showed a realtime unsupervised anomaly detection solution evolution from #a to #c approaches:
 Multivariate metrics and components deep learning model based on convolutional LSTM autoencoder
 Decomposition on univariate Gradient Boosting models
 Neocortex Neural Networks and multivariate anomaly alerts composition
We concluded that the last approach based on Neocortex Neural Network (HTM) is more appropriate for realtime anomaly detection for the following reasons:
 Realtime data ingest and realtime anomaly detection
 Models realtime training and adoption
 New metrics onboarding for anomaly detection in runtime
 Anomaly detection in a continuous time frame
 Anomaly score thresholds do not change over time
This approach was successfully integrated as a production support tool and workflow for issues detection and root cause analysis in retailer ecommerce infrastructure with Google Cloud services: Google Compute Engine, Google Cloud Storage, Google AI Platform, StackDriver Monitoring, and Logging.
Our further work in this area will be to extend and adopt the solution’s implementation for other cloud data platforms including Amazon AWS service, Microsoft Azure as well as applying it for data quality cases.
References
 A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data https://arxiv.org/abs/1811.08055
 Unsupervised realtime anomaly detection for streaming data https://www.sciencedirect.com/science/article/pii/S0925231217309864
 Gaussian Mixture Models https://scikitlearn.org/stable/modules/mixture.html#gmm
 J. Hawkins, S. Ahmad Why neurons have thousands of synapses, a theory of sequence memory in neocortex Front. Neural Circuits., 10 (2016), pp. 113, 10.3389/fncir.2016.00023
 Numenta Platform for Intelligent Computing https://github.com/numenta/nupic
 Streaming Programming Guide https://spark.apache.org/docs/latest/streamingprogrammingguide.html