Add anomaly detection to your data with Grid Dynamics starter kit

Alex Rodin

Michael Ionkin

Jul 15, 2020 • 19 min read

Table of Contents

Overview
Design
Anomaly detection KPIs
Tools
Use cases
Real-time machine learning
Conclusion
References

Modern businesses in e-commerce, banking, insurance, manufacturing and many other areas generate huge amounts of data every day. Effectively utilizing that data helps increase business value by making the right decisions at the right time based on insights generated from targeted data analytics.

In this article we describe our real-time, cloud-based Anomaly Detection Solution. We cover its design and its applicability to the most common use cases: monitoring, root cause analysis, data quality, and intelligent alerting. The solution is AI-driven and implements a flexible approach based on extracting normal-state and behaviour patterns rather than relying on purely statistical methods. It can therefore catch not only sudden outliers but also changes in the distribution of very noisy data.

The solution is configurable for various types of components and services and detects abnormal behaviour in networks of connected services and in collected data profiles. It visualizes where anomalies occur, sends intelligent notifications based on them, and provides tools for further deep analysis and root cause analysis (RCA).

Overview

Most application systems consist of multiple services and middleware components. This includes databases, queues, search engines, storage, identity services, caches, and in-memory data grids. They also include multiple stateful or stateless microservices and mobile application proxies — all connected by data and processing flows.

Each component produces system and application logs and usually collects system and application metrics through metric collection agents. In addition, data itself may contain important information that represents the state or behaviour of a service, process, or user. All of these can be aggregated and served as data sources for future analysis and real-time anomaly detection using the Anomaly Detection Solution.

The solution covers the following use cases:

Monitoring and root cause analysis – The solution features the Anomaly detection dashboard, which enables users to monitor the state of all components in real-time and provides functionality to build dependencies graphs. It displays services as nodes, and logical data workflows or functional relations between them as graph edges. Each service can have several metrics and graphs can show states of services for specified time points and histories. Thus graphs can be used to visualize an anomaly detection domain, a “degree of destruction” for subsystems if an issue was raised, or an impact of each component on one another via anomalies. There is the additional option to use an Anomaly Statistic chart. This highlights anomaly impacts in two dimensions: abnormal metrics count at specified time, and anomalies density. So the graphs and the statistics charts are good starting points for deep root cause analysis.
Data quality anomaly detection – A data profile is a collected statistic or an informative summary of the state of data, for example summary information about user feedback, system transactions, or access actions. The solution provides mechanisms for detecting anomalies in data profiles that are onboarded to the anomaly detection pipeline as metrics. The only things the customer needs to do are create an ETL process that transfers data from its source to Amazon Kinesis and define an ingest configuration. After that, ingested data is available for anomaly detection, and by creating a dependencies graph between data domains and their metrics, it is possible to monitor anomalies in data quality accurately.
Intelligent alerting – Alerting is implemented via emails that are configured in the Anomaly detection dashboard. Alert configurations can be created for one or more anomalies graphs. Each email contains information about abnormal metrics, time points, and anomaly distribution charts.

An alert is a higher-level entity than a single anomaly. It is essentially a group of anomalies whose number and density are determined by business rules set via the dashboard.

Experiments have been run with the current solution based on real-world data. They showed a decrease in false positives (non-important abnormal cases) as well as false negatives (missed important abnormal cases). As an example, the following are evaluation metric values achieved for real-world e-commerce domain service anomaly detections:

Precision ~ 0.9
Recall ~ 0.92
F1 score ~ 0.909
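As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported value can be reproduced from the two figures above. The following is a minimal sketch; the function name is illustrative and not part of the solution.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for the e-commerce domain.
f1 = f1_score(0.9, 0.92)
print(f"F1 = {f1:.3f}")
```

The result is approximately 0.91, which agrees with the reported F1 score up to rounding.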

Measuring the classification metrics, which provide information about the anomaly detection quality, requires significant effort because a huge volume of metrics and time points must be analyzed and marked as normal or abnormal. The solution proposes to analyse only abnormal cases and related KPIs (see Anomaly detection KPIs). It uses the solution labeling tool to collect user experiences in the anomaly importance verification process.

The solution is part of Grid Dynamics’ Analytics Platform. It consumes AWS services including Amazon S3, RDS, EKS, AWS Cloud Map, Secrets Manager, CloudFormation, Kinesis, Lambda, EMR, SageMaker, etc., and has been published to AWS Marketplace.

The current solution is an implementation of the Neocortex Neural Networks approach described in the Grid Dynamics Insights white paper [1], built with Amazon Cloud Services. It integrates the NuPIC HTM framework [4], and anomaly detection is based on a likelihood calculation over the distribution of prediction errors (anomaly scores).

The solution provides complete implementation of a real-time anomaly detection workflow, including its deployment and scaling, external data onboarding, configuration, and tuning of ML models for specific tasks and domains.

The following features are available:

A metrics and anomalies generator to simulate data for tools study and enterprise cases.
Service dependencies graph design (Anomalies Graph).
Real world services metrics data ingest for anomaly detection via streaming channel.
Real-time anomaly detection.
Time series, anomalies, ML models behaviour visualization.
Abnormal behaviour in the Anomalies Graph visualization.
Real-time alerting to send notifications about abnormal cases to recipients based on designed business rules.
Labeling tool to collect user experience in anomalies validation for further analysis, ML models tuning, alerting with machine learning approach.

Let’s now walk through the solution design.

Design

Anomaly detection workflow

The solution implements an anomaly detection workflow. It covers the end-to-end process from data collection through to information delivery on abnormal behaviours.

The first step occurs in the data domain. It consists of application, system, and analytics metrics values collected from monitoring of services and middleware components, data quality profiles, and logs. This data is pushed to one of the available channels: an Elasticsearch index or streaming.

The Elasticsearch index is used for showcase purposes but at the same time demonstrates the capability to ingest data from data sources connected via the HTTP protocol. The streaming channel is ready for enterprise tasks and is a datasource for real-world customers’ data onboarding to the anomaly detection ecosystem.

The next step is data processing. It prepares ingested data as input for ML model training and inference in real time. Anomaly detection inference is an ML model prediction in which anomaly scores are calculated and converted into a likelihood (the probability of an abnormal case), which is then compared with a threshold.

The last step is anomaly visualization and notifications to recipients. It includes signal distribution and anomaly statistics charts as well as anomaly dependencies graphs.

The anomaly detection pipeline includes a data workflow. It consists of cells where each cell is a combination of the data input pin, configuration pin, data processing unit, and data output pin.

The data processing unit contains algorithms for one or more anomaly detection workflow steps. The data processing configuration is a set of parameters for data processing algorithms. The following diagram shows the processing units and transitions for the solution data workflow:

Each step is configured with settings stored in the relational DB. Data is persisted to and fetched for the next steps from Elasticsearch indices. The root point of the workflow is the ETL job implemented outside the solution by the customer to onboard real-world data to the anomaly detection ecosystem.

The anomaly detection core is the HTM model together with its inference and the anomaly likelihood calculation logic. The Neocortex Neural Network and NuPIC HTM framework [4] integration is described in the Real-time machine learning section, where we describe the ML models’ life cycle, their deployment units, and helper processes. In brief:

Each analyzed metric has its own HTM model.
All models are hosted and run outside the data processing unit.
The data processing unit invokes the ML model endpoint to calculate anomaly scores and likelihoods.
Anomaly detection is the logic that decides whether the current behaviour of a signal over the analyzed time period is abnormal or not.

Deployment

Grid Dynamics designed and published the solution for the Data Analytics Platform. The platform provides a number of components deployed using Amazon Cloud Services to implement data and machine learning pipelines. The supported pipelines cover various business areas and consume data ingest, batch, and streaming data processing algorithms, etc.

Machine learning tasks are represented by computer vision, programmatic marketing, data quality, and anomaly detection. The anomaly detection case uses ADP components that provide the toolset, services, and resources for real-time processing and streaming. The following diagram represents the ADP components consumed by the anomaly detection elements.

A copy of the anomaly detection solution can be run as an ADP use case or as a single item via AWS Marketplace. It is a “one-click” deployment that includes configuration of the ADP and anomaly detection parameters.

All anomaly detection resources and components are registered in AWS Cloud Map for internal system discovery and for user access. All credentials are generated at deployment time and are available via AWS Secrets Manager.

An AWS account allows the user to access and use all components of the solution. Only the Anomaly detection dashboard site is available via the public network (internet). Access to the site is protected by credentials that are created automatically during the deployment phase. All other components are available from the AWS VPCs or via port forwarding with Kubernetes CLIs from a local desktop or laptop. Grafana is also available from the internet in “viewer” mode, proxied by the Anomaly detection dashboard.

HTM models are hosted and run on an Amazon SageMaker endpoint. One model version is hosted by only one endpoint instance. The current solution deployment provides only one SageMaker endpoint for all HTM models, and the solution design allows only one instance per endpoint. The default instance type is ml.m4.xlarge, which can host a maximum of about 50 HTM models. It is possible to increase the number of models by scaling up the SageMaker endpoint instance type or by scaling out the number of SageMaker endpoints.
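The capacity figures above translate into a simple sizing estimate. The sketch below assumes one active HTM model per metric and treats the limit of about 50 models per endpoint as a fixed constant. The function name and the constant are illustrative, and real capacity depends on the instance type and model configuration.

```python
import math

# Approximate limit quoted above for the default ml.m4.xlarge instance.
MODELS_PER_ENDPOINT = 50


def endpoints_needed(metric_count: int,
                     models_per_endpoint: int = MODELS_PER_ENDPOINT) -> int:
    """Estimate how many endpoints are needed to host one model per metric."""
    if metric_count < 0 or models_per_endpoint <= 0:
        raise ValueError("counts must be non-negative and capacity positive")
    return math.ceil(metric_count / models_per_endpoint)


print(endpoints_needed(120))
```

For 120 metrics the estimate is three endpoints with the default instance type.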

Anomaly detection KPIs

Measuring the classification metrics that provide information about the quality of the anomaly detection process requires significant effort, because a huge volume of metrics and time points needs to be analyzed and marked as normal or abnormal.

Playing with various real-world domain data and simulations, we concluded that the true positives metric works sufficiently well. At the same time, the false positives metric should be decreased to avoid or decrease the number of noise alerts.

The labeling tool allows the user to mark detected abnormal cases as important or not important. The precision metric is calculated from the labeled data:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

where:

TP = True Positives = number of important abnormal cases,
FP = False positives = number of handled abnormal cases that aren’t important.

The labeling sparsity (LS) is a measure of labeling coverage:

$$
LS = \frac{\text{Labeled anomalies}}{\text{Total anomalies}}
$$

So the precision and labeling sparsity allow for the construction of KPIs for the anomaly detection quality:

LS >= 0.5
Precision >= 0.9

The first KPI is achieved by spending effort in the labeling tool; sufficient labeling coverage is required for the precision calculation to be relevant. The expected precision is achieved by tuning the likelihood threshold and, in some cases, by onboarding more relevant metrics to the anomaly detection. For example, when garbage collector metrics show changes in their signal distribution that lead to anomalies but the application is not impacted, those metrics should be replaced by more relevant system or application metrics such as heap size, available memory, latency, or throughput.
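The two KPIs can be computed directly from the answers collected by the labeling tool. The sketch below is only an illustration: it assumes each detected anomaly carries a label of `True` (important), `False` (not important), or `None` (not yet labeled), and the function and key names are hypothetical.

```python
from typing import Iterable, Optional


def labeling_kpis(labels: Iterable[Optional[bool]]) -> dict:
    """Compute precision and labeling sparsity (LS) for detected anomalies."""
    labels = list(labels)
    labeled = [label for label in labels if label is not None]
    tp = sum(1 for label in labeled if label)   # important anomalies
    fp = len(labeled) - tp                      # anomalies marked not important
    ls = len(labeled) / len(labels) if labels else 0.0
    precision = tp / (tp + fp) if labeled else None
    return {
        "precision": precision,
        "ls": ls,
        # KPI targets from the text: LS >= 0.5 and Precision >= 0.9.
        "kpi_met": precision is not None and ls >= 0.5 and precision >= 0.9,
    }


# 20 detected anomalies: 9 important, 1 not important, 10 unlabeled.
kpis = labeling_kpis([True] * 9 + [False] + [None] * 10)
print(kpis)
```

In this example half of the anomalies are labeled and nine of the ten labeled ones are important, so both targets are just met.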

In cases where it is not possible to achieve an appropriate precision value in the ways listed above, tuning must be done in the alerts configuration step. There are two ways to do this:

The graph anomalies business rule – this allows users to configure manual business rules to decrease the number of noise alerts.
AI Alerting – this is an automatically trained, evaluated, and deployed machine learning model that classifies anomalies to be sent as alerts using the user experience collected by the labeling tool.

Tools

The solution internal processes are not visible for an end-user but it is possible to provide or modify their configuration, onboard data, and visualize processes behaviour and output. All these things are available via the solution tools:

Anomaly detection dashboard
Grafana dashboards

Anomaly detection dashboard

The Anomaly detection dashboard web application is the solution entrypoint and management portal. It provides tools to configure data ingest, new domains onboarding, ML models, and anomaly detection parameters.

The tool is designed to minimize the data science effort in ML models design and tuning. All model hyperparameters are configured with default values that are appropriate for most cases. It means that any analyst or production engineer can start to work with anomaly detection. At the same time the solution allows for tuning of the hyperparameters by creating a new version of the model and validating anomaly detection in real-time for new parameter values.

There are two types of hyperparameters:

SDR encoder parameters – encoders transform metric values into sparse distributed representations (SDR). SDR is the core binary representation of data for the Neocortex model components sequence: input, encoder, spatial pooler, sequence memory, prediction.
Likelihood calculation parameters – likelihood is based on the Gaussian tail distribution function calculated for moving Z-scores of prediction errors obtained from HTM model neuron state statistics. See [2] for details.
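To make the likelihood calculation more concrete, the sketch below follows the formulation in [2]: a short-term moving average of anomaly scores is compared with the mean and standard deviation of a longer history window, and the Gaussian tail function turns the resulting Z-score into a likelihood. This is a simplified illustration, not the NuPIC implementation; the function and parameter names (`short_window`, `min_std`) are assumptions.

```python
import math
from statistics import mean, pstdev


def gaussian_tail(x: float) -> float:
    """Q-function: P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def anomaly_likelihood(scores, short_window=10, min_std=1e-6):
    """Likelihood that the most recent anomaly scores are unusual.

    `scores` is the history window of anomaly scores (prediction errors),
    ordered from oldest to newest.
    """
    history_mean = mean(scores)
    history_std = max(pstdev(scores), min_std)  # guard against zero variance
    recent_mean = mean(scores[-short_window:])
    z_score = (recent_mean - history_mean) / history_std
    return 1.0 - gaussian_tail(z_score)


steady = anomaly_likelihood([0.1] * 100)
burst = anomaly_likelihood([0.1] * 90 + [0.9] * 10)
print(steady, burst)
```

A steady signal yields a likelihood of 0.5, while a burst of high scores at the end of the window pushes the likelihood close to 1, which is the value that is compared with the configured threshold.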

It is possible to manage the training and detection processes using Train and Detect flags. The Detect flag enables HTM model prediction and anomaly detection logic. The Train flag enables HTM model learning (training). The HTM model is trained in real-time. Each new metric value is used for model training and for model inference (anomaly detection). So it is possible to suspend training if a user makes a decision that a signal pattern has been defined and all future changes of a signal behaviour can be explained as abnormal. Conversely, if the user needs to continue learning to define new pattern properties, then it is possible to resume training using this flag.

The Anomaly detection dashboard configures notifications about anomalies as emails sent to recipients such as production and support engineers or customers. It also provides a labeling tool to collect user experiences in anomaly validation.

The Anomaly detection dashboard is an anomaly monitoring panel. It visualizes anomaly statistics and service dependencies graphs to assist in Root Cause Analysis (RCA). It highlights anomalies and anomaly graph states in real-time or may be used for historical anomaly analysis.

It also provides a Metric Generator, which allows for the building of a simulation of user services and metrics, or trying various ingest channels and anomalies. This essentially allows users to trial a wide range of metrics and services to gain a comprehensive picture of the solution and tools.

Grafana dashboards

Grafana is used by thousands of companies to monitor everything from infrastructure like power plants to applications and beehives. The anomaly detection solution provides Grafana dashboards to visualize metrics time series, anomalies highlights, ML models behaviour as distribution of anomaly scores and likelihood.

Grafana dashboards are referenced by metrics links visualized in Anomalies graph in the Anomaly detection dashboard.

Now it is time to describe anomaly detection use-cases covered by the solution implementation.

Use cases

Monitoring and root cause analysis

The Anomaly detection dashboard contains a predefined anomalies graph “Showcase” built with simulated metrics and services. Users can modify or create new graphs to run simulations with real-world components and data.

Anomalies graph is a services/components/subsystems domain dependencies chart. It displays services as nodes and logical data workflow or functional relations between them as graph edges. All nodes are “green” in the normal state. But if the solution detects an abnormal case for a metric then its service will be shown in “red”.

The graph shows states for specified time points and stores their histories. It means that by browsing time points (via calendar control), users can obtain current and historical states of the graph.

If a user clicks on any node of the graph, a list of all its metrics available for the specified time point is shown. Metrics with anomalies at the selected time point are shown in red.

It is possible to enable auto refresh to show the current state of the graph every minute.

Any graph plays several roles:

It visualizes an anomaly detection domain.
It visualizes a “degree of destruction” for the subsystem if an issue was raised.
It visualizes impacts on components by other components via anomalies. So it is the starting point of deep root cause analysis (RCA).

The Showcase anomalies graph demonstrates the high-level design of the current solution and simulates real-world domains. Each component of the architecture contains a list of simulated metrics. For example, the Airflow service contains the CPU Load and Available Memory metrics.

On the screen below, the abnormal state of the graph is shown:

Three related services are in an abnormal state: Airflow, Sagemaker Model Endpoint, and Dashboard API. The Airflow CPU Load metric has an anomaly, the Sagemaker Model Endpoint has an abnormal case for the Latency metric, and the Available Memory metric is abnormal for Dashboard API.

So it is possible to create the following hypothesis for the root cause analysis (RCA):

The Sagemaker Model Endpoint service has some issues that impacted its latency. For example, some HTM model training cycles require a large amount of time to complete.
Airflow DAG’s multithreaded or parallel tasks invoked model training and inference. But their execution required a very large time period because Sagemaker endpoint latency was dramatically increased. So the first portion of parallel tasks was not completed and the next portion of tasks was started in the same time range. So CPU load was increased.
It also seems that Dashboard API requests that invoked Sagemaker Endpoint had delays and did not free up service memory in time. In addition, throughput was impacted so in this scenario it appears that users should wait for the next time point.
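The graph state behind this scenario can be modelled with a small data structure. The sketch below is illustrative only: the class, the edges, the timestamp, and the extra healthy Elasticsearch node are assumptions, while the abnormal services and metrics are the ones named above.

```python
from dataclasses import dataclass, field


@dataclass
class AnomaliesGraph:
    """Services as nodes, relations as edges, anomalies keyed by time point."""
    metrics: dict   # service name -> list of metric names
    edges: list     # (source service, target service) pairs
    anomalies: set = field(default_factory=set)  # (service, metric, time point)

    def abnormal_metrics(self, service: str, time_point: str) -> list:
        """Metrics of a service that have an anomaly at the given time point."""
        return [metric for metric in self.metrics[service]
                if (service, metric, time_point) in self.anomalies]

    def node_states(self, time_point: str) -> dict:
        """A node is red if any of its metrics is abnormal, otherwise green."""
        return {service: "red" if self.abnormal_metrics(service, time_point) else "green"
                for service in self.metrics}


T = "2020-04-21T08:23:00Z"  # hypothetical time point
graph = AnomaliesGraph(
    metrics={
        "Airflow": ["CPU Load", "Available Memory"],
        "Sagemaker Model Endpoint": ["Latency"],
        "Dashboard API": ["Available Memory"],
        "Elasticsearch": ["Latency"],  # hypothetical healthy node
    },
    edges=[("Airflow", "Sagemaker Model Endpoint"),
           ("Dashboard API", "Sagemaker Model Endpoint")],
    anomalies={
        ("Airflow", "CPU Load", T),
        ("Sagemaker Model Endpoint", "Latency", T),
        ("Dashboard API", "Available Memory", T),
    },
)
states = graph.node_states(T)
print(states)
```

Querying the same graph for a different time point returns all nodes as green, which corresponds to browsing the graph history with the calendar control.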

The Anomaly detection dashboard also contains the Statistic chart. It is a very useful component of the dashboard because it highlights anomaly impacts across two dimensions:

Abnormal metrics count at a specified time – if there are many abnormal cases at observed time points then the number of abnormal metrics will be increased, which is a cause for concern.
Anomalies density – this refers to the density of abnormal cases during a selected time period. If users regularly observe anomaly cases then the system will likely have long term issues or down time.
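The two dimensions of the Statistic chart can be sketched as simple aggregations over detected anomalies. The article does not give an exact formula for density, so the definition used here (the share of time points in the selected period that contain at least one anomaly) is an assumption, as are the function names.

```python
from collections import defaultdict


def abnormal_metric_counts(anomalies):
    """Map each time point to the number of distinct abnormal metrics.

    `anomalies` is a list of (service, metric, time_point) tuples.
    """
    per_point = defaultdict(set)
    for service, metric, time_point in anomalies:
        per_point[time_point].add((service, metric))
    return {time_point: len(found) for time_point, found in per_point.items()}


def anomaly_density(anomalies, time_points):
    """Assumed definition: fraction of time points with at least one anomaly."""
    time_points = list(time_points)
    if not time_points:
        return 0.0
    abnormal_points = {time_point for _, _, time_point in anomalies}
    hits = sum(1 for time_point in time_points if time_point in abnormal_points)
    return hits / len(time_points)


detected = [
    ("Airflow", "CPU Load", "08:23"),
    ("Dashboard API", "Available Memory", "08:23"),
    ("Airflow", "CPU Load", "08:25"),
]
counts = abnormal_metric_counts(detected)
density = anomaly_density(detected, ["08:22", "08:23", "08:24", "08:25"])
print(counts, density)
```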

Data quality anomaly detection

This section describes how the solution is applied to a real-world case: anomaly detection in a data quality profile. The customer business system collects data, which is then split across several data domains:

Feedbacks
Transactions
Access

Each domain has the same list of data quality metrics:

Number of empty values – empty values indicate information is missing from the data set.
Dark amount – how much information is unusable due to data quality problems?
Time-to-value – how long does it take for the business to derive value from the information?

Data profiles are also described by data business markers:

Pos state ratio – how many records have “some positive business” state? For example, the number of completed transactions or positive feedbacks.
Neg state ratio – how many records have “some negative business” state? For example, the number of incomplete transactions or number of negative feedbacks.

We assume that data domains have dependencies. Access data metrics impact transaction and feedback domain metrics and feedback domain metrics impact transactions.

The first step in the anomaly detection process is data ingestion from a client data source into the anomaly detection ecosystem. The solution provides the Amazon Kinesis data stream configured for the client to transfer data.

Data records should have a valid JSON format with corresponding schema. Data example:


 [{
     "service_name": "Transactions",
     "metric_name": "Dark Amount",
     "metric_unit": "AMT",
     "time": "2020-04-21T08:23:00Z",
     "value": 25.0,
     "source": "Data Quality"
 }]

There are various ways to implement the ETL process for this step. For example, users can create an Airflow DAG that fetches data from the original data source and pushes records to the destination data stream using the Kinesis data stream hook.
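Whatever tool runs the ETL job, the core of the step is turning metric records into Kinesis entries. The sketch below is a minimal illustration under several assumptions: all six fields from the example are treated as required, each Kinesis record carries a JSON array with one metric object (mirroring the example), and the partition key scheme and function names are hypothetical. The `client` argument is expected to be a boto3 Kinesis client, whose `put_records` call accepts `StreamName` and `Records`.

```python
import json

# Fields taken from the example record; treating all of them as required
# is an assumption.
REQUIRED_FIELDS = ("service_name", "metric_name", "metric_unit",
                   "time", "value", "source")


def to_kinesis_entries(records):
    """Convert metric records into entries for the Kinesis PutRecords call."""
    entries = []
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        entries.append({
            # Payload framing mirrors the example: a JSON array with one object.
            "Data": json.dumps([record]).encode("utf-8"),
            # Hypothetical partition key: one key per service and metric.
            "PartitionKey": f'{record["service_name"]}/{record["metric_name"]}',
        })
    return entries


def push_metrics(client, stream_name, records):
    """Send records with a boto3 Kinesis client, e.g. boto3.client("kinesis")."""
    return client.put_records(StreamName=stream_name,
                              Records=to_kinesis_entries(records))


sample = {
    "service_name": "Transactions",
    "metric_name": "Dark Amount",
    "metric_unit": "AMT",
    "time": "2020-04-21T08:23:00Z",
    "value": 25.0,
    "source": "Data Quality",
}
entries = to_kinesis_entries([sample])
print(entries[0]["PartitionKey"])
```

Keeping the conversion separate from the network call makes the record format easy to check locally before any data is sent to the stream.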

As soon as the ETL job starts sending data to the solution data stream, a data ingest configuration must be added so that the data is onboarded to the anomaly detection pipeline. This configuration defines the data source, the partitioning of ingest tasks, and the filter criteria for data pushed to the data stream. It also defines the patterns used to build the metric names that represent the data on the solution side. It is managed by the Data Ingest Configuration tool provided in the Anomaly detection dashboard:

Ingested data is now available for anomaly detection. The time series for ingested data can be analyzed with the Grafana Anomalies dashboard to find minimal and maximal values or to get an initial understanding of the data distribution. Anomaly detection is enabled by configuring an HTM model for each metric. Its behaviour is visualized via the Grafana Anomalies dashboard:

With the default configuration and likelihood calculation logic, anomaly detection begins producing scores immediately; however, real likelihood values (greater than 0.5) are generated only after lkh_learning_period + lkh_estimation_samples (288 + 100 = 388) minutes.

By designing the dependencies graph between data domains and their metrics with the solution Graph Management tool, users can create a new panel to monitor data quality anomalies and visualize abnormal states for further RCA:

Alerts

Alerting is implemented with emails sent by the solution job, which is configured in the Anomaly detection dashboard. The SMTP server parameters and credentials are configured via AWS Cloud Map and AWS Secrets Manager.

Alert configuration is created for one or more anomaly graphs. Each email contains information about abnormal metrics, time points, and anomaly distribution charts as attachments. The alerts machinery calculates aggregations for metrics of those graphs.

The business rule is a logical expression that contains logical and comparison operations, with aggregation names as variables. The logical expression is constructed using a JSON-like DSL. Its schema is shown below:


{
 "$id": "https://www.griddynamics.com/ai4ops.alert.rule.schema.json",
 "$schema": "http://json-schema.org/draft-07/schema#",
 "title": "Alert Rule",
 "type": "object",
 "properties": {
   "and": {
     "$ref": "#/definitions/conditions"
   },
   "or": {
     "$ref": "#/definitions/conditions"
   }
 },
 "oneOf": [
   {"required": ["and"]},
   {"required": ["or"]}
 ],
 "definitions": {
   "condition": {
     "type": "object",
     "properties": {
       "agg": {"type": "string", "enum": ["event_counter", "metric_counter"]},
       "lt": {"type": "number"},
       "gt": {"type": "number"},
       "eq": {"type": "number"},
       "le": {"type": "number"},
       "ge": {"type": "number"},
       "ne": {"type": "number"}
     },
     "oneOf": [
       {"required": ["agg", "lt"]},
       {"required": ["agg", "gt"]},
       {"required": ["agg", "le"]},
       {"required": ["agg", "ge"]},
       {"required": ["agg", "eq"]},
       {"required": ["agg", "ne"]}
     ]
   },
   "conditions": {
     "type": "array",
     "minItems": 1,
     "items": [
       {
         "type": "object",
         "properties": {
           "and": {
             "$ref": "#/definitions/conditions"
           }
         }
       },
       {
         "type": "object",
         "properties": {
           "or": {
             "$ref": "#/definitions/conditions"
           }
         }
       },
       {
         "$ref": "#/definitions/condition"
       }
     ]
   }
 }
}

The schema defines two aggregation variables:

event_counter – the number of abnormal events during the analyzed time period.
metric_counter – the number of unique metrics in which abnormal cases were found during the analyzed time period.

For example:


{
 "and": [
   {
     "agg": "event_counter",
     "ge": 5
   }
 ]
}

This means that the logical expression is TRUE if the number of abnormal events during the analyzed time period is greater than or equal to 5. If the logical expression (rule) evaluates to TRUE, a notification is sent.
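To illustrate how such a rule could be evaluated, the sketch below walks the nested `and`/`or` structure and applies the comparison operators to the aggregation values. It is not the solution's actual evaluator; the function name and the error handling are assumptions based only on the schema above.

```python
import operator

# Comparison keys allowed by the schema, mapped to Python operators.
OPERATORS = {
    "lt": operator.lt, "gt": operator.gt, "eq": operator.eq,
    "le": operator.le, "ge": operator.ge, "ne": operator.ne,
}


def evaluate_rule(rule: dict, aggregations: dict) -> bool:
    """Evaluate an alert rule against aggregation values.

    `aggregations` maps "event_counter" and "metric_counter" to numbers.
    """
    if "and" in rule:
        return all(evaluate_rule(item, aggregations) for item in rule["and"])
    if "or" in rule:
        return any(evaluate_rule(item, aggregations) for item in rule["or"])
    comparisons = [(key, value) for key, value in rule.items() if key in OPERATORS]
    if "agg" not in rule or len(comparisons) != 1:
        raise ValueError(f"invalid condition: {rule}")
    name, threshold = comparisons[0]
    return OPERATORS[name](aggregations[rule["agg"]], threshold)


rule = {"and": [{"agg": "event_counter", "ge": 5}]}
fires = evaluate_rule(rule, {"event_counter": 5, "metric_counter": 1})
quiet = evaluate_rule(rule, {"event_counter": 4, "metric_counter": 1})
print(fires, quiet)
```

With five abnormal events the rule evaluates to TRUE and a notification would be sent; with four it does not.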

The user can change existing alerts or create new ones by filling out additional email and job properties:

The following is an example of the expected alert email content that recipients would receive:

User experience collection

The solution collects user experience in the anomaly validation process. This is achieved via the Labeling Tool, which asks each user (analyst, production engineer, data engineer, etc.) only one question about each anomaly in the analyzed time period: “Is the anomaly important?”. The users’ answers are then stored in an Elasticsearch index.

This experience can then be used for model and algorithm tuning, threshold configurations, and alerts design with supervised machine learning.

The Labeling Tool is available with the corresponding page in the dashboard for each configured anomalies graph.

It visualizes metrics distribution and abnormal places at selected time points. Users answer questions for all anomalies detected along the time period.

Real-time machine learning

Each analyzed metric must have its own HTM model. There can be several versions of the model for a metric with different hyperparameters but there can only be one active version of the model that performs predictions.

In the Anomaly Detection Solution the HTM model consists of two artifacts: a NuPIC HTM model and an appropriate NuPIC Anomaly Likelihood object. The lifecycle of the model is as follows. Right after the creation of the HTM model, it contains the raw NuPIC model and likelihood objects tuned with the specified hyperparameters and is ready for training and predictions.

As described in the Anomaly detection dashboard section, the HTM model is controlled by two flags: Train and Detect. After some time, a user can turn off model training, after which the model only makes predictions. The solution also periodically backs up the model to storage, regardless of which state the model is in.

A Spark Streaming job executes inference logic: every minute it sends data for an individual metric to an appropriate HTM model to generate an anomaly score and anomaly likelihood. If the received data was known for the model, then the anomaly score is zero. If the data was not known, then the score is one. Partially known data has a score between zero and one.

The anomaly likelihood shows the probability of anomaly behaviour in the context of historical changes of metric data. If the received anomaly likelihood is greater than the predefined likelihood threshold for this metric, then the data point is an anomaly.

The HTM model is hosted on a SageMaker Endpoint in a custom Docker container. One SageMaker model contains several HTM models. Each HTM model is loaded to RAM for the purposes of real-time training and inferences.

There are two ways for HTM models to be loaded to a SageMaker Endpoint. The first is to create a new model via a user request and the second is to import an already existing HTM model from the Amazon S3 bucket to a SageMaker Endpoint.

The creation process of a new HTM model is shown on the HTM model workflows diagram and labeled with the number 1. In this case, meta information about the model (including its hyperparameters and the parameters of the likelihood calculation) is persisted to the database. The anomaly likelihood object and a new NuPIC HTM model are then created in RAM and immediately persisted to the Amazon S3 bucket. This is necessary so that the new model can be imported later, for example if the user changes the model's endpoint.

The import functionality is implemented using Apache Airflow (the process is labeled with the number 2 on the workflows diagram). The Import DAG periodically checks whether the HTM models are loaded into the appropriate endpoints. It retrieves the mapping between the HTM models and SageMaker Endpoints from the database, then sends requests to the endpoints to determine whether any HTM model needs to be imported. If a SageMaker Endpoint does not contain a specified HTM model, the endpoint downloads it from the S3 bucket. The Import DAG also periodically checks the status of that process and sends a notification if any errors have occurred.
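The check performed by the Import DAG amounts to comparing the expected model-to-endpoint mapping with what each endpoint currently holds. The sketch below shows only that comparison in plain Python; the function name, the data shapes, and the endpoint and model identifiers are hypothetical, and the database queries, endpoint requests, and Airflow wiring are left out.

```python
def models_to_import(assignments: dict, loaded: dict) -> dict:
    """Find HTM models that are assigned to an endpoint but not loaded there.

    `assignments` maps endpoint name -> model ids expected (from the database).
    `loaded` maps endpoint name -> model ids currently reported by the endpoint.
    """
    missing = {}
    for endpoint, expected in assignments.items():
        absent = set(expected) - set(loaded.get(endpoint, ()))
        if absent:
            missing[endpoint] = sorted(absent)
    return missing


# Hypothetical endpoint and model identifiers.
pending = models_to_import(
    assignments={"endpoint-a": ["cpu-load-v1", "latency-v2"],
                 "endpoint-b": ["memory-v1"]},
    loaded={"endpoint-a": ["cpu-load-v1"]},
)
print(pending)
```

Each entry in the result corresponds to an import request that the endpoint would serve by downloading the model artifact from the S3 bucket.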

The Anomaly Detection Solution provides export functionality, which periodically backs up all HTM models to the S3 bucket and thereby saves their learning progress. It is also implemented using Apache Airflow (the process is labeled with the number 3 on the workflows diagram).

The logic of the Export DAG is as follows. It first retrieves the mapping between the HTM models and SageMaker Endpoints from the database and sends export requests to the appropriate endpoints. Each endpoint prepares its HTM models for export: it temporarily disables learning for the model and packs it with the corresponding likelihood object into a single artifact. As soon as the Export DAG detects that the artifact is ready, it downloads it to the filesystem of the Airflow virtual machine, arranges the artifact into the required structure, then uploads it to the S3 bucket.

Conclusion

The current solution is built to implement an end-to-end real-time anomaly detection pipeline. It contains tools to manage data processing units, build ML models, and visualize abnormal behaviour.

The solution is available as an AWS Marketplace service. It contains showcases with simulated data, which allow the user to try all the features and steps of the pipeline. It is ready for real-world anomaly detection cases and customer data onboarding and can be scaled up to increase the number of processed metrics.

The core property of the solution is real-time machine learning. It is implemented with the NuPIC framework and helper processes, which are integrated into the anomaly detection workflow.

It is deployed with a “one-click” procedure, which requires that the user configure the deployment properties only. It consumes Amazon Cloud Services and components are deployed and configured with Grid Dynamics’ Analytics Platform.

Grid Dynamics proposes that the solution can be extended and further customized for specific business cases spanning computer vision, NLP, fraud detection, and many other potential uses.

References

[1] Unsupervised real time anomaly detection, Alex Rodin, white paper, Grid Dynamics Blog
[2] Unsupervised real-time anomaly detection for streaming data, Subutai Ahmad, Alexander Lavin, Scott Purdy, Zuha Agha, Neurocomputing
[3] Why neurons have thousands of synapses, a theory of sequence memory in neocortex, J. Hawkins, S. Ahmad, Front. Neural Circuits, 10 (2016), pp. 1-13
[4] Numenta Platform for Intelligent Computing, GitHub
