Different subscriber behaviours, hardware setups, and software configurations in an Internet service product generate various patterns. A development team validates the product's functionality through a strict evaluation process, using measures such as test pass rates, trouble reports, or key performance indicators. However, absence of evidence is not evidence of absence. It is hard to detect abnormality when unknown cases deviate from normal behaviour. The root cause of these cases might be unseen user behaviours, hardware malfunctions, or software defects. We call such a case an anomaly.
Anomaly detection refers to the problem of finding patterns in data that do not conform to the expected behaviour of the product. For example, we can identify unusual peaks and drops in traffic that differ from the normal pattern. Since it is hard to enumerate every possible abnormal behaviour in a rule-based system, companies commonly apply machine learning techniques to detect anomalies. From the perspective of computer system performance monitoring, it is important to detect anomalies quickly and automatically.
Note that anomaly detection is different from intrusion detection. While both aim to detect significant changes, intrusion detection aims to detect policy violations rather than possible product defects.
Potential anomalies in a continuous sequence can be divided into three different types.
While it is easier to discover a point anomaly, it is much harder to detect a contextual anomaly, since we need to take the context into account in order to explain whether a value is normal or abnormal. We therefore cannot simply judge a time-series plot by its shape when more than one contextual attribute is needed to make the judgement. A contextual anomaly detection algorithm is required to consider all the important variables that may explain suspicious behaviours.
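The difference between judging a value on its own and judging it in context can be illustrated with a small sketch. It is not how cbar works internally; it only assumes a single hypothetical contextual attribute (`"night"` or `"day"`), made-up traffic numbers, and a plain z-score rule with an arbitrary threshold of 3. All function and variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical history of normal traffic: (context, value) pairs.
HISTORY = (
    [("night", v) for v in [10, 12, 11, 9, 13, 10, 11, 12]]
    + [("day", v) for v in [78, 82, 80, 79, 81, 83, 77, 80]]
)


def is_global_anomaly(history, value, threshold=3.0):
    """Judge the value against all past values, ignoring context."""
    values = [v for _, v in history]
    z = abs(value - mean(values)) / stdev(values)
    return z > threshold


def is_contextual_anomaly(history, context, value, threshold=3.0):
    """Judge the value only against past values seen in the same context."""
    by_context = defaultdict(list)
    for c, v in history:
        by_context[c].append(v)
    values = by_context[context]
    z = abs(value - mean(values)) / stdev(values)
    return z > threshold


# A traffic level of 80 is ordinary during the day but suspicious at night.
print(is_global_anomaly(HISTORY, 80))               # context ignored
print(is_contextual_anomaly(HISTORY, "night", 80))  # judged against night traffic
print(is_contextual_anomaly(HISTORY, "day", 80))    # judged against day traffic
```

With a single, known attribute the grouping is trivial. The hard part described above is that real products have many candidate attributes, and it is not known in advance which of them define the relevant context.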
Here are the more detailed reasons why detecting a contextual anomaly is hard:
The general idea in Bayesian anomaly detection is to build a probabilistic model over normal cases and to compare new samples with the trained model as they arrive. Samples that have a small probability of being generated by the model are considered anomalies; that is, they are very unlikely to belong to the set of normal cases. If the number of potential indicators becomes large, it becomes hard to identify the best set of indicators. In the worst case, the number of regressors can be larger than the number of training samples, due to data sparsity.
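The "fit on normal cases, then score new samples" idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the model cbar builds: it assumes the normal cases follow a single Gaussian fitted with plug-in estimates (no priors, no regressors), and the cutoff `alpha` and all names are hypothetical.

```python
from statistics import NormalDist

# Hypothetical measurements collected while the product behaved normally.
NORMAL_SAMPLES = [100, 102, 98, 101, 99, 103, 97, 100]


def fit_normal_model(samples):
    """Fit a Gaussian to the normal cases (sample mean and sample stdev)."""
    return NormalDist.from_samples(samples)


def tail_probability(model, x):
    """Two-sided probability of a value at least as extreme as x under the model."""
    cdf = model.cdf(x)
    return 2 * min(cdf, 1 - cdf)


def is_anomaly(model, x, alpha=0.01):
    """Flag x when the model considers it very unlikely (alpha is an assumed cutoff)."""
    return tail_probability(model, x) < alpha


model = fit_normal_model(NORMAL_SAMPLES)
for x in (101, 120):
    print(x, is_anomaly(model, x))
```

A one-variable Gaussian sidesteps the indicator-selection problem entirely, which is exactly the part that becomes difficult once there are many potential regressors.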
cbar creates a probabilistic model that finds the best indicators to define context and to conclude potential anomalies based on their context. This library depends on bsts and Boom, which Steven L. Scott develops.