
Multi-Distribution Time Series Anomaly Detection

Team members

Gary Ong Guan Jie (ISTD), Wang Zijia (ISTD), Chan Chen Chang Joseph (ISTD), Li Yahan (ESD), Lin Hao (ESD), Lin Xiaohao (ESD)


Instructors:

Cyrille Pierre Joseph Jegourel, Stefano Galelli, Ying Xu

Writing Instructors:

Nurul Wahidah Binte Mohd Tambee

Project Description

Companies in the 21st century often collect vast amounts of data to perform analysis. Early detection of anomalies in the data allows companies to discover unusual points or patterns and act on such insights quickly. The objective of our project is to build a robust and scalable anomaly detection system that identifies anomalies in real time without supervision.

The video on the right-hand side is a demonstration of our product, a dashboard that gives the user a clear understanding of the detection results. The dashboard contains several functions and provides all the information the user needs to conduct further investigation on the detected anomalies.

1. Project Introduction


Grab’s mobile app and wide range of business avenues attract an increasing number of users, and Grab’s system gathers information on user behaviour to generate business insights. By analysing the collected data, Grab wants to detect any anomalies present and conduct further investigation on them to improve business efficiency.


1.1 Project Requirements

1. Detect abnormal events from different data sources.

2. Report the detected anomalies for further investigation.

1.2 Features

1. High scalability (50 billion events per hour).

2. Effective and robust model for rate and shape anomalies.

3. Clear and quick visualisation.

1.3 Dataset Description

The dataset given contains two million events. Each event is a data point generated at a specific time, containing 12 attributes. ‘Event’ and ‘Val’ are the key attributes used for detection.



2. Solution & Product Description


Our solution is a semi-supervised model that detects anomalies in large batches of data in a highly parallel way, without bottlenecks. The model consists of two parts: an unsupervised stage and a supervised stage.

Our model must be capable of processing billions of data points, so we parallelise the processing of our data to utilise computing resources as efficiently as possible.
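As a rough sketch of this idea, independent batches of a series can be scored concurrently, since no state is shared between them. The batch scorer below is a hypothetical stand-in (a simple z-score), not our production model, and in a real deployment the parallelism would come from a distributed framework rather than a local thread pool:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def score_batch(batch):
    # Stand-in scorer: z-score of each point against its batch statistics.
    mean, std = batch.mean(), batch.std() + 1e-9
    return np.abs(batch - mean) / std

def score_in_parallel(values, n_batches=4):
    # Independent batches share no state, so they can be scored concurrently.
    batches = np.array_split(np.asarray(values, dtype=float), n_batches)
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        scored = list(pool.map(score_batch, batches))
    return np.concatenate(scored)
```

Because each batch is self-contained, scaling to higher data traffic only requires adding workers.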


2.1 Overall System Architecture


[Figure: overall system architecture]



2.2 Unsupervised Model

The unsupervised model uses ensemble learning to detect anomalies. The ensemble combines three models: local outlier factor, isolation forest, and one-class support vector machine.
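A minimal sketch of such an ensemble using scikit-learn is shown below. The hyperparameters are illustrative defaults, not the ones we tuned, and the min-max normalisation and simple averaging are one common way to combine heterogeneous anomaly scores:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

def ensemble_scores(X):
    """Average the normalised anomaly scores of the three detectors.

    Higher output = more anomalous. Parameters are illustrative only.
    """
    scores = []
    # score_samples returns "higher = more normal", so negate it.
    scores.append(-IsolationForest(random_state=0).fit(X).score_samples(X))
    scores.append(-OneClassSVM(nu=0.05).fit(X).score_samples(X))
    # LOF in its default mode scores the training set via negative_outlier_factor_.
    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    scores.append(-lof.negative_outlier_factor_)
    # Min-max normalise each detector's scores before averaging.
    norm = [(s - s.min()) / (s.max() - s.min() + 1e-12) for s in scores]
    return np.mean(norm, axis=0)
```

Averaging after normalisation keeps one detector's score scale from dominating the others.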

[Figure: unsupervised ensemble pipeline]



2.3 Ensemble Learning Method

We ensemble the three algorithms mentioned above (refer to the slides below) to create a robust model that works with a wide range of distributions.



2.4 Supervised Model

The soft labels produced by the unsupervised ensemble are then passed to the supervised model for training and testing.
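One way this hand-off could look is sketched below: the ensemble's soft scores are binarised into pseudo-labels, which then train a fast supervised classifier. The threshold, classifier choice, and function names here are hypothetical, not our exact pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_on_soft_labels(X, soft_scores, threshold=0.8):
    # Binarise ensemble scores into pseudo-labels: 1 = anomaly, 0 = normal.
    pseudo = (np.asarray(soft_scores) >= threshold).astype(int)
    # Fit a supervised classifier that can be reused on incoming data
    # without re-running the slower unsupervised ensemble.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X, pseudo)
    return clf
```

Once trained, the classifier can score new events cheaply, which matters at the traffic levels described above.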





3. Frontend User Interface


The integrated dashboard enables the user to quickly observe the detection results by visualizing the detected anomalies. It provides all the useful information that the user needs to conduct further analysis.


[Figure: dashboard screenshots]


Function 1. Provides an informative summary. Allows the user to have a clear understanding of the dataset.

Function 2. Allows the user to adjust the detection threshold depending on the level of significance of the anomalies.

Function 3. Ranks the events by their anomaly scores, serving as a reference for the order of further investigation.

Function 4. Provides detailed information on every abnormal event when clicked on.

Function 5. The detection page allows the user to visualize the distribution of any event.
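Functions 2 and 3 above can be illustrated with a small helper, assuming anomaly scores normalised to [0, 1] (the function and parameter names are hypothetical, not the dashboard's actual API):

```python
def rank_events(event_ids, scores, threshold=0.7):
    # Keep only events whose score meets the user-chosen threshold (Function 2),
    # then rank them from most to least anomalous (Function 3).
    flagged = [(e, s) for e, s in zip(event_ids, scores) if s >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

Lowering the threshold surfaces more borderline events; raising it restricts the list to the most significant anomalies.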


4. Performance Evaluation

Experiments are conducted on external datasets to test the model's robustness and scalability.


Evaluation Metrics

1. Raw Runtime Value

2. ROC-AUC Score


External Datasets

1. Cardiotocography

2. Satellite Images

3. MNIST Images

Raw Runtime Value

[Figure: runtime comparison across datasets]

Receiver Operating Characteristic - Area Under Curve (ROC-AUC) Score

[Figure: ROC-AUC comparison across datasets]




Our model demonstrates lower runtime on all datasets. The runtime increases linearly with the amount of data, showing manageable scalability for handling higher data traffic.







The ROC-AUC score is a performance measurement for classification problems across all threshold settings. Our model produces higher ROC-AUC scores on all datasets, demonstrating its improved robustness.
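To make the metric concrete, here is a toy computation with scikit-learn (the labels and scores are made-up examples, not our experimental data):

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels (1 = anomaly) and model scores.
# AUC integrates over every possible threshold, so unlike accuracy
# it does not depend on choosing one cut-off in advance.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_score)  # 0.75: 3 of 4 (normal, anomaly) pairs are ranked correctly
```

This threshold independence is why ROC-AUC is a natural fit for a dashboard whose detection threshold is user-adjustable.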

