Next Generation Outlier Detection

Details

Author: Wang, Ye
Degree: Doctorate
Year: 2014
Keywords: Applied sciences; Outlier detection; Data mining; Alg
Advisor: Parthasarathy, Srinivasan
Institution: Ohio State University
Department: Computer Science and Engineering
Major: Computer science
ISBN: 9781321477375
CBH: 3670786
Country: USA
Language: English
FileSize: 2647951
Pages: 189

Abstract

Outlier detection is a fundamental task used in numerous data analytics applications. It tackles the problem of identifying rare or atypical points that diverge widely from the general behavior or model of the data. How outliers are detected and subsequently used for data analysis depends on the underlying application. For example, outlier detection can be employed as a preprocessing step to clean a data set of erroneous measurements and noisy data points. On the other hand, it can also be used to isolate suspicious or interesting patterns in the data. Examples include fraud detection, customer relationship management, network intrusion detection, clinical diagnosis, and biological data analysis. Although many successful algorithms have been developed for outlier detection, several challenges have haunted researchers and practitioners for decades. The first is limited algorithm scalability. Owing to the rapid growth of the World Wide Web, collected data can easily reach terabyte or even petabyte scale. Most existing approaches, ranging from statistical methods to geometric methods, and from density-based approaches to information-theoretic approaches, suffer from limited scalability and do not work well on large-scale data. The second is detecting outliers in irregular, dynamic, semi-structured data such as trees and graphs. There has been some research on finding outliers in graphs. What are the definitions of meaningful outliers in the graph context? How can we detect them accurately and efficiently? The third challenge is to build a unified and modular detection system that provides researchers with a complete toolbox for outlier detection tasks. Our research aims at designing next-generation outlier detection algorithms that tackle these three challenges. To achieve better scalability, we conducted an extensive empirical study of different optimization techniques for distance-based outlier detection.
We also proposed a ranking scheme driven by Locality Sensitive Hashing (LSH), which finds all outliers by visiting only a small portion of the data (10%). Finding the points similar to each point, or all-pairs similarity search, is the key operation in many distance-based, density-based, and cluster-based outlier detection methods. We optimized this fundamental kernel in metric space on the MapReduce platform, scaling the algorithm to hundreds of machines and resolving the inadequate-memory issue. For semi-structured outlier detection, we first designed a clustering-based algorithm and a generic clustering algorithm for sets/multisets, trees, and graphs. We also studied a concrete detection application on a semi-structured knowledge base and found more than one million anomalies. Finally, we integrated our work seamlessly into a detection framework that accepts different types of data. Users are also free to choose and compare different algorithms.
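
To make the notion of distance-based outlier detection mentioned in the abstract concrete, the following is a minimal illustrative sketch, not the dissertation's algorithm. It assumes one common definition, in which a point's outlier score is its Euclidean distance to its k-th nearest neighbor, and it uses a brute-force all-pairs scan with none of the optimizations (pruning, LSH-driven ranking, MapReduce) that the abstract describes. The function names `knn_distance_scores` and `top_outliers` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn_distance_scores(points: List[Point], k: int) -> List[float]:
    """Score each point by the distance to its k-th nearest neighbor.

    Assumption: Euclidean distance and a brute-force O(n^2) scan,
    chosen for clarity rather than scalability.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points: List[Point], k: int, n: int) -> List[Tuple[int, float]]:
    """Return (index, score) for the n points with the largest scores."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


if __name__ == "__main__":
    # Four tightly grouped points and one point far from the group.
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(top_outliers(data, k=2, n=1))
```

The quadratic all-pairs scan in this sketch is exactly the step that becomes impractical at terabyte scale, which is why the abstract treats all-pairs similarity search as the kernel to optimize.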
