FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream

English title: FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream
Authors: Bin LI; Yi-jie WANG; Dong-sheng YANG; Yong-mou LI; Xing-kong MA
Keywords: Data stream; Multi-dimensional sequence; Anomaly detection; Concept drift; Feature selection
Journal code: JZUS
Journal: Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿(英文))
Affiliations: Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology; Block Chain Research Institute of LianLian Pay
Publication date: 2019-03-03
Publisher: Frontiers of Information Technology & Electronic Engineering
Year: 2019
Issue: v.20
Funding: Project supported by the National Key R&D Program of China (No. 2016YFB1000101); the National Natural Science Foundation of China (Nos. 61379052 and 61502513); the Natural Science Foundation for Distinguished Young Scholars of Hunan Province, China (No. 14JJ1026); the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20124307110015)
Language: English
Page: JZUS201903008
Number of pages: 17
CN: 03
ISSN: 33-1389/TP
Classification number: 86-102

Abstract

Sequence anomaly detection has recently been widely used in many fields. Sequence data in these fields are usually multi-dimensional and arrive as a data stream. Designing an anomaly detection method for a multi-dimensional sequence over a data stream that is both accurate and fast is challenging for two reasons: (1) redundant dimensions in the sequence data and a large state space weaken sequence modeling; (2) anomaly detection cannot keep up with the high-speed nature of the data stream, especially when concept drift occurs, which reduces the detection rate. Most existing sequence anomaly detection methods focus on single-dimensional sequences, and the studies that do address multi-dimensional sequences concentrate mainly on static databases rather than data streams. To improve the performance of anomaly detection for a multi-dimensional sequence over the data stream, we propose a novel unsupervised fast and accurate anomaly detection (FAAD) method comprising three algorithms. First, a method called "information calculation and minimum spanning tree cluster" is adopted to reduce redundant dimensions. Second, to speed up model construction and maintain the detection rate for sequences over the data stream, we propose a method called "random sampling and subsequence partitioning based on the index probabilistic suffix tree." Last, a method called "anomaly buffer based on model dynamic adjustment" dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect anomalies in multi-dimensional log audit data. Compared with existing anomaly detection methods, FAAD performs well in detection rate and speed without being affected by concept drift.
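
The abstract names the dimension-reduction step but does not spell it out. The sketch below illustrates one plausible reading of "information calculation and minimum spanning tree cluster", and it is not the authors' algorithm. It assumes discrete-valued dimensions, uses symmetric uncertainty as the information measure, builds a minimum spanning tree over the distance 1 - SU with Prim's algorithm, cuts edges longer than a threshold, and keeps the highest-entropy dimension of each cluster. The function names, the `cut` threshold, and the representative-selection rule are all hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # both dimensions are constant: treat as fully redundant
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def mst_edges(dist):
    """Prim's algorithm on a dense symmetric distance matrix."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges


def cluster_dimensions(columns, cut=0.5):
    """Group dimensions joined by MST edges no longer than `cut` (assumed threshold)."""
    n = len(columns)
    dist = [
        [1.0 - symmetric_uncertainty(columns[i], columns[j]) for j in range(n)]
        for i in range(n)
    ]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, w in mst_edges(dist):
        if w <= cut:  # short edge: the two dimensions carry similar information
            parent[find(i)] = find(j)

    clusters = {}
    for d in range(n):
        clusters.setdefault(find(d), []).append(d)
    return sorted(clusters.values())


def select_representatives(columns, cut=0.5):
    """Keep one dimension per cluster (highest entropy; an assumed rule)."""
    return [
        max(group, key=lambda d: entropy(columns[d]))
        for group in cluster_dimensions(columns, cut)
    ]
```

For example, with three dimensions where the second is a relabeling of the first and the third is independent of both, `select_representatives` would keep one of the first two and the third.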

error function (also called the "Gauss error function"). Eq. (A1) is adopted as the standard for comparing the proportion of current anomalies with the proportion of historical anomalies. The more Pt deviates from the historical value, the larger F(Pt) is, which shows that the data distribution has changed and concept drift has occurred.
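
Eq. (A1) itself is not reproduced in this record, so the following is only a hedged illustration of the behavior described: a score built on the Gauss error function that grows as the current anomaly proportion moves away from its historical values. The function `drift_score`, the use of the historical mean and standard deviation, and the normalization by sqrt(2) are assumptions, not the paper's formula.

```python
import math
from statistics import mean, pstdev


def drift_score(current_ratio, historical_ratios):
    """Score in [0, 1] that increases with deviation from the historical anomaly ratio.

    Assumed form: erf(|p_t - mu| / (sigma * sqrt(2))), where mu and sigma are the
    mean and population standard deviation of the historical anomaly proportions.
    """
    mu = mean(historical_ratios)
    sigma = pstdev(historical_ratios)
    if sigma == 0:
        # No historical spread: any deviation at all counts as maximal.
        return 0.0 if current_ratio == mu else 1.0
    z = abs(current_ratio - mu) / sigma
    return math.erf(z / math.sqrt(2))
```

Under these assumptions, a detector could flag concept drift when the score exceeds a chosen threshold, for instance `drift_score(p_t, history) > 0.99`; the threshold value is likewise hypothetical.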
