Abnormal time series data detection of gas station by Seq2Seq model based on bidirectional long short-term memory

Chinese title: 基于双向LSTM的Seq2Seq模型在加油站时序数据异常检测中的应用
Authors: TAO Tao (陶涛); ZHOU Xi (周喜); MA Bo (马博); ZHAO Fan (赵凡)
Affiliations: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry
Keywords: gas station time-series data; deep learning; Seq2Seq; Bidirectional Long Short-Term Memory (Bi-LSTM); outlier detection
Journal: Journal of Computer Applications (计算机应用)
Journal code: JSJY
Publication date: 2018-10-31 16:01
Publisher: Journal of Computer Applications
Year: 2019
Volume/issue: v.39; No.343
Funding: High-Level Talent Introduction Program of Xinjiang Uygur Autonomous Region (Y639401201); West Light Foundation of the Chinese Academy of Sciences (2016-QNXZ-A-3)
Language: Chinese
Record ID: JSJY201903051
Page count: 6
Issue within year: 03
CN: 51-1307/TP
Pages: 308-313

Abstract

Time series data from gas stations contain multi-dimensional information about fueling behavior, but the data for any specific station are sparse. Established anomaly detection algorithms are poorly suited to such data: they report many false anomalies and miss many real ones. To address this, an anomaly detection method based on deep learning is proposed to identify vehicles with abnormal fueling behavior. First, an autoencoder extracts features from the data collected at the gas stations. Then, a Seq2Seq model with an embedded Bidirectional Long Short-Term Memory (Bi-LSTM) network predicts fueling behavior. Finally, the threshold for anomalous points is defined by comparing the predicted values with the original values. Experiments on a fueling dataset and a credit card fraud dataset verify the effectiveness of the method. Compared with existing methods, the Root Mean Squared Error (RMSE) is reduced by 21.1% on the fueling dataset, and anomaly detection accuracy is improved by 1.4% on the credit card fraud dataset. The proposed model can therefore be applied to detecting vehicles with abnormal fueling behavior, improving the management and operational efficiency of gas stations.
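
The last step of the abstract, comparing predicted values with original values to decide which points are anomalous, can be illustrated with a minimal Python sketch. The abstract does not state how the threshold is derived, so the rule used here (mean plus `k` standard deviations of the absolute prediction error) is an assumption, as are the function names `rmse` and `flag_anomalies` and the sample values. In the paper the predictions would come from the Bi-LSTM Seq2Seq model; here they are hard-coded placeholders.

```python
import math
from statistics import mean, pstdev


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared))


def flag_anomalies(actual, predicted, k=2.0):
    """Return (threshold, indices) for points whose error exceeds the threshold.

    Assumption (not from the paper): the threshold is the mean of the
    absolute prediction errors plus k population standard deviations.
    """
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    threshold = mean(errors) + k * pstdev(errors)
    flagged = [i for i, e in enumerate(errors) if e > threshold]
    return threshold, flagged


# Hypothetical fueling volumes (litres) and model predictions.
actual = [50.0, 52.0, 49.0, 51.0, 120.0, 50.0]
predicted = [51.0, 51.0, 50.0, 50.0, 52.0, 51.0]

threshold, flagged = flag_anomalies(actual, predicted)
print("RMSE:", round(rmse(actual, predicted), 3))
print("threshold:", round(threshold, 3), "anomalous indices:", flagged)
```

With these placeholder values only the fifth observation (index 4), whose error is far larger than the rest, exceeds the threshold. A fixed percentile of the error distribution would be an equally plausible rule; the choice depends on how many false alarms the station operator can tolerate.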

This work is partially supported by the Program of Introducing High-Level Talents of Xinjiang (Y639401201) and the West Light Foundation of Chinese Academy of Sciences (2016-QNXZ-A-3).

TAO Tao, born in 1994, M.S. candidate. His research interests include big data analysis and data mining.
ZHOU Xi, born in 1978, Ph.D., research fellow. His research interests include Internet of things and big data analysis.
MA Bo, born in 1984, Ph.D., associate research fellow. His research interests include data analysis and knowledge discovery, and machine learning.
ZHAO Fan, born in 1980, Ph.D. candidate, associate research fellow. His research interests include information security and big data analysis.