Anomaly detection method for hydrologic sensor data based on SparkR

Authors: LIU Zihao (刘子豪); LI Ling (李凌); YE Feng (叶枫)
Affiliations: School of Computer Science, Jiangsu University of Science and Technology; College of Computer and Information, Hohai University
Keywords: SparkR; AutoRegressive Integrated Moving Average (ARIMA) model; anomaly detection; hydrologic time series; K-Means
Journal code: JSJY
Journal: Journal of Computer Applications (计算机应用)
Publication date: 2018-11-16 13:51
Publisher: Journal of Computer Applications (计算机应用)
Year: 2019
Issue: v.39; No.342
Funding: Jiangsu Province Postdoctoral Research Funding Program (1701020C); Jiangsu Province "Six Talent Peaks" Funding Project (XYDXX-078)
Language: Chinese
Page: JSJY201902023
Page count: 5
CN: 02
ISSN: 51-1307/TP
Classification number: 132-136

Abstract

To efficiently detect outliers in massive hydrologic sensor data, an anomaly detection method for hydrologic time series based on SparkR is proposed. First, after the data are cleaned, a sliding window combined with the AutoRegressive Integrated Moving Average (ARIMA) model is used to make predictions on the SparkR platform. Then, a confidence interval is calculated for the prediction results, and values falling outside the interval are judged to be anomalies. Finally, based on the detection results, the K-Means algorithm is used to cluster the original data, the state transition probabilities are calculated, and the quality of the detected anomalies is evaluated. Using hydrologic sensor data collected from the Chu River as experimental data, experiments were carried out on running time and on anomaly detection performance. The results show that, with SparkR, computation on million-scale data takes longer on two nodes than on a single node, whereas on ten-million-scale data two nodes take less time than one, with a maximum reduction of 16.21%. After the quality evaluation, the sensitivity rises from 5.24% to 92.98%. These results indicate that, under SparkR, the proposed method, which exploits the characteristics of hydrologic data and combines prediction-based testing with clustering-based verification, improves on the computational efficiency of traditional methods when detecting anomalies in ten-million-scale hydrologic time series, and also significantly improves sensitivity compared with traditional methods.
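
The prediction and confidence-interval steps described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' SparkR implementation, and it rests on several assumptions: the model order is fixed at ARIMA(1,1,0) without drift and is fitted by least squares inside each window, the interval is the one-step prediction plus or minus z times the residual standard deviation, and each observation is compared against the interval of its own prediction. The function names, the window length of 20 and z = 1.96 are hypothetical choices made only for illustration. The abstract does not say how the state transition probabilities are used to evaluate the flagged points, so the last function only estimates them from a sequence of cluster labels, which is assumed to come from K-Means.

```python
import math


def fit_ar1(diffs):
    """Least-squares AR(1) coefficient (no intercept) and residual
    standard deviation for a list of first differences."""
    x, y = diffs[:-1], diffs[1:]
    denom = sum(v * v for v in x)
    phi = sum(a * b for a, b in zip(x, y)) / denom if denom > 0 else 0.0
    resid = [b - phi * a for a, b in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / max(len(resid) - 1, 1))
    return phi, sigma


def detect_anomalies(series, window=20, z=1.96):
    """Return the indices whose observed value lies outside the one-step
    prediction interval built from the preceding `window` points.

    Assumption: ARIMA(1,1,0) without drift, refitted in every window.
    """
    flagged = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        # d = 1: model the first differences of the window.
        diffs = [b - a for a, b in zip(hist, hist[1:])]
        phi, sigma = fit_ar1(diffs)
        # One-step forecast: last level plus the predicted next difference.
        pred = hist[-1] + phi * diffs[-1]
        if abs(series[t] - pred) > z * sigma:
            flagged.append(t)
    return flagged


def transition_probabilities(labels, k):
    """Row-normalised counts of moves from cluster i to cluster j between
    consecutive points (labels are assumed to be integers in 0..k-1)."""
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]
```

Calling `detect_anomalies(levels)` on a list of water-level readings returns the positions of the readings that fall outside their interval. Because each window is fitted independently of the others, the windows could in principle be distributed across worker nodes, which is the kind of parallelism the SparkR experiments measure.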
