Detection and simulation of missing information in big data storage process

Chinese title: 关于大数据存储过程中缺失信息检测仿真
Authors: RAN Juan (冉娟); REN Qiong (任琼)
Keywords: Incomplete data; Missing information; Affinity propagation; Interval similarity; Distributed computing
Journal (Chinese abbreviation): JSJZ
Journal (English): Computer Simulation
Institutions: Department of Computer Science & Technology, Tianjin University Renai College; School of Mathematics and Computer Science, Jianghan University
Publication date: 2018-12-15
Publisher: Computer Simulation (计算机仿真)
Year: 2018
Issue: v.35
Language: Chinese
Page: JSJZ201812103
Page count: 5
CN: 12
ISSN: 11-3724/TP
Classification number: 467-471

Abstract

Effectively detecting missing information during big data storage not only prevents anomalies in user data queries but also improves the accuracy and completeness of mining and analysis on incomplete data. With current missing-information detection methods, the detection delay introduced by the detection algorithm grows exponentially as the data volume rises, which degrades detection accuracy and can even cause the system to block or crash. To reduce the detection delay of existing methods while preserving detection accuracy, this paper proposes a missing-information detection method based on distributed optimized neighbor clustering. First, affinity propagation is used to cluster the incomplete dataset: the data are divided into a complete set and an incomplete set, and the proposed interval similarity is used to assign data of the same class to the same cluster. This clustering approach avoids interference from other objects and helps improve clustering accuracy and speed. Then, to further improve the execution efficiency of the detection algorithm, a distributed computing scheme is designed to optimize the clustering process, so that clustering, the most time-consuming operation, is computed in parallel. Finally, information entropy is computed over the objects of the same class obtained from clustering in order to detect the missing information. Simulation results verify that the proposed method markedly reduces the detection delay for missing information in incomplete data while maintaining good detection accuracy.
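The abstract names two quantities without defining them: an interval similarity used during affinity propagation, and an information entropy computed over each cluster. The following Python sketch illustrates one plausible reading and is not the authors' formulation. It assumes that a missing attribute value is represented by the interval spanning that attribute's observed range, that similarity is the negative squared distance between interval endpoints (which reduces to the negative squared Euclidean distance commonly used with affinity propagation when no value is missing), and that entropy means Shannon entropy of a discrete attribute. The names `MISSING`, `attribute_ranges`, `to_intervals`, `interval_similarity`, and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter

MISSING = None  # hypothetical marker for a missing attribute value


def attribute_ranges(records):
    """Observed (min, max) of each attribute, ignoring missing values.

    Assumes every attribute has at least one observed value.
    """
    n_attrs = len(records[0])
    ranges = []
    for j in range(n_attrs):
        seen = [r[j] for r in records if r[j] is not MISSING]
        ranges.append((min(seen), max(seen)))
    return ranges


def to_intervals(record, ranges):
    """Map a known value v to (v, v) and a missing value to the attribute's range."""
    return [(v, v) if v is not MISSING else ranges[j]
            for j, v in enumerate(record)]


def interval_similarity(a, b):
    """Assumed similarity: negative mean squared distance between interval endpoints.

    For two complete records this equals the negative squared Euclidean distance.
    """
    total = sum((lo1 - lo2) ** 2 + (hi1 - hi2) ** 2
                for (lo1, hi1), (lo2, hi2) in zip(a, b))
    return -total / 2.0


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Usage: the second record is missing its second attribute.
records = [(1.0, 2.0), (1.0, MISSING), (5.0, 6.0)]
ranges = attribute_ranges(records)
intervals = [to_intervals(r, ranges) for r in records]
sim_01 = interval_similarity(intervals[0], intervals[1])  # -8.0
sim_12 = interval_similarity(intervals[1], intervals[2])  # -24.0
```

Under these assumptions the incomplete record is more similar to the first record than to the third, so a clustering step driven by this similarity would tend to group it with the first. Each pairwise similarity is independent of the others, which is the kind of work the abstract's parallel clustering step could distribute. How the paper turns cluster entropy into a detection decision is not stated in the abstract, so `shannon_entropy` shows only the calculation itself.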

