Research on the Application of Correlation Analysis in Anomaly Detection
Abstract
This thesis studies the application of correlation analysis to anomaly detection, specifically to feature selection and to anomaly detection in earthquake feature data. A prototype software system that applies data mining theory and techniques to forecasting and assessing earthquake trends was also developed. The main contributions are as follows:
     A feature subset selection method is proposed that combines the discrete binary version of Particle Swarm Optimization (BPSO) with Overlap Information Entropy (OIE) as the fitness function. The method does not depend on a classifier. The main idea is as follows. First, a group of particles is generated randomly. The OIE between a candidate feature subset and the class attribute serves as the BPSO fitness function; its value indicates how strongly the selected subset correlates with the class attribute. BPSO then optimizes the feature subset, and the subset with the largest OIE with respect to the class attribute is taken as the optimal feature subset. Experimental results show that the method finds the optimal feature subset effectively, reduces feature dimensionality, and removes redundant information, while its classification results are no worse than those obtained with all features.
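     The search loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation. The abstract does not give the definition of OIE, so the hypothetical `fitness` function below substitutes the mutual information between the selected features and the class, minus a small per-feature penalty. The parameter values (`w`, `c1`, `c2`, `vmax`, `penalty`) and all function names are assumptions made for this example.

```python
import math
import random
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def fitness(mask, rows, classes, penalty=0.01):
    """Stand-in for OIE (assumption): I(selected features; class) minus a size penalty."""
    if not any(mask):
        return 0.0
    joint = [tuple(v for v, keep in zip(row, mask) if keep) for row in rows]
    mi = entropy(joint) + entropy(classes) - entropy(list(zip(joint, classes)))
    return mi - penalty * sum(mask)


def bpso_select(rows, classes, n_particles=10, n_iter=30,
                w=0.7, c1=1.5, c2=1.5, vmax=4.0, seed=0):
    """Binary PSO over feature masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Random initial positions (bit masks) and zero velocities.
    xs = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles)]
    vs = [[0.0] * d for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x, rows, classes) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(d):
                v = (w * vs[i][j]
                     + c1 * rng.random() * (pbest[i][j] - xs[i][j])
                     + c2 * rng.random() * (gbest[j] - xs[i][j]))
                v = max(-vmax, min(vmax, v))  # clamp the velocity
                vs[i][j] = v
                # The sigmoid of the velocity is the probability that the bit is 1.
                xs[i][j] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            f = fitness(xs[i], rows, classes)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = xs[i][:], f
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy data: feature 0 determines the class, feature 1 duplicates it, feature 2 is noise.
    rows = [(0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
            (1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
    classes = [0, 0, 0, 0, 1, 1, 1, 1]
    print(bpso_select(rows, classes))
```

     The size penalty is a design choice of this sketch: mutual information never decreases when a feature is added, so without it the search would have no reason to drop redundant features. Whether OIE needs such a term is not stated in the abstract.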
     The concept of a New Nonlinear Correlation Information Entropy (NNCIE) is proposed, building on Correlation Information Entropy (CIE) and Hpal entropy. Under the condition of the largest partition of finite sets, several properties of this entropy are derived and proved, and they satisfy the basic properties of the information entropy proposed by C. E. Shannon. NNCIE is a criterion for measuring the degree of correlation in multivariable, nonlinear systems. As a measure of the uncertainty of the correlation among multiple variables, it decreases as that correlation grows: the stronger the correlation among the variables, the smaller the NNCIE value. NNCIE contributes to information fusion and offers a new method and a new line of thought for research on correlation analysis theory. Application examples show that NNCIE is an effective and useful measure of uncertainty in nonlinear systems.
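     The inverse relation between correlation and entropy can be illustrated with a small sketch. The abstract does not state the NNCIE formula, so the code below is not NNCIE. As a stand-in it assumes one common eigenvalue-based form of correlation entropy, H = -sum((lambda_i / N) * log_N(lambda_i / N)), where lambda_i are the eigenvalues of the N x N correlation matrix. It is restricted to two variables with the linear (Pearson) coefficient r, for which the eigenvalues are 1 + r and 1 - r. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_entropy_2(x, y):
    """Eigenvalue-based correlation entropy for two variables (illustrative stand-in)."""
    r = max(-1.0, min(1.0, pearson(x, y)))  # guard against rounding outside [-1, 1]
    eigenvalues = (1.0 + r, 1.0 - r)        # eigenvalues of [[1, r], [r, 1]]
    h = 0.0
    for lam in eigenvalues:
        p = lam / 2.0                       # normalise so the weights sum to 1
        if p > 0.0:
            h -= p * math.log2(p)           # log base N, with N = 2 variables
    return h


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(correlation_entropy_2(x, [1, -1, -1, 1]))  # uncorrelated pair
    print(correlation_entropy_2(x, [1, 3, 2, 4]))    # partially correlated pair
    print(correlation_entropy_2(x, [2, 4, 6, 8]))    # perfectly correlated pair
```

     Under this stand-in definition the value lies between 0 and 1: it reaches 1 when the two variables are uncorrelated and approaches 0 as they become perfectly correlated, which matches the direction described for NNCIE above.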
     Based on the above research, a prototype software system was developed that uses data mining theory and techniques to predict and assess earthquake trends. The system is not a production application; its purpose is to lay a foundation for subsequent research. The correlation analysis module is one of its main components, and it uses NNCIE as the fitness function of the feature selection method proposed in this thesis. A visual interface is provided for the user; its main functions are feature selection and anomaly detection, which are used to assess the effectiveness of the feature selection method. Tests on feature data from the Wenchuan aftershocks show that the software functions correctly.
