Research on Manifold Learning Based Classifiers and Their Applications

Chinese title: 基于流形学习的分类算法及其应用研究
Author: Kang Li (康莉)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Data Mining; Aftershock Prediction; Adaptive Local Linearization; K Nearest Neighbor; Polynomial Regression; Anomaly Detection; Supervised Manifold Learning
Degree year: 2010
Advisor: Li Aiguo (李爱国)
Discipline code: 081202
Degree-granting institution: Xi'an University of Science and Technology (西安科技大学)

Abstract

Applying data mining to earthquake prediction is an active research area of both academic and practical importance. This thesis explores new data mining based approaches to predicting the time intervals between aftershocks and the magnitudes of aftershocks, and investigates the use of manifold learning dimensionality reduction for aftershock anomaly detection. Building on a study of several data mining algorithms, it identifies methods suited to aftershock time prediction, aftershock magnitude prediction and dimensionality reduction of seismic feature attributes, and develops a corresponding software prototype system. The main contributions are as follows:
     A method for predicting the time interval between aftershocks based on adaptive local linearization (ALL) is proposed. ALL is an adaptive local linearization method based on singular value decomposition; it determines the current embedding dimension adaptively and thereby overcomes the effect of ill-conditioned data matrices. The experimental data are the time intervals between aftershocks of magnitude 4.0 or greater that followed the Wenchuan earthquake. The evaluation criteria are MRMSE (Mean of Root Mean Square Errors), MMAE (Mean of Mean Absolute Errors) and AE (Absolute Error). In comparison with standard local linearization and least-squares fitting, the experiments show that ALL is an effective method for predicting the time interval between aftershocks.
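The abstract does not spell out the ALL computation. One common way to realize an SVD-based local linear predictor is sketched below, purely as an illustrative assumption. The symbols m (window length), k (number of neighbours), A, b, w, r and the threshold \varepsilon are introduced for this sketch and are not taken from the thesis.

```latex
\begin{align*}
% Delay vector built from the interval series t_1, t_2, \dots (window length m is assumed)
\mathbf{x}_n &= (t_n, t_{n-1}, \dots, t_{n-m+1})^{\top} \\
% Rows of A: the k nearest neighbours of the current state x_N; b: their successors
A &= \begin{pmatrix} \mathbf{x}_{n_1}^{\top} \\ \vdots \\ \mathbf{x}_{n_k}^{\top} \end{pmatrix},
\qquad
\mathbf{b} = (t_{n_1+1}, \dots, t_{n_k+1})^{\top} \\
% SVD of A and the number r of singular values kept (threshold \varepsilon is assumed)
A &= U \Sigma V^{\top},
\qquad
r = \#\{\, i : \sigma_i / \sigma_1 > \varepsilon \,\} \\
% Truncated least-squares solution of the local linear model b \approx A w
\mathbf{w} &= \sum_{i=1}^{r} \frac{\mathbf{u}_i^{\top}\mathbf{b}}{\sigma_i}\,\mathbf{v}_i \\
% One-step prediction of the next interval
\hat{t}_{N+1} &= \mathbf{w}^{\top}\mathbf{x}_N
\end{align*}
```

In this reading, discarding small singular values keeps the fit stable when A is ill-conditioned, and the retained rank r plays the role of an effective local dimension. How ALL actually selects it is not stated in the abstract.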
     For prediction problems in which the decision attribute is real-valued, a modeling method named PR-KNN (Polynomial Regression and K Nearest Neighbor) is proposed, which combines the K Nearest Neighbor (KNN) algorithm with a polynomial regression model. The experimental data are the sequence of aftershocks of magnitude 4.0 or greater that followed the Wenchuan earthquake. The time interval between aftershocks is the condition attribute and the aftershock magnitude is the decision attribute. The evaluation criteria are RE (Relative Error) and AE (Absolute Error). In comparison with traditional KNN regression and distance-weighted KNN regression, the experiments show that PR-KNN is a promising method for aftershock magnitude prediction.
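The abstract does not say how PR-KNN combines the two models. A minimal sketch of one plausible reading follows: select the k nearest neighbours of the query, fit a least-squares polynomial to them, and evaluate it at the query. The function names, the defaults `k=5` and `degree=2`, and the single scalar feature are assumptions made for illustration, not the thesis's implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; use a lower degree or a larger k")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve_linear(ata, aty)


def pr_knn_predict(train_x, train_y, query, k=5, degree=2):
    """Predict y at `query` from a polynomial fitted to its k nearest neighbours."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    xs = [train_x[i] - query for i in nearest]  # centre on the query for conditioning
    ys = [train_y[i] for i in nearest]
    coeffs = fit_polynomial(xs, ys, degree)
    return coeffs[0]  # value of the fitted polynomial at the centred query (x = 0)


# Toy data (not from the thesis): y = 2x^2 + 1 sampled at integer x.
prediction = pr_knn_predict([0, 1, 2, 3, 4, 5, 6], [1, 3, 9, 19, 33, 51, 73], 2.5)
print(prediction)
```

With `degree=0` the fit reduces to the plain average of the k neighbours, that is, ordinary KNN regression, which makes the relation to the baseline explicit.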
     To address the lack of a mapping function that takes test samples from the high-dimensional space to the low-dimensional space in supervised manifold learning algorithms, a supervised manifold learning algorithm named PR-SLLE is proposed. It builds on supervised locally linear embedding (SLLE), incorporates ideas from KNN and polynomial regression, and is applied to aftershock anomaly detection. The experimental data are seismic feature attribute data recorded after the Wenchuan earthquake. The evaluation criteria are AR (Accuracy Rate), FR (False alarm Rate) and OR (Omission Rate). In comparison with the standard SLLE algorithm, the experiments show that PR-SLLE combined with a naive Bayes classifier predicts better than SLLE, which indicates that PR-SLLE is a feasible and effective dimensionality reduction method.
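The out-of-sample step can be sketched as follows, assuming the low-dimensional coordinates of the training samples have already been produced by SLLE. For a test sample, the sketch finds its k nearest training samples in the input space and fits a first-degree polynomial (with a small ridge term for stability) from input coordinates to embedded coordinates. The function names, `k`, `ridge` and the choice of degree 1 are assumptions for illustration; the abstract does not give the exact PR-SLLE formulation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def map_test_sample(train_high, train_low, x, k=5, ridge=1e-8):
    """Estimate the embedded coordinates of x from its k nearest training samples."""
    nearest = sorted(range(len(train_high)),
                     key=lambda i: squared_distance(train_high[i], x))[:k]
    # Design rows [1, h_1 - x_1, ..., h_D - x_D], centred on the test sample.
    rows = [[1.0] + [h - q for h, q in zip(train_high[i], x)] for i in nearest]
    p = len(rows[0])
    # Regularized normal matrix; the ridge term keeps it invertible.
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    embedded = []
    for dim in range(len(train_low[0])):
        aty = [sum(r[i] * train_low[n][dim] for r, n in zip(rows, nearest))
               for i in range(p)]
        # After centring, the intercept is the fitted value at the test sample.
        embedded.append(solve_linear(ata, aty)[0])
    return embedded


# Toy data (not from the thesis): cube corners with a linear "embedding".
cube = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
low = [(h[0] + h[1], h[2] - h[0]) for h in cube]
print(map_test_sample(cube, low, (0.5, 0.5, 0.5)))
```

The embedded test sample can then be passed to any classifier trained on the embedded training set, such as the naive Bayes classifier mentioned above.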
     Based on the above results, a data mining based prototype subsystem for aftershock trend prediction and analysis was designed and implemented. It is an important component of the project group's prototype software system for data mining based earthquake trend forecasting and assessment. The subsystem comprises three modules: aftershock time-interval prediction, aftershock magnitude prediction and aftershock anomaly detection. Test results show that the prototype runs correctly. The prototype is intended to lay a foundation for further research.
