Research on Data Mining Methods in Earthquake Prediction (地震预报中的数据挖掘方法研究)

English title: Study on Novel Approaches of Data Mining for Earthquake Prediction
Author: Wu Shaochun (吴绍春)
Degree level: Doctoral
Discipline: Control Theory and Control Engineering
Chinese keywords: 地震预报 ; 数据挖掘 ; 关联规则 ; 时间序列 ; 序贯模式 ; 地震序列 ; 地震地区相关性 ; 地震前兆观测数据 ; 并行数据挖掘平台
English keywords: Earthquake Prediction ; Data Mining ; Association Rules ; Time Series ; Sequential Pattern ; Earthquake Sequence ; Relativity of Earthquake Zones ; Earthquake Auspice Data ; Parallel Data Mining Platform
Degree year: 2005
Supervisor: Wu Gengfeng (吴耿锋)
Discipline code: 081101
Degree-granting institution: Shanghai University (上海大学)
Submission date: 2005-08-01

Abstract

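Earthquake prediction is internationally recognized as a worldwide challenge. Over more than 30 years of development, earthquake prediction work in China has accumulated rich experience and a large body of data, and the national seismic network records gigabytes of earthquake precursor observation data every day. This dissertation introduces data mining into the field of earthquake prediction, studies methods that combine existing seismic data processing with data mining techniques, and makes full use of modern high-performance computing environments to mine from these massive data sets the regularities that earthquake prediction requires, so as to help domain experts improve prediction accuracy.
     Building on a discussion of current data mining algorithm models and their implementations, the dissertation first examines the traditional methods of earthquake prediction (analysis of earthquake case data and of precursor observation data). It then studies key problems including the correlation analysis of earthquake zones, earthquake sequence analysis, and the regularities of earthquake precursors, and implements a parallel association rule algorithm, a time series similarity matching algorithm based on earthquake similarity, and a sequential pattern mining algorithm. Next, based on time series analysis, it proposes a series of processing models and parallel algorithms for earthquake precursor observation data. Finally, drawing on practical applications, it implements a parallel data mining platform for earthquake prediction, providing strong technical support for processing massive data in earthquake prediction data mining.
     The main innovative contributions of this dissertation include:
     1. Based on association analysis, the dissertation studies how to search for correlated earthquake zones, and proposes and implements FPM-LP (Fast Parallel Mining of Local Pruning), a parallel association rule algorithm with a master/slave design. The zone correlation problem is recast as association rule mining over time series, and experiments and analysis of the results uncover a good deal of valuable knowledge about correlations between earthquake regions.
     2. Based on time series similarity matching, the dissertation analyzes the correlation of earthquake zones and implements WSM3S (Whole Sequence Matching Based-on Seismo Similarity Support), a similarity-based matching algorithm for earthquake time series. From the three-dimensional perspective of the three earthquake elements (time, space, and intensity), it defines earthquake similarity and gives a time series similarity matching model and algorithm. Using historical data from the past two decades for the seismically active regions of China, the algorithm was applied in sequence similarity experiments at several granularities and time lags, yielding results of fairly high credibility.
     3. Based on sequential pattern mining, the dissertation studies earthquake sequence analysis, and proposes and implements SPBGC (Sequential Pattern Mining Based on General Constraints), a sequential pattern mining algorithm based on generalized constraint rules. Domain knowledge about earthquake sequences is defined as a set of generalized constraint rules, and the algorithm is applied to mine generalized earthquake sequences from earthquake case data, giving domain experts strong support for studying the similarity of earthquake sequences.
     4. Based on time series analysis, the dissertation focuses on methods for processing earthquake precursor observation data and proposes a series of practical parallel algorithms for that processing. First, it proposes a dynamic-programming-based time warping method as the similarity measure for subsequence search, which can effectively take into account noise, amplitude, offset, and similar issues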
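     As a rough illustration of the dynamic-programming time warping measure named in item 4, the sketch below computes a textbook dynamic time warping (DTW) distance and slides a query over a longer series. Z-normalizing each window is one common way to reduce sensitivity to amplitude and offset. This is generic DTW, not the dissertation's algorithm, whose details the abstract does not give; the function names `znorm`, `dtw_distance`, and `best_subsequence` and the fixed window length are assumptions made for the example.

```python
import math


def znorm(xs):
    """Z-normalize a sequence so that amplitude scaling and offset are removed."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    if std == 0:
        # A constant window carries no shape information.
        return [0.0] * n
    return [(x - mean) / std for x in xs]


def dtw_distance(a, b):
    """Classic DTW with absolute-difference local cost, O(len(a) * len(b))."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def best_subsequence(series, query):
    """Return (start, distance) of the window of len(query) closest to query."""
    w = len(query)
    q = znorm(query)
    best_start, best_dist = None, float("inf")
    for start in range(len(series) - w + 1):
        d = dtw_distance(znorm(series[start:start + w]), q)
        if d < best_dist:
            best_start, best_dist = start, d
    return best_start, best_dist


if __name__ == "__main__":
    # Made-up numbers: the shape [1, 3, 2] reappears scaled by 10 at index 4.
    series = [5, 5, 5, 5, 10, 30, 20, 5, 5]
    print(best_subsequence(series, [1, 3, 2]))
```

     The sliding window keeps the example short; a practical subsequence search over large precursor records would normally add a warping-window constraint or a lower bound to prune candidates.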
Earthquake prediction is a challenging problem worldwide. Over the past 30 years of earthquake prediction work in China, a large amount of prior knowledge and a vast body of data have been accumulated, and the sensor network of seismological observatories records massive volumes of earthquake precursor data every day. In this dissertation, we introduce data mining techniques into the field of earthquake prediction and study several novel approaches that combine data mining with seismological data analysis. Using high-performance computing and parallel data mining, the seismological domain knowledge hidden in these massive data sets can be discovered efficiently to support earthquake prediction and thereby help improve its accuracy.
     After reviewing existing data mining algorithms, the dissertation examines seismological domain knowledge and the traditional methods of earthquake prediction. Then, addressing the correlation analysis of earthquake zones, earthquake sequences, and the regularities of earthquake precursor data, it implements several data mining algorithms: a parallel association rule mining algorithm, a time series similarity matching algorithm based on seismological similarity, and a sequential pattern mining algorithm. Furthermore, based on time series analysis, a method for processing earthquake precursor data and a series of parallel algorithms are proposed. Finally, a parallel seismological data mining platform is implemented that integrates all of the algorithms proposed in this dissertation.
     The main contributions of the dissertation are as follows:
     1. To analyze earthquake catalogue data for correlations between earthquake zones, FPM-LP (Fast Parallel Mining of Local Pruning), a parallel association rule mining algorithm in master/slave mode, is put forward, together with the corresponding preprocessing algorithm. The experimental results demonstrate that the algorithm finds correlated earthquake zones satisfactorily.
     2. To analyze correlated earthquake zones by time series similarity matching, a definition of seismological similarity, a similarity matching model for correlated earthquake zones, and its algorithm WSM3S (Whole Sequence Matching Based-on Seismo Similarity Support) are proposed in terms of the three essential factors of an earthquake: time, space, and intensity. Just by
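     To make the reformulation in contribution 1 more concrete: if each time window is treated as a transaction holding the zones that were active in that window, correlated zones appear as rules "zone A active => zone B active" with high support and confidence. The sketch below is a plain serial pair-counting illustration of that idea, not FPM-LP itself, whose parallel design and pruning the abstract does not spell out. The function name `zone_pair_rules`, the zone labels, the window contents, and the thresholds are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def zone_pair_rules(windows, min_support=0.3, min_confidence=0.6):
    """Mine pairwise rules (a, b, support, confidence): 'a active => b active'.

    windows: list of sets, each holding the zones active in one time window.
    """
    n = len(windows)
    single = Counter()  # number of windows in which each zone is active
    pair = Counter()    # number of windows in which both zones are active
    for w in windows:
        zones = sorted(set(w))
        single.update(zones)
        pair.update(combinations(zones, 2))

    rules = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue  # the pair co-occurs too rarely to be of interest
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    # Hypothetical activity of three zones over five time windows.
    windows = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"}]
    for rule in zone_pair_rules(windows):
        print(rule)
```

     With these toy windows the pair A/B passes both thresholds in each direction, while C => B passes and B => C does not, which shows why confidence is checked per direction. Studying time lags between zones would additionally require shifting one zone's windows relative to the other before counting.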

