Research on Meteorological Data Mining Based on Bayesian Classification on Hadoop
Abstract
As meteorological services continue to modernize, the volume of meteorological data keeps growing, and processing and computing over these massive data sets efficiently has become an important problem in meteorological data mining. Distributed computing makes it possible to address this problem and has become the foundation for applying data mining in meteorology.
     Based on an analysis of the characteristics and processing of meteorological data, this paper takes as its study data the records of four stations in Jiangsu Province (Xuzhou, Ganyu, Nanjing, Dongtai) from 1951 onward, drawn from the daily-value data set of Chinese surface climate data. The main work is as follows:
     (1) The technologies behind the open-source cloud platform Hadoop are analyzed, with a focus on the MapReduce programming model, job flow, and key techniques. A rainfall grading statistics experiment is implemented with MapReduce; the results show that the proportion of missing rainfall observations in the data set is very low, so the data set is suitable for the study.
     (2) The application of the Naive Bayes (NB) classifier to rainfall classification is studied. Given the characteristics of the meteorological data set, correlation coefficients and the PKI discretization method are used to select and discretize the predictors. Classification accuracy is obtained by training and testing on the data set, and the shortcomings of the NB classifier for rainfall classification are analyzed from three aspects: the temporal continuity of the predictors, underflow in the probability calculations, and the discretization method.
     (3) To address the shortcomings of NB in rainfall classification and its low efficiency on large meteorological data sets, the three stages of NB (preprocessing, model training, and accuracy evaluation) are implemented in MapReduce, yielding an improved MapReduce-based Naive Bayes classifier (MRNB).
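The rainfall grading statistics described in item (1) can be pictured as a map step that labels each daily record and a reduce step that sums the labels per station. The following is only a minimal in-process sketch, not the experiment's actual Hadoop job: the record layout, the use of `None` for a missing observation, the grade boundaries in `GRADES`, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical grade boundaries in mm/day (upper bound, label).
# The source does not state the thresholds it used.
GRADES = [(0.1, "none"), (10.0, "light"), (25.0, "moderate"), (50.0, "heavy")]


def grade(precip_mm):
    """Return a rainfall grade label, or 'missing' when no value was recorded."""
    if precip_mm is None:
        return "missing"
    for upper, label in GRADES:
        if precip_mm < upper:
            return label
    return "storm"


def map_record(record):
    """Map step: emit ((station, grade), 1) for one daily record."""
    station, _date, precip_mm = record
    yield (station, grade(precip_mm)), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(records):
    """Run map then reduce in one process, standing in for a Hadoop job."""
    emitted = (pair for record in records for pair in map_record(record))
    return reduce_counts(emitted)


# Made-up sample records: (station, date, precipitation in mm or None).
records = [
    ("Nanjing", "1951-01-01", 0.0),
    ("Nanjing", "1951-01-02", 12.5),
    ("Nanjing", "1951-01-03", None),
    ("Xuzhou", "1951-01-01", 60.0),
]
counts = run_job(records)
```

Counting the `missing` label alongside the rainfall grades is what lets a single pass report how complete each station's record is.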
     Rainfall classification experiments show that, compared with the NB classifier, the proposed MRNB classifier makes full use of cluster resources, improves mining efficiency on large data volumes, and achieves better accuracy when classifying large meteorological data sets. The MRNB classifier also scales well, offering a better basis for future classification mining on massive meteorological data.
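Two ideas from the abstract lend themselves to a short illustration: Naive Bayes training reduces to counting, which fits the map and reduce pattern, and summing log-probabilities avoids the underflow that multiplying many small probabilities causes. The sketch below is not the MRNB classifier itself, whose details the abstract does not give. The function names, the Laplace smoothing, and the toy discretized predictors are assumptions chosen for illustration.

```python
import math
from collections import Counter


def map_counts(example):
    """Map step: emit count keys for one (features, label) training example."""
    features, label = example
    yield ("class", label), 1
    for index, value in enumerate(features):
        yield ("feature", label, index, value), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def train(examples):
    """Run map then reduce in one process, standing in for a cluster job."""
    return reduce_counts(p for ex in examples for p in map_counts(ex))


def feature_cardinalities(counts):
    """Number of distinct values seen for each feature index."""
    values = {}
    for key in counts:
        if key[0] == "feature":
            _, _, index, value = key
            values.setdefault(index, set()).add(value)
    return {index: len(seen) for index, seen in values.items()}


def classify(counts, features):
    """Pick the class with the largest summed log-probability."""
    labels = [key[1] for key in counts if key[0] == "class"]
    total = sum(counts[("class", c)] for c in labels)
    sizes = feature_cardinalities(counts)
    scores = {}
    for c in labels:
        class_count = counts[("class", c)]
        score = math.log(class_count / total)
        for i, v in enumerate(features):
            # Laplace smoothing (an assumption here) keeps unseen values nonzero.
            numerator = counts[("feature", c, i, v)] + 1
            score += math.log(numerator / (class_count + sizes[i]))
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training set with made-up discretized predictors (humidity, pressure).
examples = [
    (("humid", "low"), "rain"),
    (("humid", "low"), "rain"),
    (("humid", "high"), "rain"),
    (("dry", "high"), "no_rain"),
    (("dry", "high"), "no_rain"),
    (("dry", "low"), "no_rain"),
]
model = train(examples)
prediction = classify(model, ("humid", "low"))

# Why log space: a direct product of 200 factors of 0.01 underflows to zero,
# while the equivalent sum of logarithms stays finite.
naive_product = 0.01 ** 200
log_sum = 200 * math.log(0.01)
```

Because the model is nothing more than a table of counts, partial counts computed on separate data splits can be summed in any order, which is what makes this training step easy to distribute.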