Research and Implementation of a Distributed Data Mining Model Based on Globus
Abstract
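Everything in the world changes and develops, and computer application models have changed along with enterprise applications. Over nearly 50 years of development, they have followed a path from centralized to distributed, and the arrival of grid technology has moved them toward distribution once again. As information technology advances, the volume of data produced inside departments and enterprises is growing sharply. This explosive growth brings enterprises both opportunities and challenges: how to discover knowledge in such massive data, and how to do so efficiently, is a major challenge for today's information society. Traditional centralized data mining can solve some of the problems caused by data distribution, but when faced with massive data its mining performance increasingly falls short of what people need. The grid application model gives distributed data mining a new opportunity.
     This thesis focuses on distributed data mining models in the Globus environment. The primary problem that distributed data mining must solve is matching data resources to computing resources sensibly, so that mining performance is optimized. The traditional distributed data mining models, the mobile-code model and the mobile-data model, each have advantages, but neither solves the matching problem, so neither can optimize the performance of a distributed data mining task. The PDS model proposed in this thesis combines the advantages of both and uses minimum response time as its task allocation strategy, optimizing the allocation of distributed data mining tasks that span multiple data sets. The thesis also gives prediction methods for each component of the minimum response time model, together with experimental results.
     The GS model is a distributed data mining model based on Globus grid services and is a simplified form of the PDS model. Following the SOA architectural approach, it encapsulates distributed data mining functions as grid services, and clients complete data mining tasks by invoking those services. In Chapter 5 the author develops a server-side program for the GS model.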
All things are constantly changing and developing, and computer application models change and develop along with enterprise applications. Over nearly 50 years of development, computer application models have moved from centralized to distributed. With the arrival of Grid technology, they have become distributed once again. As information technology develops, the data produced daily by departments within the enterprise is increasing dramatically. This explosive growth of data brings the enterprise not only opportunities but also challenges: how to discover knowledge in these massive data, and how to do so effectively, is a major challenge in today's information society. The traditional centralized data mining approach can, to some extent, solve a number of issues brought about by data distribution, but when faced with massive data it is increasingly unable to meet people's performance needs. Grid technology brings new opportunities to distributed data mining.
     This thesis focuses on Distributed Data Mining (DDM) in the Globus environment. The first problem DDM has to solve is the rational matching of data resources and computation resources in order to achieve good performance. The traditional models of distributed data mining, the data transfer model and the code transfer model, each have their advantages, but neither solves the matching of data resources and computation resources, so they cannot optimize task performance. This thesis presents the PDS model (Policy, task Dispatching and Scheduling based DDM model), which combines the advantages of the data transfer model and the code transfer model and applies minimum response time as the task allocation strategy for distributed data mining. The PDS model can optimize task assignment for DDM over multiple data sets. The thesis also presents prediction methods for the components of the DDM minimum response time model, together with experimental results.
     The GS model is based on the Globus Grid Service and is a simplified form of the PDS model. The GS model follows the SOA approach: it packages the functions of distributed data mining as Grid Services and allows clients to call these services. In Chapter 5, the author develops a server-side program for the GS model.
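The minimum-response-time allocation strategy named in the abstract could look roughly like the following. This is only a minimal sketch, not the thesis's PDS implementation: the abstract does not list the components of its response time model, so the breakdown into waiting, transfer, and compute time, along with every name here (`Node`, `Dataset`, `response_time`, `allocate`, `code_mb`, `bandwidth_mb_s`), is an assumption made for illustration. Running a task on the node that hosts the data corresponds to the code transfer model, and running it elsewhere corresponds to the data transfer model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    ops_per_sec: float   # assumed compute speed
    queue_wait_s: float  # assumed predicted wait before a new task can start


@dataclass(frozen=True)
class Dataset:
    name: str
    host: str        # node that stores the data
    size_mb: float
    work_ops: float  # assumed estimate of operations needed to mine it


def response_time(ds, node, wait_s, bandwidth_mb_s, code_mb=1.0):
    """Predicted response time = wait + transfer + compute (illustrative)."""
    # Running on the data's host ships only the mining code; any other
    # node also needs the data set copied to it.
    moved_mb = code_mb if node.name == ds.host else code_mb + ds.size_mb
    return wait_s + moved_mb / bandwidth_mb_s + ds.work_ops / node.ops_per_sec


def allocate(datasets, nodes, bandwidth_mb_s):
    """Greedily give each data set the node with the smallest prediction."""
    wait = {n.name: n.queue_wait_s for n in nodes}
    plan = {}
    for ds in datasets:
        best = min(
            nodes,
            key=lambda n: response_time(ds, n, wait[n.name], bandwidth_mb_s),
        )
        finish = response_time(ds, best, wait[best.name], bandwidth_mb_s)
        plan[ds.name] = (best.name, finish)
        wait[best.name] = finish  # the node stays busy until this task ends
    return plan


if __name__ == "__main__":
    nodes = [Node("A", 1e6, 0.0), Node("B", 1e7, 0.0)]
    datasets = [
        Dataset("d1", host="A", size_mb=100.0, work_ops=1e8),
        Dataset("d2", host="A", size_mb=1000.0, work_ops=1e7),
    ]
    for name, (node, seconds) in allocate(datasets, nodes, 10.0).items():
        print(f"{name} -> {node} (predicted {seconds:.1f} s)")
```

In this toy setup the small but compute-heavy data set is moved to the faster node, while the large but cheap one is mined where it is stored, which is the kind of per-task choice between the two traditional models that the abstract attributes to PDS.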
