基于云计算的Web结构挖掘算法研究

英文题名：Research on Web Structure Mining Algorithm in Cloud Computing
作者：李远方
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：云计算 ; Web结构挖掘 ; PageRank ; Hadoop ; K-span
英文关键词：Cloud Computing ; Web Structure Mining ; PageRank ; Hadoop ; K-span
学位年度：2011
导师：邓世昆
学科代码：081203
学位授予单位：云南大学
论文提交日期：2011-05-01

摘要

数据挖掘是从大量纷杂的数据中分析并提取有用的知识和信息。当今网络上最重要的资源信息库是Web页,因此研究Web数据挖掘有着重要意义。但随着互联网的高速发展,Web信息日增长呈指数量级发展,要从中分析出有用的信息,单一节点的计算和存储己存在着瓶颈,而最近提出的云计算则提供了一种全新的解决方案,即可以通过网络获取强大的计算能力和存储能力,并行高效的挖掘知识和信息。
     文章在概述了云计算、Web结构挖掘、Hadoop等基本理论知识后,将Web结构挖掘算法和云计算(Hadoop开源云平台)进行了整合,并做了以下工作：
     1.对Web链接结构做了图论抽象,并详细说明了如何取得Web图结构数据,为挖掘算法提供了统一的数据表示方法。
     2.对类似数据库或文件系统的数据对象做了MapReduce抽象,以此说明MapReduce模型应用广泛,能满足实际的需要。
     3.对Hadoop分块(BlockSize)策略进行了研究,并建立了相应的数学模型,在实验结果阶段进行了试探性的研究。
     4.对传统PageRank并行算法进行了改进和移植,并提出了K-span算法。K步跨度算法(K-span)思想：尽量在PageRank并行迭代时减少Hadoop集群节点之间的通信次数,使得PageRank总的迭代时间减少,从而达到快速收敛的目的。具体来讲是Hadoop运行时,可预先将Dk和(AT)k的值依次算出,保存在Hadoop公用访问处,避免了节点之间频繁的通信访问。
     最后搭建Hadoop平台来估计传统PageRank并行算法和K-span并行算法的时间和空间开销,实验结果表明K-span算法的执行时效更好,同时也带来了额外的存储开销,但相对于云平台的高存储量来讲这点牺牲是值得的。
Data mining is a term that defines a method of analyzing and extracting the useful knowledge and information from the abundant and confusing data. Today, the most important resource on the information base on the internet is the web page. Consequently, studying the data mining of the web is of great importance. With the rapid development of the internet, the information of web is growing by magnitudes per day, which makes the computing and storage of a single node quite difficult for the researchers. Nevertheless, the newly-proposed theory---the cloud computing provides us a new solution by which we can mine knowledge and obtain the information efficiently through the Internet, where the cloud computing can get the capability of computing and storage.
     After overviews the cloud computing, web structure mining, Hadoop, and some other basic knowledge, this essay integrates the structure mining algorithms of web and the cloud computing theory (Open-source cloud platform of the Hadoop). Here is the work I did:
     1) Using graphs to analyze the interlinkage of the web and illustrate how to obtain the graph structure data of the web, which provide a unified method of the data representation for the mining algorithm.
     2) Making a MapReduce abstract to the data which are quite similar to the database or the file system through which stating the widely use of the MapReduce which can meet the actual demands of the modern times.
     3) Researching the blocksize strategy of the Hadoop, establishing a mathematical model correspondently and preceding a preliminary research at the terminate stage of the experiment.
     4) Improving and grafting the traditional PageRank algorithm, and proposing the K-span algorithm. The K-step span algorithm (K-span) is:trying hard to reduce the communicating times which exist in the clusters of the nodes when the iterations occur in the PageRank. In this way, the iterative time of the PageRank can be decreased and the speed up the pace of the convergence. That is, working out the values of both and then storing them in the public visiting room of the Hadoop, this avoids the frequent communication between the nodes when the Hadoop is operating.
     Finally, setting up a Hadoop platform to estimate the time and space cost of the traditional PageRank algorithm and the K-span algorithm. The result shows that K-span algorithm has a better execution time though it brings the additional cost, which is worthy in return for the vast memory space.

引文

[1]程川生.Web挖掘技术及其应用[D].山东：山东大学,2004.
    [2]刘鹏.云计算[M].北京：工业出版社,2010.
    [3]王鹏.云计算的关键技术与应用实例[M].北京：人民邮电出版社,2010.
    [4]邵峰晶,于忠清.数据挖掘原理与算法[M].北京：中国水利水电出版社,2003.
    [5]刘军.基于Web结构挖掘的HITS算法研究[D].湖南：中南大学,2008.
    [6]封俊.基于Hadoop的分布式搜索引擎研究与实现[D].太原：太原理工大学,2010.
    [7]Hadoop开发者入门专刊[EB/01]. http://www.hadoopor.com.
    [8]维基百科Cloud computing[EB/01].http://en.wikipedia.org/wiki/Cloud_computing.
    [9]Konstantina. A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework[M].Master of Science School of Informatics University of Edinburgh.2009.pp.990-901.
    [10]Amazon Simple Storage Service [EB/01]. http://aws.amazon.com/s3/.
    [11]Amazon Elastic Compute Cloud [EB/01]. http://aws.amazon.com/ec2/.
    [12]Google App Engine [EB/01]. http://appengine.google.com.
    [13]陈康, 郑纬.云计算：系统实例与研究现状[J].软件学报,2009,20(5)：1337-1348.
    [14]陈涛.云计算理论及技术研究[J].重庆交通大学学报,2009,9(4)：104-106.
    [15]WindowsAzure[EB/01].http://www.microsoft.com/windowsazure/windowsaz ure
    [16]李艳华.云计算技术研究现状综述[J].电脑知识与技术,2009,5(22)：6314-6315.
    [17]Michael O.Rabin.Efficient dispersal of information for security,load balancing and fault tolerance[J]. Journal of the ACM,1989,36(2):335-348.
    [18]云计算的关键技术[EB/01]. http://baike.baidu.com/view/2465492.htm.
    [19]陈全,邓倩妮.云计算及其关键技术[J].计算机应用,2009,29(9).
    [20]Ralf Lammel.Google's MapReduce programming model-Revisited[J].Science of Computer Programming,2008(70):1-30.
    [21]冶红.基于数据挖掘的Web挖掘系统的研究[D].辽宁：大连理工大学,2003.
    [22]刘岩.基于Web的文本挖掘技术的研究[D].黑龙江：哈尔滨工程大学,2004.
    [23]姬彦利.Web结构挖掘算法研究[D].上海：华中师范大学,2009.
    [24]图论[EB/01]. http://baike.baidu.com/view/79350.htm.
    [25]范聪贤.Web结构挖掘中PageRank算法研究[D].江苏：苏州大学,2009.
    [26]田甜,倪林.基于PageRank算法的权威值不均衡分配问题[J].2007,33(18)：53-55.
    [27]黄德才,戚华春PageRank算法研究[J].计算机工程,2006,32(4)：145-146.
    [28]杨劲松,凌培亮.搜索引擎PageRank算法的改进[J].算机工程,2009,35(22)：35-37.
    [29]A.N.Langville,C.D.Meyer.A Reordering for the PageRank Problem[J].SIAM Journal on Scientific Computing,2006,27(6):2112-2120.
    [30]王德广,周志刚PageRank算法的分析及其改进[J].计算机工程,2010,36(22).
    [31]李雪锋.基于云计算环境的Web数据挖掘算法研究[D].北京：北京交通大学,2010.
    [32]Yang H C,Dasdan A,Hsiao R L,etc.Map-Reduce-Merge:Simplified Relational Data Processing on Large V lusters[C].International Conference on Management of Data Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data.2007:1029-1040.
    [33]柬亚建,黄群民Hadoop平台的性能优化研究[J].计算机工程,2010,36(14)：262-266.
    [34]B.C.Catanzaro,N.Sundaram,and K.Keutzer. A Map Reduce Framework for Programming Graphics Processors.In Workshop on Software Tools for MultiCore,2006.
    [35]C.Ranger,R.Raghuraman,A.Penmetsa,G.R.Bradski and C.Kozyrakis. "Evaluating MapReduce for Multi-core and Multi processor Systems", HPCA 2007, pp.13-24.
    [36]郑启龙,房明.基于MapReduce模型的并行科学计算[J].微电子学与计算机,2009,26(8)：21-23.
    [37]J.Dean and S.Ghemawat. MapReduce:Simplied DataProcessing on Large Clusters.In OSDI'04:Proceedings of the 6th conference on Symposium on Opearting Systems. Design & Implementation. USENIX Association,2004.
    [38]孙广中,肖峰MapReduce模型的调度及容错机制研究[J].微电子学与计算机,2007,24(9)：37-38.
    [39]Abhinav Sarje,Srinivas Aluru. A MapReduce Style Framework for Trees[J].Department of Electrical and Computer Engineering,2008,21(13): 17-18.
    [40]胡或,封俊Hadoop下的分布式搜索引擎[J].计算机系统应用,2010,19(7)：24-26.
    [4]] 张尧学,史美林.计算机操作系统教程[M].北京清华大学出版社,2000,92-93
    [42]SNAP[EB/01]. http://snap.stanford.edu/data/index.html.
    [43]MapReduce_plugin[EB/01].http://www.ibm.com/developerworks/cn/opensour ce/os-cn-hadoop2/mapreduce_plugin.zip.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700