MapReduce与量子进化算法的研究及应用
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
21世纪是一个信息化的时代,信息以及数据快速增长,这对计算能力提出了更高的要求,云计算在此环境下应运而生,它带来了新的变革。云计算是一种商业计算模型,它将计算任务分布在大量计算机构成的资源池上,使各种应用系统能够根据需要获取计算力、存储空间和信息服务。云计算是分布式计算、网格计算和并行计算的进一步发展,提供了一种更为有效的并行模型,因此如何将现有的并行算法应用于云计算中成为研究的主要内容。
     编写云平台下的并行化程序不同于以往单机环境下的并行程序,以往的并行化实现主要是基于多线程,且局限于单机内。而云计算环境下的并行化注重多机间,甚至是机器集群间的并行化,而且云环境的搭建都是基于普通的计算机。它主要是将大任务划分为多个小任务,然后分配给计算机集群来执行的,这大大降低了成本。数据挖掘领域常常受到海量数据的困扰,如果将云计算引入数据挖掘领域,必然会带来一场新的变革。MapReduce模型是谷歌在2004年提出的,是其开发的在超大集群下进行海量数据计算的一种编程模式,主要被用来处理信息量大且需要在可接受时间内完成的任务。目前MapReduce模型已被用来解决数据挖掘与机器学习中的一些问题了。
     量子进化算法近年来也开始受到广大研究者的关注,量子进化算法具有天然的并行性,非常适合在大规模并行平台上实现,而云平台为量子进化算法并行性的实现奠定了物质基础。覆盖算法是由张铃教授提出的可以解决数据挖掘中分类问题的算法,它是一种构造性的神经网络学习算法,采用的是M-P神经元的几何意义。覆盖算法实质上是将求出的覆盖领域作为三层神经网络的隐含层,输入层看做测试集,输出层看做测试集的分类结果。目前覆盖算法已得到了广泛的推广。
     本文利用量子进化算法的天然并行性及云平台的优越性,在云平台上实现了量子进化算法,结果显示,在云平台下该算法可以达到更好的并行效率。为了进一步研究量子进化算法的性能,本文结合量子进化算法种群规模小,收敛速度快,全局寻优性能强等特点,将其用于覆盖算法,优化覆盖中心,采用适应度来评价解的优劣,提出了一种改进的量子优化覆盖算法,利用五组数据进行分析对比,表明本文提出的改进算法可以有效地提高分类的精度和效率。最后,利用MapReduce模型的并行平台实现了对淘宝网数据信息的处理和检索。
21st century is an era of information, information and data grow rapidly, this is a higher demand on our computing power, cloud computing came into being in this environment, It has brought us a new change. Cloud computing is a commercial calculation model, in this model, the computing tasks are distributed in a pool of a large number of computer resources, and it makes all kinds of application system can acquire computing power, storage space and information services according to the need. Cloud computing is the further development of distributed computing, grid computing and parallel computing, it has given us a more efficient parallel model, so how to put the existing parallel algorithms used in cloud computing become the content of our study.
     Write parallel programs on the cloud platform, which is different from the traditional parallel procedures, traditional parallel realization is mainly based on multi-threaded, and limited in the single machine inside. Parallel thought on cloud platform focuses on multiple computers, even the computer clusters, and the construction of the cloud environment is based on the ordinary computer. The main idea is that dividing a big task into many small tasks which are carried out on computer clusters, which will greatly reduce the cost. The field of data mining is often plagued by huge amounts of data, if introduce the cloud computing to the field of data mining, is bound to bring a new change.MapReduce is put forward in2004by Google, it is a programming model for the mass data calculation under the large cluster, and it is mainly used to handle large amount of information and complete the task within an acceptable time. Now MapReduce model has been used to solve some problems in data mining and machine learning.
     In recent years, many researchers begin to pay more attention to quantum evolutionary algorithm, quantum evolutionary algorithm has natural parallelism, and suits the realization on large-scale parallel computer, and cloud platform for quantum evolutionary algorithm and the realization of the parallel established the material basis. Covering algorithm, a solution of classification problems in data mining algorithms, was originally proposed by Professor Zhang Ling, It is a constructive neural network learning algorithm, using the geometric meaning of the M-P neurons. The essence of Covering algorithm is that regard coverage areas as the hidden layer of the three-layer neural network, regard the test set as input layer, and regard the classification results of the test set as output layer. Currently Covering algorithm has been widely promoted.
     In this paper, taking advantage of the natural parallelism of the quantum evolutionary algorithm and the cloud platform, realized the parallel quantum evolutionary algorithm on cloud platform, and the results show that there will be a better parallel efficiency in a cloud platform. To further study the performance of quantum evolutionary algorithm, this paper takes advantages of it, for example: diversity features good, small population, fast convergence, strong global optimal performance and so on, introduces it into covering algorithm to optimize coverage center, adopts fitness to evaluate the pros and cons of the solution, and propose an improved quantum-optimal coverage algorithm. Through experiment and comparative analysis in five groups of data, its results show that the proposed algorithm can effectively improve the classification accuracy and efficiency. At last, using MapReduce model realized the processing and retrieval of taobao's data information.
引文
[1]《虚拟化与云计算》小组.虚拟化与云计算[M].电子工业出版社.2009:111-187
    [2]李军化,云计算及若干数据挖掘算法的MapReduce化研究[D].成都:电子科技大学.2010:1-4
    [3]陈金,邓倩妮.云计算与其关键技术[J].计算机应用,2009,29(9):2562-2566
    [4]P.Watson, P.Lord, F.Gibson,P.Periorellis,G..Pitsilis. Cloud Computing for e-Science with CARMEN[J]. IBERGRID,2008.05
    [5]P. Sambath,Narayanan. From Grid Computing to Cloud Computing-The IBM Approach. Garuda Partner Meet[R].Bangalore, India,2008.3
    [6]Smith, Roger. Computing in the cloud[J]. Research Technology Management,2009, 52(5):65-68
    [7]赵春艳.云环境下作业调度算法研究与实现[D].北京:北京交通大学2009
    [8]韩力群.人工神经网络教程[M],北京:北京邮电大学出版社,2007:185-201
    [9]李红梅.遗传算法概述[J].软件导刊.2009,8(1):67-68
    [10]李智勇.基于粒子群优化算法的Web挖掘技术的研究[J].中国科技财富.2011(8):178-178
    [11]杨淑媛.量子进化算法的研究及其应用[D].西安:西安电子科技大学.2003
    [12]陈修宽.Web数据挖掘综述[J].山东轻工业学院学报.2009,23(3):23-28
    [13]Abhishek Verma, David E. Goldberg, Roy H. Campbell. Scaling Genetic Algorithms using MapReduce[J]. IEEE,2009:13-18
    [14]江小平,李成华.k-means聚类算法的MapReduce并行化实现[J].华中科技大学学报(自然科学版),2011,39(I):120-124
    [15]Gengxin Miao, Yangqiu Song, Dong Zhang, Hongjie Bai, Parallel Spectral Clustering Algorithm for Large Scale Community Data Mining[J]. WWW2008, April 21-25,2008, Beijing, China
    [16]Narayanan A, Moore M. Quantum inspired genetic algorithms[J]. Proc of IEEE Int Conf on Evolutionary Computation. Nagoya,96:61-66
    [17]Han K H, Kim J H. Genetic quantum algorithm and its application to combinatorial optimization problem[J]. Proc of the Congress on Evolutionary Computation. LaJolla,2000:112-117
    [18]杨淑媛.量子进化算法[J].工程数学学报.2006,23(2):235-246
    [19]张凌超.基于“云计算”的数字图书馆建设模式初探[J].图书馆研究.2010(11):39-42
    [20]陈阿林.云计算应用直通车[M].重庆大学出版社.2009:7-14
    [21]高勋.基于云计算的Web结构挖掘算法研究[D].北京:北京交通大学.2010
    [22]Jeffrey D,Sanjay G. MapReduce:Simplified Data Processing on Large Clusters[J]. COMMUNICATIONS OF THE ACM.2008,51(1):107-113
    [23]陈涛.云计算理论及技术研究[J].重庆交通大学学报(社科版),2009,9(4):104-106.
    [24]邓自立.云计算中的网络拓扑设计和Hadoop平台研究[D].合肥:中国科学技术大学.2009
    [25]Tom White. Hadoop:The Definitive Guide[M]. O'Reilly Media, Inc,2009:29-30
    [26]Apache Hadoop Core.http://hadoop.apache.org/core.2009
    [27]Hadoop Site, http://hadoop.apache.org/
    [28]Hadoop 云计算计术介绍. http:www.chinacloud.cn/
    [29]HDFS.http://hadoop.apache.org/common/docs/current/hdfs_design.html
    [30]Papadimitriou S, Jimeng Sun. Large-scale Data Mining:MapReduce and Beyond Part 2:Algorithms.2009
    [31]吴宝贵,丁振国.基于MapReduce的分布式搜索引擎的研究[J].现代图书情报技术.2007(8):52-55
    [32]郑启龙,房明,汪胜,王向前.基于MapReduce模型的并行科学计算[J].微电子学与计算机.2009,26(8):13-17
    [33]谢桂兰,罗省贤.基于Hadoop MapReduce模型的应用研究[J].微型机与应用,2010,25(3):4-7
    [34]侯泽东.基于Hadoop的异构海洋数据处理模型研究[D].上海:上海海洋大学,2011
    [35]范胜辉.量子进化算法及其应用研究[D].南京:南京航空航天大学.2010
    [36]吴瑞祥.智能数据挖掘与知识发现[M].西安电子科技大学出版社.2004:73-88
    [37]钱洁.量子进化算法研究现状综述[J].控制与决策,2011,26(3):321-325
    [38]游晓明,帅典熏,刘升.并行量子进化算法的研究与实现[J].计算机应用与软件,2008,25(5):231-233
    [39]Andrew W. McNabb, Christopher K. Monson, Kevin D. Seppi. Parallel PSO Using MapReduce[J]. IEEE,2007:7-14
    [40]Chao Jin, Christian Vecchiola, Rajkumar Buyya. MRPGA:An Extension of MapReduce for Parallelizing Genetic Algorithms[J]. IEEE,2008:214-221
    [41]宋海洲,魏旭真.求解0-1背包问题的混合遗传算法[J].华侨大学学报(自然科学版).2006,27(1):16-19
    [42]覃朝勇,黄景文,郑建国,莫国莉.求解背包问题的混合量子进化算法[J].小型微型计算机系统.2011,32(2):305-309
    [43]朱玉全.数据挖掘技术[M].东南大学出版社.2006:102-129
    [44]贾瑞玉,李永顺,李景成,冯伦阔.佳点集遗传覆盖算法[J].计算机工程.2009,35(24):196-198
    [45]张玲张钹.遗传算法机制的研究[J].软件学报,2000,11(7):945-952
    [46]李永顺.软计算融合算法及其在Captcha识别方面的应用研究[D].合肥:安徽大学.2010
    [47]贾瑞玉,宁再早.粒子群优化覆盖算法[J].计算机工程.2011,37(21):167-169
    [48]张铃,张钹.M-P神经元模型的几何意义及其应用[J].软件学报.1998,9(5):334-338
    [49]张玲,张钹,殷海风.多层前向网络的交叉覆盖设计算法[J].软件学报.1999,10(7):737-742
    [50]张月琴,丁旭玲.一种改进的领域覆盖算法[J].计算机工程.2011,37(18):190-194
    [51]杨佳,曹长修,混合量子优化算法理论及应用研究[D].重庆:重庆大学博士论文.2009:63-71
    [52]杨靖韬,陈会果.对网络爬虫技术的研究[J].科技创业月刊,2010(10):170-170
    [53]罗刚,王振东.自己动手写网络爬虫[M].北京:清华大学出版社,2010:29-30
    [54]嵌入式数据库系统Berkeley DB[DB/OL]. http://www.ibm.com/developerworks/cn/linux/1-embdb/. Last accessed by 2011-06-15
    [55]MD5算法[DB/OL] http://baike.baidu.com/view/706946.htm. last accessed by 2011-06-15

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700