基于映射/规约的网页聚类算法研究

英文题名：Research of Web Clustering Based on MapReduce
作者：于治海
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：聚类 ; 并行 ; 映射/规约 ; 网页
英文关键词：Cluster ; Parallel ; Map/Reduce ; Web
学位年度：2011
导师：国林
学科代码：081203
学位授予单位：哈尔滨工程大学
论文提交日期：2010-12-10

摘要

随着网络应用的普及化,网络信息量飞速的增长。因此,人们如何在海量的数据中获取有用的知识变得越来越重要。通过长时间的研究与探索,人们提出了数据挖掘技术,该技术是一门多专业交叉、综合的学科,使用该技术可以有效的将用户所需的知识提取出来,聚类分析是数据挖掘领域中的重要的内容和基本工具之一。
     数据量呈级数的增长和应用开发的复杂性严重阻碍了多核处理器和多处理器系统发展,进而导致数据不能有效的利用。经典的处理方法是开发一个具有信息传递接口(MPI)的分布式系统,由于该接口在并行应用中只能提供细粒度控制,因此,经典方法的抽象性和复杂性超出了现有的计算能力。与传统的分布式系统相比,映射/规约框架提供了一种比MPI更高级的抽象概念,可以被应用于许多数据密集型的批量处理任务中并且该框架的抽象性和复杂性在现有的计算能力范围内能够被处理。本文在并行计算与映射/规约编程框架研究分析的基础上,对映射/规约框架进行了理论上改进,使改进后框架的计算处理性能提高。在改进框架的基础上,实现了一种基于映射/规约的MRK-Means算法,该算法采用迭代操作的计算,能够实现多次执行映射/规约操作,同时将该算法与网页的海量、动态、更新快等属性特征相结合,提出一种具有属性特征的在线OMRK-Means算法,该算法能够提高在线聚类方法的伸缩性和聚类精确度,并且缩短了聚类操作时间,有效的处理增量式数据。
     通过实验表明,基于映射/规约框架的MRK-Means算法在保证执行效果的基础上,与传统K-Means算法相比,有效地提高了聚类的速度。通过对OMRK-Means算法的收敛性和执行时间、精确度和伸缩性进行试验分析,表明本文提出的在线OMRK-Means算法在数据并行增量的情况下,能加快大型数据集的交互分析,提高聚类处理的精确度,并且有利于可伸缩的网页挖掘。
With the popularity of network applications and network information is growing rapidly, in the flood of data to obtain useful knowledge becomes more and more important for people. Through a long period of research and exploration, data mining techniques have been proposed, which is a multi-disciplinary cross, comprehensive discipline, this technology can extract the desired knowledge for users effectively. Clustering analysis is one of the most important part and the basic tools in data mining.
     The series growth of data and complexity of application development impede the development of multi-core processors and multi-processor system seriously, thus can not effectively use the data. The classic approach is to develop a distributed system with the message passing interface (MPI), which only provides fine-grained control by implementing parallel applications. Therefore, the abstraction and complexity of this method are out of the existing computing ability. Map/Reduce programming framework provides a higher abstraction than MPI, can be used in many data-intensive batch processing tasks, of which the abstraction and complexity are not higher than the present computing ability. Based on the study about distributed computing and Map/Reduce programming framework, this framework is improved, and computing ability of the improved framework is analyzed in theory. MRK-Means by iterating calculation, which performs multiple Map/Reduce operation, meanwhile, this improved Map/Reduce programming framework combines attribute property of web, such as massive, dynamic, fast updates, to explore OMRK-Means with attribute property based on Map/Reduce programming framework, which aim for increasing the scalability of online clustering method, reducing time of clustering, improving clustering accuracy.
     In ensuring the implementation, the experiment shows that OMRK-Means is faster than the traditional clustering algorithm on clustering, such as convergence and time analysis, precision analysis and scalability analysis. It indicates that the proposed method can speed up the interactive analysis of large data sets, improve clustering accuracy and be good for scalable web mining in the case of parallel incremental data.

引文

[1]Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhafl, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS'99), Atlanta, Georgia,1999,5:10-22P
    [2]L. Augusteijn. Sorting morphisms. In S. Swierstra,E Henriques, and J. Oliveira, editors, 3rd International Summer School on Advanced Functional Programming, SpringerVerl-ag,1998,1:1-7P
    [3]孟海东,王淑玲,郝永宽.动态增量聚类的设计与实现.计算机工程与应用2009,45：130-133页
    [4]朱明编著.数据挖掘.第二版.中国科学技术大学出版社,2008：25-42页
    [5]钱彦江.大规模数据聚类技术研究与实现.2009：9-15页
    [6]Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in pr-actice:The Condor experience. Concurrency and Computation:Practice and Experience, 2004
    [7]L. G Valiant. A bridging model for parallel computation. Communications of the ACM, 1997,33(8):103-111P
    [8]吴晓伟.MapReduce并行编程模式的应用与研究.中国科学技术大学硕士学位论文.2008：29-31页
    [9]Dean, J., and Ghemawat, S. MapReduce:simplified data processing on large clusters. ACM2008,2008,51:107-113P
    [10]Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. Mapreduce online. Tech. Rep. UCB/EECS-2009-136, EECS Department, University of California, Berkeley,2009,10
    [11]Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad:distributed data-parallel programs from sequential building blocks. In EuroSys (2007), ACM 2007, 2007,5:59-72P
    [12]Chen, S., and Schlosser, S. W. Map-reduce meets wider varieties of applications. Tech. Rep. IRP-TR-08-05, Intel Labs Pittsburgh Tech Report,2008,5
    [13]Olston, C., Bortnikov, E., Elmeleegy, K., Junqueira, F., and Reed, B. Interactive analysis of web-scale data. In CIDR 2009,10
    [14]Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. MAD skills:New analysis practices for big data.2009,2:1481-1492P
    [15]Qiming Chen, Andy Therber, Meichun Hsu, Hans Zeller, Bin Zhang, Ren Wu. Efficiently Support MapReduce-like Computation Models Inside Parallel DBMS. IDEAS 2009, September 16-18, Cetraro, Calabria, ACM 2009.
    [16]Hellerstein, J. M., Haas, P. J., and Wang, H. J. Online aggregation. In Proc. ACM SIGMOD Int. Conf. Management of Data.1997,5(26):171-182P
    [17]Hellerstein, Avnur, Chou, Hidber, Olston, Raman, Roth, and Haas. Interactive data analysis:The control project. COMPUTER:IEEE Computer 321999.
    [18]Cladio Carpineto, Stanisiaw Osinski, Giovanni Romano, Dawid Weiss. A Survey of Web Clustering Engines. ACM Computing Surveys,2009,6:17-17:38P
    [19]M.Ester, H.-P. Krieel, J. Sander, et al. incremental Clustering for Mining in a Data Warehousing Environment [C]. In Proceedings of the 24th International Conference on Very Large Data Bases, New York, Morgan Kaufmann Publishers Inc.,1998:323-333
    [20]陈峰.基于聚类的增量数据挖掘研究.大连海事大学硕士学位论文.2007.
    [21]S. Asharaf, M. Narasimha Murty S.K. Shevade Rough set based incremental clustering of interval data. Pattern Recognition Letters 2006,4:515-519P.
    [22]C.-C.Hsu, and Y.-P. Huang. Incremental clustering of mixed data based on distance hierarchy.2008,10(28):1177-1185P.
    [23]Domingos, P., and Hulten, G. Mining high-speed data streams. In Knowledge Discovery and Data Mining.2000,3:71-80P
    [24]Domeniconi, C., and Gunopulos, D. Incremental support vector machine construction. In ICDM'01,2001:589-592P
    [25]Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., and Xu, X. Incremental clustering for mining in a data warehousing environment. In VLDB'98,1998:323-333P
    [26]Edward Y. Chang, Hongjie Bai, Kaihua Zhu. Parallel Algorithms for Mining Large-scale Rich-media Data. ACM 2009,2009:917-918P
    [27]Graf, H. P., Cosatto, E., Bottou, L., Durdanovic, I., and Vapnik, V. Parallel support vector machines.2004
    [28]Dhillon, I. S., and Modha, D. S. A data-clustering algorithm on distributed memory multiprocessors. In Workshop on Large-Scale Parallel KDD Systems.2000:245-260P
    [29]Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. Map-reduce for machine learning on multicore.2006:281-288P
    [30]Alvaro Pereira, Ricardo Baeza-Yates. A Model for Fast Web Mining Prototyping. ACM2009,2009:114-123P
    [31]Boutsinas B, Gnardellis. On Distributing the Clustering Process. Pattern Recognition, 2002,23:999-1008P.
    [32]陆建江,徐宝文.区间数据的并行模糊聚类算法,东南大学学报(自然科学版),2003,33(4)：406-409页
    [33]OGC(Open GIS Consortium).Open GISguide:Introduction to interoperable geoprocessing
    [34]张锡琴.多数据流的增量聚类实现与应用.计算机工程,2009-7：49-51页
    [35]Arasu, A., Babu, S., and Widom, J. The CQL continuous query language:semantic foundations and query execution.2006:121-142P
    [36]Aaron Kimball, Sierra Michels-Slettvet, Christophe Bisciglia. Cluster Computing for Web-Scale Data Processing. SIGCSE'08, March 12-15,2008, Portland, Oregon, USA. ACM2008,2008:116-120P
    [37]周兵,沈均毅,彭勤科.集群环境下的并行聚类算法.计算机工程,2004,30(4)：4-7页
    [38]Lamehamedi H, Bensaid A D, Kebbal E G, et al. Adaptive programming:application to a semi-supervised point prototype clustering algorithm. In:International Conference on Parallel and Distributed Processing Techiques. Nevada, USA,1999:2753-2759P
    [39]吉根林,凌霄汉,杨明.一种基于集成学习的分布式聚类算法.东南大学学报2007,7(37)：585-588页
    [40]陆建江,徐宝文.区间数据的并行模糊聚类算法,东南大学学报,2003,33(4)：406-409页
    [41]郑晓曦,宋浩远,王引辉.大型数据库中聚类方法的分析.计算机与现代化.2008(4)：27-35页
    [42]Vignesh T.Ravi, Gagan Agrawal. Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster. ACM 2009,2009:308-315P
    [43]http://hadoop.apache.org/
    [44]Jeffrey dean, Sanjay Ghemawat. MapReduce:A Flexible Data Processing Tool. ACM 2010,5(53):72-77P
    [45]周锋,李旭伟.一种改进的MamReduce并行编程模型.计算技术与信息发展.2009,2：65-66页
    [46]孙广中,肖锋,熊曦MapReduce模型的调度及容错机制研究.微电子学与计算机2007,9(24)：178-181页
    [47]Bingsheng He, Wenbin Fang, Qiong Luo. Mars:A MapReduce Framework on Graphics Processors. ACM2008,2008,8:260-269P
    [48]万至臻.基于MapReduce模型的并行计算平台的设计与实现.浙江大学硕士学位论文.2008：16-20页
    [49]Thomas Sandholm, Kevin Lai. MapReduce Optimization Using Regulated Dynamic Prioritization. ACM2009,2009:299-310P
    [50]Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Bigtable:A Distributed Storage System for Structured Data.2007:205-218P
    [51]Jaideep Dhok, Nitesh Maheshwari, Vasudeva Varma. Learning Based Opportunistic Admission Control Algorithm for MapReduce as a Service. ACM 2010,2010:153-160P
    [52]Jimmy Lin, Chris Dyer. Data Intensive Text Processing with MapReduce. Proceedings of NAACL HLT 2009,2009,7:1-2P
    [53]Jeffrey, sanjay. Mapreduce:Simplified Data Processing on Large Clusters. ACM 2008, 2008,3(51):107-113P

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700