Research on the Adaptive Local Hyperplane K-Nearest-Neighbor Algorithm on the Hadoop Platform
Abstract
The Adaptive Local Hyperplane K-nearest-neighbor classification algorithm (ALH) is an improvement on KNN that outperforms other KNN variants on many data sets. However, ALH is computationally heavy: for each test instance it must compute the distance from that instance to the local hyperplane constructed from its K nearest neighbors in each class, which requires complex matrix operations such as matrix multiplication and matrix inversion. Repeated experiments show that the traditional ALH algorithm is practical only for small-scale data; as the dimensionality or the number of samples grows, it slows down markedly because the amount of computation soars, a bottleneck shared by many traditional data mining algorithms.
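The per-class neighbor search that ALH starts from can be pictured as follows. This is a minimal NumPy sketch under our own naming; the function and its signature are illustrative, not the thesis's implementation.

```python
import numpy as np

def knn_per_class(x, X, y, k):
    """For a test point x, return the k nearest training samples of each class.

    x : (d,) test sample; X : (n, d) training samples; y : (n,) labels.
    Returns a dict mapping each class label to a (k, d) matrix of neighbors.
    """
    neighbors = {}
    for c in np.unique(y):
        Xc = X[y == c]                      # samples belonging to class c
        d2 = np.sum((Xc - x) ** 2, axis=1)  # squared Euclidean distances to x
        idx = np.argsort(d2)[:k]            # indices of the k closest samples
        neighbors[c] = Xc[idx]
    return neighbors
```

Because this search is repeated for every test instance and every class, it is one of the two stages the thesis identifies as dominating the running time.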
     Under this computational load, ALH must, on some data sets (for example, high-dimensional data with many samples), perform very time-consuming matrix operations when evaluating the hyperplane distance: adding, subtracting, or multiplying matrices with tens of thousands of elements, or inverting a sizable square matrix. In this thesis we analyze the proportion of running time spent in each stage of ALH and conclude that the two dominant costs are finding the k nearest neighbors of a test instance in every class and computing the distance from the test instance to the local hyperplane constructed from those neighbors. Moreover, our experiments show that as k increases, the time spent computing the hyperplane distance grows the most noticeably.
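The matrix inversion the paragraph refers to arises when the nearest hyperplane point is solved by regularized least squares. The sketch below follows the standard local-hyperplane construction (centroid plus the span of the centered neighbors); the ridge term `lam` and all names are illustrative assumptions, and ALH's adaptive feature weighting is omitted for brevity.

```python
import numpy as np

def hyperplane_distance(x, H, lam=1e-3):
    """Distance from x to the local hyperplane through the rows of H.

    H : (k, d) matrix holding the k nearest neighbors of one class.
    lam : small ridge term keeping the (k x k) system invertible;
          its value here is an illustrative choice.
    """
    h_bar = H.mean(axis=0)          # centroid of the neighbors
    V = (H - h_bar).T               # (d, k) spanning directions of the hyperplane
    r = x - h_bar
    # alpha = (V^T V + lam*I)^{-1} V^T r : the inversion the text refers to
    A = V.T @ V + lam * np.eye(V.shape[1])
    alpha = np.linalg.solve(A, V.T @ r)
    proj = V @ alpha                # projection of r onto the hyperplane
    return np.linalg.norm(r - proj)
```

The linear system is k-by-k, so the cost of this step grows quickly with k, which matches the experimental observation above that larger k inflates the hyperplane-distance time most visibly.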
     In recent years, with the rise of cloud computing, massive data processing and computation have been migrating to cloud platforms, and the trend is clear. Hadoop, the most prominent open-source implementation of cloud computing, is applied in many areas, including data mining and machine learning, and excels at handling large-scale data and large-scale computation. For these reasons we port ALH to the Hadoop platform: with the algorithm suitably redesigned, the computing power of a Hadoop cluster can resolve the problems above. This thesis first analyzes the proportion of running time taken by ALH's key stages, then redesigns and parallelizes those stages. Building on a study of the Hadoop platform, we propose a corresponding MapReduce formulation of ALH so that the algorithm runs in parallel on Hadoop. By exploiting the computing power of multiple machines, our experiments show that the Hadoop-based ALH algorithm schedules the computing resources of each node effectively and dramatically reduces the running time on high-dimensional data sets.
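The MapReduce formulation can be pictured with the single-process sketch below, written in the spirit of Hadoop Streaming. The key/value layout (one record per test-sample/class hyperplane distance, reduced to the nearest class) is our illustrative assumption, not the thesis's exact job design.

```python
from collections import defaultdict

def map_phase(test_id, class_distances):
    """Mapper: emit one (test_id, (label, distance)) pair per class.

    class_distances maps each class label to the local-hyperplane
    distance computed for this test sample on the current node.
    """
    for label, dist in class_distances.items():
        yield test_id, (label, dist)

def shuffle(pairs):
    """Group intermediate pairs by key, as Hadoop's shuffle stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(test_id, values):
    """Reducer: assign the class whose local hyperplane is nearest."""
    label, _ = min(values, key=lambda lv: lv[1])
    return test_id, label
```

Because each mapper scores test samples against only its own partition of the training data, the expensive neighbor search and matrix work are spread across the cluster, and the reducer's final comparison is cheap.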
