Research and Application of a Distributed Log Analysis System Based on the Map/Reduce Framework
Abstract
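This thesis studies a distributed log analysis system built on the Hadoop framework for cloud computing clusters. The system uses the MapReduce computing model for distributed computation and HDFS for distributed storage. It applies a divide-and-conquer strategy to analyze the massive data produced by enterprise cloud computing platforms, to monitor the running state of the servers in the cluster, and to mine valuable information from that data.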
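     The system first uses the RandomAccessFile class on the monitored cluster to collect data such as the system logs on each node and the logs produced by thread pools. It then uses SSH tools to transfer the collected logs to the cluster responsible for analysis, reorganizing the data with clustering methods along the way. On the analysis cluster, we implement the map and reduce modules of the MapReduce framework on the Hadoop platform to analyze the logs in a distributed fashion, and we support customized log analysis driven by user-defined configuration. Finally, the generated analysis report is imported into Excel VBA, which presents it to the user in graphical form.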
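The incremental collection step can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `IncrementalLogReader`, the chunk limit, and the truncation handling are assumptions made for illustration. The only library facility relied on is `java.io.RandomAccessFile`, which allows seeking to the byte offset that was already shipped and reading only what was appended since.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: returns only the bytes appended to a log file since the last call. */
final class IncrementalLogReader {
    // Assumed upper bound on how much is read per call, to keep memory use bounded.
    private static final int MAX_CHUNK = 4 * 1024 * 1024;

    private final String path;
    private long offset; // number of bytes already collected

    IncrementalLogReader(String path) {
        this.path = path;
    }

    /** Reads the newly appended text, or returns "" when nothing new was written. */
    String readNew() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length < offset) {
                offset = 0; // file shrank (truncated or rotated): start again from the beginning
            }
            if (length == offset) {
                return "";
            }
            int n = (int) Math.min(length - offset, MAX_CHUNK);
            byte[] buf = new byte[n];
            raf.seek(offset);   // skip everything collected earlier
            raf.readFully(buf); // read exactly n new bytes
            offset += n;
            // Assumes UTF-8 logs; a chunk boundary could in principle split a multi-byte character.
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

Each call returns only the delta, so a collector could invoke `readNew()` periodically and forward the result (for example over SSH) without resending earlier log content.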
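     In addition, we apply the distributed log analysis system to Hadoop benchmarking. Through extensive experiments and data analysis, we compare the performance of I/O-intensive benchmarks under different choices of underlying I/O software (mainly I/O schedulers and file systems), to help Hadoop users choose among them. By tuning the parameters of these underlying software layers, we also propose optimizations for MapReduce programs running on Hadoop. Specifically, we compare the performance of the TeraSort benchmark under several I/O scheduling algorithms and several representative file systems, using the distributed log analysis tool described above to collect and analyze the data.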
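     Finally, building on these results, we further improve the performance of I/O-intensive Hadoop benchmarks, for example by improving the I/O scheduling algorithm and tuning file system parameters.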
In this paper, we implement a MapReduce-based framework for analyzing the distributed logs generated in cloud computing environments. The framework is built on top of Hadoop, an open-source distributed file system and MapReduce implementation.
     We first use RandomAccessFile to incrementally aggregate system logs from each node of the monitored cluster and collect them on the analysis cluster. We then integrate the collected logs and implement a MapReduce-based algorithm to parse the resulting log files. Finally, to make the best use of the collected data, we provide a flexible and powerful way to display the monitoring and analysis results.
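To make the map and reduce roles concrete, here is a small plain-Java sketch that counts log lines per severity level. It deliberately does not use the Hadoop API; the local `run` driver merely stands in for the shuffle that Hadoop would perform. The class name `LogLevelCount` and the log line format (`<date> <time> <LEVEL> <message>`) are assumptions for illustration, not details taken from the thesis.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical example: counts log lines per level with separate map and reduce steps. */
final class LogLevelCount {

    /** Map step: one log line yields zero or one (level, 1) pair. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] fields = line.trim().split("\\s+");
        // Assumed format: "<date> <time> <LEVEL> <message...>"; shorter lines are skipped.
        if (fields.length >= 3) {
            out.add(new SimpleEntry<>(fields[2], 1));
        }
        return out;
    }

    /** Reduce step: sums all counts emitted for one level. */
    static int reduce(String level, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Local driver: groups map output by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, the same logic could be distributed across nodes, which is the property the framework relies on.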
     In addition, we quantitatively evaluate and characterize the Hadoop framework through I/O-intensive benchmarking, in order to optimize performance and to understand the trade-offs of system design for MapReduce-based data analysis on Hadoop.
     First, we characterize and evaluate the performance of I/O-intensive benchmark workloads under different underlying software choices, covering both I/O schedulers and native file systems.
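When comparing I/O schedulers, it helps to record which one was active during each run. On Linux, the file `/sys/block/<device>/queue/scheduler` lists the available schedulers on one line, with the active one in square brackets (for example `noop deadline [cfq]`). The sketch below parses such a line; the class name `IoSchedulerInfo` is hypothetical, and the set of scheduler names depends on the kernel version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses the one-line content of /sys/block/<device>/queue/scheduler. */
final class IoSchedulerInfo {

    /** Returns the scheduler shown in square brackets, or null if none is marked. */
    static String active(String sysfsLine) {
        for (String tok : sysfsLine.trim().split("\\s+")) {
            if (tok.length() > 2 && tok.startsWith("[") && tok.endsWith("]")) {
                return tok.substring(1, tok.length() - 1);
            }
        }
        return null;
    }

    /** Returns every listed scheduler name, with the brackets removed from the active one. */
    static List<String> available(String sysfsLine) {
        List<String> names = new ArrayList<>();
        for (String tok : sysfsLine.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // blank input produces one empty token
            }
            if (tok.length() > 2 && tok.startsWith("[") && tok.endsWith("]")) {
                tok = tok.substring(1, tok.length() - 1);
            }
            names.add(tok);
        }
        return names;
    }
}
```

The input line could be obtained by reading the sysfs file for the disk under test (the device name, such as `sda`, is an assumption that varies per machine), and the result stored alongside each benchmark measurement.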
     Then, we propose several potential enhancements to optimize the performance of Hadoop benchmarks, and conclude with a summary of our experiments.
