Design and Implementation of Multi-User Parallel File IO Based on HDFS
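Abstract
With the rapid development of computer networks and their applications, and especially since Google proposed Internet-based massive data storage and the MapReduce parallel computing model, networked data storage management and parallel analysis and processing have become a focus of research in both academia and industry. Hadoop, as one of the reference implementations of these ideas, has attracted wide attention.
     HDFS, the distributed file system at the core of Hadoop, uses a lock mechanism to control parallel file IO and does not support parallel reads and writes by multiple users on the same file, which limits the performance of multi-user parallel file operations. To address this, this paper targets the characteristics of massive log-type data, proposes a parallel file IO model that is not based on locks, and verifies the effectiveness of the model through experiments.
     The main work of this paper includes:
     (1) Related work on Hadoop is analyzed in depth. In particular, based on a detailed analysis of its distributed file system HDFS, and in response to HDFS's lack of support for multi-user parallel file reads and writes, an improvement is proposed that enables such support.
     (2) By analyzing the concurrency control model of HDFS and considering the characteristics of massive log-type data, a multi-user parallel IO model for distributed file systems that uses no mutual exclusion mechanism is proposed. Under this model, at the cost of an appropriate reduction in the completeness of the data read, multiple users can read and write the same file in parallel, and read it in parallel with one another.
     (3) By modifying the original HDFS implementation, a distributed file system supporting multi-user parallel IO is designed and implemented. Experiments show that the modification effectively improves the performance of multi-user parallel file IO.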
With the rapid development of computer networks and their applications, especially since Google proposed Internet-based mass data storage and the MapReduce parallel computing model, network-based data storage management and parallel analysis and processing have become a focus of academia and industry. As one of the reference implementations of these ideas, Hadoop has received widespread attention.
     To control parallel file IO, the Hadoop Distributed File System (HDFS), the core of Hadoop, uses a lock mechanism, and it does not support multiple users reading and writing the same file in parallel. This paper therefore proposes a parallel file IO model based on Block granularity and verifies the effectiveness of the model through experiments.
     The main work of this paper is as follows:
     (1) Related work on Hadoop is analyzed in depth, particularly the Hadoop Distributed File System (HDFS). To address Hadoop's lack of support for multi-user parallel file IO, an improvement is proposed.
     (2) By analyzing the implementation of Hadoop, a multi-user parallel IO model without a mutual exclusion mechanism is proposed for distributed file systems. Based on this model, at the cost of an appropriately reduced integrity of data reads, multiple users can read and write the same file in parallel.
     (3) By modifying the source code, the functionality described in the model is implemented, and experiments are carried out to verify the function and performance of the model.
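
The abstract does not describe the mechanism behind the lock-free model, so the following is only an illustrative single-process sketch of one way the idea in item (2) could work, assuming block-granularity visibility. Each writer appends to a block of its own, and a reader never waits: it returns only blocks that have already been sealed, and therefore may miss records that are still being written. All names here (`LogFile`, `Writer`, `Block`, `read_snapshot`) are hypothetical and are not part of the HDFS API or of the author's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """A fixed-capacity chunk of log records owned by exactly one writer."""
    owner: str
    capacity: int
    records: List[str] = field(default_factory=list)
    sealed: bool = False  # set once, after the last record is in place


class LogFile:
    """Toy append-only file (hypothetical; not the HDFS API)."""

    def __init__(self, block_capacity: int = 2) -> None:
        self.block_capacity = block_capacity
        self.blocks: List[Block] = []  # kept in allocation order

    def open_writer(self, owner: str) -> "Writer":
        return Writer(self, owner)

    def read_snapshot(self) -> List[str]:
        """Return records from sealed blocks only; never waits for a writer."""
        out: List[str] = []
        for block in list(self.blocks):  # iterate over a fixed copy of the list
            if block.sealed:
                out.extend(block.records)
        return out


class Writer:
    """Appends records into blocks that no other writer touches."""

    def __init__(self, log: LogFile, owner: str) -> None:
        self.log = log
        self.owner = owner
        self.current: Optional[Block] = None

    def append(self, record: str) -> None:
        if self.current is None:
            # Allocate a private block, so writers never share a write target.
            self.current = Block(self.owner, self.log.block_capacity)
            self.log.blocks.append(self.current)
        self.current.records.append(record)
        if len(self.current.records) >= self.current.capacity:
            self.current.sealed = True  # the block becomes visible to readers
            self.current = None

    def close(self) -> None:
        """Seal a partially filled block so its records become readable."""
        if self.current is not None:
            self.current.sealed = True
            self.current = None


log = LogFile(block_capacity=2)
alice = log.open_writer("alice")
bob = log.open_writer("bob")
alice.append("a1")
bob.append("b1")
alice.append("a2")           # alice's block is full and gets sealed
early = log.read_snapshot()  # bob's open block is skipped, not waited on
bob.close()
late = log.read_snapshot()
print(early)  # ['a1', 'a2']
print(late)   # ['a1', 'a2', 'b1']
```

The first read returns without bob's record, which is the relaxed read completeness the abstract accepts for log-type data. A real distributed implementation would also have to address concurrent block allocation and replication, which this sketch leaves out.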
