Research on a Distributed File System for Massive E-mail Storage
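Abstract
With the rapid development of Internet technology and users' pressing need to communicate with one another, e-mail has become an increasingly important channel for work and communication, and its data volume is growing rapidly. Traditional file systems struggle to meet the performance requirements of storing and reading massive data, and existing distributed file systems do not support massive e-mail storage well. Against this background, this thesis studies a distributed file system for massive e-mail storage.
     A distributed file system uses a network to combine multiple machines into one virtual file system. This thesis studies and implements a distributed file system for massive e-mail storage. Besides strong fault tolerance, availability, and scalability, the system must also deliver high I/O performance. Because e-mail arrives from distinctive sources, the system must support direct writes from multiple kinds of data source. To that end, the thesis focuses on the following problems and implements the system accordingly:
     First, based on the project's requirements for the file system and an analysis of existing distributed architectures, we design the architecture of this distributed file system, and then design and implement each of its components according to that architecture.
     Second, read-write locks and leases are introduced from the start of designing the system's internal write and read algorithms. For the read and write paths, we study multi-policy load balancing across the different components of the system. Block replica redundancy serves as the core fault-tolerance mechanism, and a fault-tolerance scheme is designed for each component of the system.
     Third, e-mail comes from different sources: the general data sources FTP, HTTP, and FILE, and the mail-specific sources SMTP, IMAP, and POP3. We study a common interface for these data sources and implement writes into the distributed file system through that interface. To improve I/O performance and data integrity, compression and synchronization information are added to the stored file format.
     Finally, we test the I/O performance of the distributed file system. With only a limited number of machines available, a speed stability test is proposed so that the I/O results measured on the current system also hold on larger clusters. The measured write speed is above 20 MB/s and the read speed is about 40 MB/s, which shows that the system has high I/O performance.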
With the rapid development of Internet technology and Internet users' urgent need to communicate, e-mail is becoming one of the most important means of communication, and the scale of its data is expanding fast. Traditional file systems have difficulty meeting the performance requirements of massive data, and current general-purpose distributed file systems do not support massive e-mail storage well. In this context, this paper presents research on a distributed file system dedicated to massive e-mail storage.
     A distributed file system is a virtual file system formed from multiple networked computers. This paper studies and implements a distributed file system dedicated to massive e-mail storage. Besides excellent fault tolerance, availability, and scalability, the system offers high I/O performance. Because of the special nature of e-mail sources, the system must support writing data from sources that use several different protocols directly into the file system. This paper therefore focuses on the following research and implements the system according to its results.
     First, according to the project's requirements for the file system and a careful analysis of previously proposed architectures, we design the architecture of the distributed file system. In accordance with that architecture, we design and implement each component of the system.
     Second, the system introduces read-write locks and leases at the start of designing the file system's reading and writing algorithms. In the course of design and implementation, the paper studies load balancing for the system's read and write operations. The core of the system's fault tolerance is block replication; on top of replicas, we design a specific fault-tolerance scheme for each component of the system.
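     The abstract does not describe the lease protocol itself, so the following is only a generic sketch of how a time-bounded write lease can serialize writers. The class name `LeaseTable`, the single-writer rule, the lease duration, and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are all assumptions made for illustration, not details from the thesis.

```python
class LeaseTable:
    """Hypothetical single-writer lease table; not the thesis's actual design."""

    def __init__(self, duration):
        self.duration = duration          # lease length in seconds (assumed)
        self._leases = {}                 # path -> (holder, expiry time)

    def acquire(self, path, holder, now):
        """Grant or renew a write lease; return True on success."""
        current = self._leases.get(path)
        if current is not None:
            owner, expiry = current
            if owner != holder and now < expiry:
                return False              # another writer holds a live lease
        self._leases[path] = (holder, now + self.duration)
        return True

    def release(self, path, holder):
        """Give the lease back early; ignored if holder is not the owner."""
        current = self._leases.get(path)
        if current is not None and current[0] == holder:
            del self._leases[path]

    def holder(self, path, now):
        """Return the live lease holder, or None if absent or expired."""
        current = self._leases.get(path)
        if current is None or now >= current[1]:
            return None
        return current[0]
```

     In this sketch a writer that crashes simply stops renewing, and its lease lapses after `duration` seconds, so no explicit unlock message is needed for recovery. That property is the usual reason for preferring leases over plain locks in a fault-tolerant design.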
     Third, e-mail comes from many data sources. There are general data sources: FTP, HTTP, and FILE (the local file system). There are also e-mail-specific data sources: POP3, IMAP, and SMTP. This paper studies a common interface across these protocols and implements the system's write support on top of that interface. To raise the system's I/O performance and data integrity, the system's file format adds compression and sync information.
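     The abstract says only that compression and sync information are added to the file format; it does not give the layout. As a rough illustration of why sync information helps integrity, the sketch below frames each message as a sync marker, a small header, and a compressed payload, so a reader can skip a damaged record and resume at the next marker. The marker bytes, the header fields, the CRC-32 check, and the use of zlib are all assumptions chosen for this example.

```python
import struct
import zlib

SYNC = b"\xffMAILSYNC\xff"          # hypothetical 10-byte sync marker
HEADER = struct.Struct(">II")        # payload length, CRC-32 of payload


def pack_record(message: bytes) -> bytes:
    """Compress one message and frame it as SYNC + header + payload."""
    payload = zlib.compress(message)
    return SYNC + HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_records(data: bytes):
    """Yield every intact message, skipping damaged regions via SYNC."""
    pos = data.find(SYNC)
    while pos != -1:
        start = pos + len(SYNC)
        body = start + HEADER.size
        if body <= len(data):
            length, crc = HEADER.unpack_from(data, start)
            payload = data[body:body + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                yield zlib.decompress(payload)
                pos = data.find(SYNC, body + length)
                continue
        # Header or payload is damaged: resume at the next sync marker.
        pos = data.find(SYNC, pos + 1)
```

     Compressing each record separately costs some compression ratio compared with compressing a whole file, but it keeps records independently readable, which is what lets the reader recover after a corrupt region.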
     Finally, we evaluate the system's I/O performance. Because only a limited number of machines is available, we want the results measured on a smaller cluster to also hold for a larger cluster, and so this paper proposes a speed stability test. In the speed tests, write speed is above 20 MB/s and read speed is about 40 MB/s. The evaluation shows that the system has high I/O performance.
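     The abstract does not define the speed stability test. One plausible reading, shown here purely as an assumption, is to sample throughput repeatedly and check how much the samples vary around their mean, for example with the coefficient of variation. The function names and the convention 1 MB = 1,000,000 bytes are illustrative choices, not taken from the thesis.

```python
from statistics import mean, pstdev


def throughput_mb_s(bytes_moved, seconds):
    """Throughput in MB/s, taking 1 MB as 1,000,000 bytes (assumed)."""
    return bytes_moved / seconds / 1_000_000


def stability(samples):
    """Return (mean, coefficient of variation) of throughput samples."""
    avg = mean(samples)
    return avg, pstdev(samples) / avg
```

     A coefficient of variation close to zero means the measured speed barely changes between runs. Whether that justifies extrapolating to a larger cluster depends on the thesis's own argument, which the abstract does not spell out.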
