Research on Distributed Storage Technology Based on Data De-Duplication and the Chord Protocol

Original title: 基于数据消冗和Chord协议的分布式存储技术研究
Author: Jin Xuejiao (金雪姣)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: distributed storage; Chord protocol; de-duplication; Rabin fingerprint
Degree year: 2010
Advisor: Li Dong (李东)
Discipline code: 081201
Degree-granting institution: Harbin Institute of Technology
Submission date: 2010-06-01

Abstract

In the information age, data volumes are growing rapidly, and data has become a valuable asset whose worth now far exceeds that of the computer systems that hold it. At the same time, a variety of unpredictable factors make data easy to lose, which can cause users heavy losses. Faced with the demands that massive data places on every aspect of a storage system, efficient data storage technology has therefore attracted wide attention.
     To meet the demands that massive data places on storage systems, we first study existing block-level data de-duplication techniques, compare the advantages and disadvantages of fixed-length and variable-length chunking, and analyze the factors that affect de-duplication efficiency. We then focus on the variable-length chunking algorithm based on Rabin fingerprints and propose a new algorithm for finding file cut points.
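To make the idea of variable-length chunking concrete, the following is a minimal sketch of generic content-defined chunking, not the cut-point search algorithm proposed in the thesis (the abstract does not describe it). It uses a simple polynomial rolling hash modulo a prime as a stand-in for a true Rabin fingerprint over GF(2). The window size, base, modulus, mask, and size limits are all assumed values chosen for illustration.

```python
import hashlib
from typing import List

# All constants below are illustrative assumptions, not values from the thesis.
WINDOW = 48            # bytes covered by the sliding window
BASE = 257             # polynomial base of the rolling hash
MOD = (1 << 61) - 1    # large prime modulus
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero (about 2 KiB between candidates)
MIN_SIZE = 512         # never cut before this many bytes
MAX_SIZE = 8192        # always cut at this many bytes

# BASE**(WINDOW - 1) mod MOD, used to drop the byte that leaves the window.
TOP = pow(BASE, WINDOW - 1, MOD)


def chunk(data: bytes) -> List[bytes]:
    """Split data at cut points chosen from the content itself."""
    chunks = []
    start = 0   # index where the current chunk begins
    h = 0       # rolling hash of the last WINDOW bytes of the current chunk
    for i, b in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Remove the contribution of the byte sliding out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + b) % MOD
        size = pos + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks


def fingerprints(data: bytes) -> List[bytes]:
    """SHA-1 digest of each chunk; equal digests mark duplicate chunks."""
    return [hashlib.sha1(c).digest() for c in chunk(data)]
```

Because a cut depends only on the bytes inside the window, inserting a few bytes near the start of a file shifts the early cut points but leaves most later chunks, and therefore their digests, unchanged. Fixed-length chunking lacks this property: a single inserted byte changes every subsequent block.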
     Based on the characteristics of block-based de-duplication and of Chord-based distributed storage, we design the file resource location scheme and the duplicate-data filtering strategy. Following the structure of Chord, the index information for file blocks is partitioned by interval and distributed across different nodes; this two-level index avoids the problems of a centralized block index. Finally, we propose a distributed storage system architecture that combines Chord-based distributed storage with variable-length block de-duplication based on Rabin fingerprints.
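One way to picture an interval-partitioned block index is the sketch below. It is an assumption-laden illustration, not the thesis design: each key is hashed onto an identifier ring and stored on its successor node, as in Chord, so every node holds the index entries for one interval of the ring. The `Ring` class, the `file:` and `chunk:` key prefixes, and the reading of "two-level" as file recipe plus per-chunk entry are all hypothetical. Successors are found from a global sorted list rather than through finger-table routing, which a real Chord node would use.

```python
import bisect
import hashlib

M = 32  # identifier bits (assumed; kept small for readability)


def ident(key: bytes) -> int:
    """Map a key onto the 2**M identifier ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % (1 << M)


class Ring:
    """Hypothetical ring with a global view of all node identifiers."""

    def __init__(self, node_names):
        self.ids = sorted({ident(n.encode()) for n in node_names})
        self.shards = {nid: {} for nid in self.ids}  # one index shard per node

    def successor(self, key_id: int) -> int:
        """First node identifier >= key_id, wrapping around the ring."""
        i = bisect.bisect_left(self.ids, key_id)
        return self.ids[i % len(self.ids)]

    def put(self, key: bytes, value) -> int:
        node = self.successor(ident(key))
        self.shards[node][key] = value
        return node

    def get(self, key: bytes):
        return self.shards[self.successor(ident(key))].get(key)


def store_file(ring: Ring, name: str, chunks) -> int:
    """Index a file; return how many of its chunks were not already known."""
    recipe, new_chunks = [], 0
    for c in chunks:
        fp = hashlib.sha1(c).digest()
        if ring.get(b"chunk:" + fp) is None:
            # Second level: chunk fingerprint -> chunk metadata (here, its size).
            ring.put(b"chunk:" + fp, len(c))
            new_chunks += 1
        recipe.append(fp)
    # First level: file name -> ordered list of chunk fingerprints.
    ring.put(b"file:" + name.encode(), recipe)
    return new_chunks
```

Storing two files that share chunks shows the filtering effect: with chunks `[b"a", b"b", b"c"]` for the first file and `[b"a", b"b", b"d"]` for the second, only `b"d"` is new on the second call, and no single node has to hold the whole index.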
     Experimental results show that introducing data de-duplication into a Chord-based distributed storage system reduces the storage burden of the whole system. In addition, the reduced volume of transmitted data helps improve the efficiency of data backup and recovery over low-bandwidth networks.

