Research on Data De-duplication Technology in Network Backup
Abstract
Today, the ever-growing volume and value of digital information have created a critical and mounting demand for large-scale, high-performance data protection. The mass of data needing backup and archiving already amounts to several petabytes and may soon reach tens or even hundreds of petabytes. Despite this explosive growth, research shows that large amounts of duplicate data exist at every stage of information processing and storage, including file systems, e-mail attachments, web objects, operating systems and application software. Traditional data protection technologies such as periodic backup, versioning file systems, snapshots and continuous data protection magnify this duplication by storing the same redundant data over and over again. Because of the unnecessary data movement, enterprises often face backup windows that roll into production hours, network constraints, and too much storage under management. To restrain the excessive growth of data, improve resource utilization and reduce costs, data de-duplication has become a hot research topic.
     Due to the continued growth of data and the high availability requirements of applications, a large-scale network backup system that performs data de-duplication to improve storage space efficiency must also deliver good performance and scalability. Our work therefore focuses on de-duplication performance and scalability. We present a distributed, hierarchical data de-duplication architecture based on centralized management, and then study metadata management, index maintenance, and scalable, high-performance de-duplication in detail. The main contributions of this dissertation are:
     Existing de-duplication solutions obtain high backup performance but, because of their single-server architecture, scale poorly in large-scale distributed backup environments. To overcome this, we present a distributed hierarchical de-duplication architecture based on centralized management for network backup. The architecture supports a cluster of backup servers performing de-duplication in parallel, and uses a master server that handles job scheduling, metadata management and load balancing to improve scalability. The data stream is transferred directly from the client to a backup server, deduplicated in batch, and then sent to the back-end storage nodes, which effectively separates the control flow from the data flow. A multi-layer data indexing technique supports high-performance hierarchical de-duplication and dynamic expansion of both the backup-server and storage-node layers, giving the system good performance, manageability and scalability.
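The separation of logical and physical data behind the multi-layer index can be sketched as follows. This is a minimal illustration under assumed names (`FileRecipe`, `physical_map`, `read_chunk` are not the dissertation's actual structures): a file's logical recipe records only chunk fingerprints, while a separate physical map resolves each fingerprint to a container on a storage node, so the storage layer can be reorganized or expanded without touching file metadata.

```python
from dataclasses import dataclass, field

@dataclass
class FileRecipe:
    """Logical layer: an ordered list of chunk fingerprints per file."""
    path: str
    fingerprints: list = field(default_factory=list)

# Physical layer: fingerprint -> (storage_node_id, container_id).
physical_map = {}

def restore(recipe, read_chunk):
    """Rebuild a file by resolving each fingerprint through the physical layer.

    read_chunk(location, fp) fetches the chunk bytes for fingerprint fp from
    the (node, container) location recorded in the physical map.
    """
    return b"".join(read_chunk(physical_map[fp], fp)
                    for fp in recipe.fingerprints)
```

Because restore consults only the physical map, migrating a container between nodes means updating map entries, not rewriting any file recipe.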
     Existing de-duplication technologies look up fingerprints across the global system to eliminate duplicates while writing data to back-end storage. As the amount of data grows, the memory overhead for accelerating fingerprint lookup grows with it, so the system's physical capacity is ultimately limited by the physical memory the server can offer. We therefore implemented an in-memory fingerprint filter based on small-scale detection, deployed in the backup process to eliminate the duplicates generated by periodic backups. The filter limits fingerprint lookup to the scope of a job chain, so its memory overhead is independent of system scale. In addition, it collects fingerprints during backup, which enables high-performance post-processing de-duplication and thus avoids the impact of time-consuming disk index access on the application system. Experiments show that the filter eliminates most of the duplicates in a backup stream, improving overall system performance by reducing both the bandwidth required for backups and the number of chunks that must be processed further in the background.
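The job-chain-scoped filtering idea can be sketched as below, using fixed-size chunking and SHA-1 fingerprints for brevity (the actual system's chunking method and filter data structure may differ, and all names here are illustrative): lookups are confined to the fingerprints of the previous backup in the same job chain, and every fingerprint is also collected for the later post-processing pass.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for brevity; content-defined chunking also fits

def fingerprint(chunk: bytes) -> bytes:
    """SHA-1 digest as the chunk fingerprint."""
    return hashlib.sha1(chunk).digest()

class JobChainFilter:
    """Confines fingerprint lookup to one job chain: memory use is bounded by
    the previous backup's fingerprint set, independent of total system scale."""

    def __init__(self, previous_fps=()):
        self.known = set(previous_fps)   # fingerprints of the chain's last backup
        self.collected = []              # gathered for the post-processing pass

    def filter_stream(self, data: bytes):
        """Yield (fingerprint, chunk) only for chunks absent from the previous backup."""
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = fingerprint(chunk)
            self.collected.append(fp)    # kept for batch de-duplication later
            if fp not in self.known:
                yield fp, chunk          # only new chunks cross the network
```

A second backup in the same chain is seeded with the first backup's collected fingerprints, so unchanged chunks are dropped at the client side before consuming bandwidth.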
     Data that has passed the in-memory fingerprint filter is further deduplicated in the background by a post-processing de-duplication algorithm. The algorithm processes a large batch of fingerprints in a single sequential pass over the disk index, effectively eliminating the random disk I/O bottleneck of fingerprint lookup and index update. It preserves the logical order of new chunks by storing them in fixed-size containers, which enables high-performance data recovery. Containers are distributed to back-end storage nodes by a stateless routing algorithm that supports load balancing, data migration and dynamic expansion of the back-end storage. Experiments show that, compared with current mainstream de-duplication technologies, the algorithm supports a larger physical system capacity for the same memory overhead; more importantly, it supports multiple servers performing de-duplication storage in parallel, making it applicable to large-scale distributed environments.
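The batch pass can be sketched as a sort-merge between the sorted fingerprint batch and a sequentially scanned index, with the target node derived statelessly from the container id. This is a simplified illustration under assumed names; the dissertation's on-disk index layout and routing function are more involved.

```python
def batch_dedup(collected_fps, sorted_index):
    """Resolve a batch of fingerprints against a sorted on-disk chunk index
    (modeled here as a sorted list) in one sequential merge pass.

    Returns (duplicates, new_fps); no random index seeks are needed."""
    batch = sorted(set(collected_fps))   # deduplicate and sort the batch first
    dup, new = [], []
    i = 0
    for fp in batch:
        while i < len(sorted_index) and sorted_index[i] < fp:
            i += 1                       # advance the sequential index scan
        if i < len(sorted_index) and sorted_index[i] == fp:
            dup.append(fp)
        else:
            new.append(fp)
    return dup, new

def route_container(container_id: int, num_nodes: int) -> int:
    """Stateless routing: the target node is computed purely from the
    container id, so no routing table must be stored or synchronized."""
    return container_id % num_nodes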
     The post-processing de-duplication algorithm performs fingerprint lookup with a sequential scan over the entire disk index, so keeping the disk index small for a given system scale is essential to performance. Since we found no prior work on disk index space utilization, we implemented the disk index as a disk-resident hash table based on prefix mapping, and studied index utilization through both theoretical analysis and extensive experiments. The results show that with appropriately sized buckets, disk index utilization can be effectively improved at acceptable CPU overhead, which both reduces metadata storage and improves index scan performance.
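The prefix-mapping scheme with fixed-capacity buckets can be sketched as follows (`PREFIX_BITS` and `BUCKET_CAPACITY` are illustrative parameters, not the values studied in the dissertation): the leading bits of a fingerprint select its bucket, an insert into a full bucket counts as overflow, and utilization is the fraction of bucket slots actually occupied.

```python
PREFIX_BITS = 8       # leading fingerprint bits used as the bucket address
BUCKET_CAPACITY = 64  # fingerprints per fixed-size bucket

def bucket_of(fp: bytes) -> int:
    """Prefix mapping: the first PREFIX_BITS bits of the fingerprint
    (assumed <= 8 here for simplicity) select the bucket."""
    return fp[0] >> (8 - PREFIX_BITS)

class PrefixHashTable:
    """Toy model of the disk-resident hash table, for reasoning about
    overflow probability and space utilization."""

    def __init__(self):
        self.buckets = [[] for _ in range(1 << PREFIX_BITS)]
        self.overflow = 0

    def insert(self, fp: bytes) -> bool:
        b = self.buckets[bucket_of(fp)]
        if len(b) >= BUCKET_CAPACITY:
            self.overflow += 1   # full bucket: would spill to an overflow area
            return False
        b.append(fp)
        return True

    def utilization(self) -> float:
        used = sum(len(b) for b in self.buckets)
        return used / (len(self.buckets) * BUCKET_CAPACITY)
```

The trade-off the dissertation studies falls out of this model: larger buckets lower the overflow probability and raise utilization, but increase the in-bucket lookup work per access.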
