Research on Data Deduplication Techniques in Data Backup Systems
Abstract
With the advance of societal informatization and the explosive growth of data volumes, data backup systems must handle ever larger amounts of backup and restore data, and data deduplication, a lossless data compression technique, has come into wide use in backup systems. Although deduplication can greatly compress data and improve the utilization of storage space and network bandwidth, it is still a young compression technique and faces many problems and challenges. In cloud backup services, for example, existing source-side deduplication methods cannot substantially reduce backup and restore times and thus fall short of users' expectations of the service; moreover, because existing deduplication methods must remove duplicate chunks shared among multiple files, they leave many data fragments in the backup system, degrading backup and deduplication performance.
In cloud backup systems, backup is very slow because of the limited bandwidth of wide-area networks, and for most users, slow backup directly disrupts normal business operations. To address this problem, this dissertation proposes SAM (Semantic-Aware Multi-Tiered Source De-duplication Framework), a file-semantics-based multi-tiered source deduplication method that reduces backup time. Before SAM, cloud backup systems mainly used source-side global chunk-level deduplication or source-side local chunk-level deduplication to remove duplicates at the client and shrink the amount of backup data transmitted over the WAN. The former removes the duplicate data produced across all users globally but needs a long deduplication time; the latter removes only the duplicates produced by a single user, so its deduplication time is short, but it achieves a lower duplicate elimination ratio and therefore needs a longer data transmission time. Analysis shows that each approach has its strengths, yet neither can substantially reduce backup time or relieve the data-transfer bottleneck encountered during backup. SAM combines the advantages of both: it couples source-side global file-level deduplication with local chunk-level deduplication, and at both tiers it mines file semantics to narrow the search space for duplicate data and speed up duplicate lookup. Theoretical analysis and experimental results show that, compared with the two existing source deduplication methods, SAM strikes a good balance between the duplicate elimination ratio it achieves and the deduplication overhead it introduces, and can substantially reduce backup time.
However, existing source deduplication methods, SAM included, focus only on backup time in cloud backup and pay little attention to restore time. Although these methods satisfy most users well, restore time is critical for enterprises with high reliability requirements: when data is damaged, the length of the restore directly determines the size of the financial loss. To address this problem, this dissertation proposes CABdedupe (Causality-Based Deduplication Performance Booster), a deduplication method that reduces not only backup time but also restore time. Observation and analysis show that duplicate data exists in restore operations as well as in backups, and that these duplicates are closely tied to the causal relationships among files. By monitoring file-system calls, CABdedupe captures this causality information and can eliminate duplicate data during both backup and restore, accelerating both processes. Moreover, CABdedupe is a middleware that assists a backup system with deduplication: if CABdedupe fails, some duplicate data simply goes unremoved, weakening its optimization of backup and restore performance, but the backup system's ordinary backup and restore functions are unaffected.
No matter which deduplication method is used, removing duplicate chunks shared among multiple files or data streams leaves the backup system with many data fragments. As the amount of stored backup data grows, these fragments accumulate and severely degrade backup and deduplication performance. To address this problem, this dissertation builds an analytical model and gathers experimental statistics to analyze in detail the negative effects of fragmentation on data-redundancy locality and on deduplication performance, and proposes De-Frag, a method that improves deduplication performance by reducing fragmentation. The core idea of De-Frag is to leave a small portion of duplicate data unremoved, reducing the fragments produced and preserving the redundancy locality across backup streams, while a threshold bounds the amount of duplicate data left unremoved, trading a small loss in duplicate elimination ratio for better deduplication performance. Experimental results show that, by reducing fragmentation, De-Frag improves deduplication throughput, the read performance of deduplicated data, and the duplicate elimination ratio achievable on top of existing deduplication methods.
With the explosive growth of data, data deduplication has become a common compression component in large-scale data backup systems, owing to its lossless compression and the high compression ratios it achieves. However, data deduplication still faces several problems and challenges that vary with the backup datasets. For example, the source deduplication used for cloud backup services cannot substantially reduce backup time, and existing deduplication methods produce many data fragments that degrade deduplication performance.
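The chunk-level deduplication described above can be sketched in a few lines of Python. This is a minimal illustration assuming fixed-size chunks, SHA-1 fingerprints, and an in-memory chunk store; production systems use content-defined chunking and persistent indexes:

```python
import hashlib

def dedupe_chunks(stream: bytes, chunk_size: int = 4096):
    """Fixed-size chunking with SHA-1 fingerprints: each unique chunk is
    stored once, and a per-stream recipe records how to rebuild it."""
    store = {}    # fingerprint -> chunk bytes (stored once)
    recipe = []   # ordered fingerprints needed to rebuild the stream
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)   # duplicate chunks hit the same key
        recipe.append(fp)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream: deduplication is lossless."""
    return b"".join(store[fp] for fp in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192   # highly redundant stream
store, recipe = dedupe_chunks(data)
assert restore(store, recipe) == data
# the recipe references 5 chunks, but only 2 unique chunks are stored
```

The dictionary key acts as the chunk index that real systems keep on disk or in a Bloom-filter-fronted cache; lookup cost in that index is exactly the overhead the methods below try to reduce.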
Because of the low bandwidth of the WAN (Wide Area Network) links that carry cloud backup services, backup time urgently needs to be reduced. The existing source deduplication methods, source global chunk-level deduplication and source local chunk-level deduplication, remove redundant chunks before sending data to the remote backup destination. The former removes duplicate data across different clients globally but needs a long deduplication time, while the latter removes duplicate data only locally within the same client to shorten deduplication time, but achieves a low duplicate elimination ratio and thus needs a long data transmission time. Neither method can substantially reduce backup time. In this dissertation, we propose SAM, a semantic-aware multi-tiered source deduplication framework for cloud backup services. SAM combines source global file-level deduplication with local chunk-level deduplication and, at the same time, exploits file semantics to narrow the search space for duplicate data and reduce deduplication overhead. Compared with the existing source deduplication methods, SAM achieves a higher duplicate elimination ratio than source local chunk-level deduplication and needs a shorter deduplication time than source global chunk-level deduplication, striking an optimal tradeoff between elimination ratio and deduplication overhead and largely shortening the backup window.
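The two tiers SAM combines can be sketched as follows; the set-based indexes, fixed-size chunking, and function names are illustrative assumptions, not SAM's actual design. A whole-file fingerprint is checked first against a global, cross-client index, and only files that miss there are chunked and checked against the client's local chunk index:

```python
import hashlib

def sam_backup(files, global_file_index, local_chunk_index, chunk_size=4096):
    """Two-tier source deduplication in the spirit of SAM (simplified):
    tier 1 drops whole files any client has already backed up; tier 2
    drops chunks this client has already sent."""
    to_send = []
    for name, data in files.items():
        file_fp = hashlib.sha1(data).hexdigest()
        if file_fp in global_file_index:           # tier 1: global file-level hit
            continue
        global_file_index.add(file_fp)
        for i in range(0, len(data), chunk_size):  # tier 2: local chunk-level
            c = data[i:i + chunk_size]
            chunk_fp = hashlib.sha1(c).hexdigest()
            if chunk_fp in local_chunk_index:
                continue
            local_chunk_index.add(chunk_fp)
            to_send.append((name, i, c))
    return to_send                                 # only novel chunks cross the WAN

global_idx, client1_idx, client2_idx = set(), set(), set()
doc = b"x" * 4096 + b"y" * 4096
sent = sam_backup({"doc": doc}, global_idx, client1_idx)   # both chunks sent
sent = sam_backup({"doc": doc}, global_idx, client2_idx)   # nothing: global file hit
sent = sam_backup({"doc2": b"x" * 4096 + b"z" * 4096},
                  global_idx, client1_idx)                 # only the new chunk sent
```

File-level lookups are cheap enough to run globally, while the expensive chunk-level lookups stay local; SAM's semantic hints (file type, size, locality) further shrink both index searches.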
However, the existing source deduplication methods, including SAM, focus only on removing redundant data during cloud backup operations and pay little attention to restore time. Yet according to our survey, restore time is critical to enterprises that demand high data reliability, because a slow restore translates into a large financial loss when a data disaster strikes. In this dissertation, we propose CABdedupe, a causality-based deduplication performance booster for cloud backup services, which captures and preserves the causal relationships among chronological versions of the backup datasets. Using this causality information, CABdedupe removes redundant data not only during cloud backup operations, reducing backup time, but also during cloud restores, reducing restore time. Moreover, CABdedupe is a middleware that is orthogonal to, and can be integrated into, any existing backup system; its failure merely causes some redundant data to remain and be transmitted during backups and restores, but does not disturb the backups and restores themselves or cause them to fail.
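The use of causality information can be illustrated with a toy version-tracking middleware; the `CausalityTracker` class and its methods are hypothetical names for illustration, not CABdedupe's interface. Because the tracker knows which chunks each recorded version consists of, one lookup serves both directions: backup skips chunks the server already holds, and restore skips chunks that survive on the client.

```python
import hashlib

def chunk_fps(data, size=4096):
    """Fingerprints of a file's fixed-size chunks (simplified chunking)."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

class CausalityTracker:
    """Toy stand-in for a causality-tracking middleware: it records which
    chunks each monitored file version consists of, so a transfer in either
    direction (backup or restore) can skip chunks the peer already holds."""
    def __init__(self):
        self.versions = {}   # (path, version) -> list of chunk fingerprints

    def record(self, path, version, data):
        self.versions[(path, version)] = chunk_fps(data)

    def chunks_to_transfer(self, path, new_data, held_version):
        """Chunks of `new_data` absent from the version the peer holds."""
        held = set(self.versions.get((path, held_version), []))
        return [fp for fp in chunk_fps(new_data) if fp not in held]

tracker = CausalityTracker()
v1 = b"a" * 4096 + b"b" * 4096
tracker.record("report.doc", 1, v1)     # version 1 observed via monitored I/O
v2 = v1 + b"c" * 4096                   # version 2 appends one chunk
# moving v2 to/from a peer that holds v1 transfers only the one new chunk
delta = tracker.chunks_to_transfer("report.doc", v2, held_version=1)
```

If the tracker's state is lost, `chunks_to_transfer` simply returns every chunk, which mirrors the middleware property above: a failure costs optimization, never correctness.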
Because it removes redundant data, any deduplication approach forces a file or data stream to be divided into multiple parts and produces many data fragments. This fragmentation becomes far more severe under long-term backup and retention, significantly affecting deduplication performance, including deduplication throughput, data read performance, and the data reliability tied to the deduplication process. In this dissertation, we analyze the negative effects of fragmentation on deduplication performance and propose De-Frag, a simple but effective approach to alleviate it. The key idea of De-Frag is to leave some redundant data unremoved, thereby reducing data fragments and preserving data locality; a threshold restricts the amount of unremoved redundant data. Extensive experiments driven by real-world datasets show that De-Frag effectively improves deduplication performance on top of existing deduplication approaches while sacrificing little duplicate elimination ratio.
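The threshold idea behind De-Frag can be sketched with a simplified container model; the names and the byte-budget policy below are assumptions for illustration, not the dissertation's algorithm. A duplicate chunk whose stored copy sits in an old container would fragment the new backup, so it is stored again in the current container, but only while the re-stored bytes stay within a threshold fraction of the stream:

```python
def defrag_backup(chunks, index, new_container, threshold=0.05):
    """Fragment-aware deduplication sketch: duplicates whose stored copy
    lives in an OLD container are re-stored into the current container to
    preserve locality, but only up to `threshold` of the stream's bytes."""
    budget = threshold * sum(len(d) for _, d in chunks)
    rewritten = 0
    stored = []                       # chunks physically written this run
    for fp, data in chunks:
        if fp in index and index[fp] == new_container:
            continue                  # duplicate with good locality: dedupe it
        if fp in index:               # duplicate in an old container: a fragment
            if rewritten + len(data) > budget:
                continue              # budget exhausted: accept the fragment
            rewritten += len(data)    # re-store it to restore locality
        index[fp] = new_container
        stored.append(fp)
    return stored, rewritten

index = {"a": 0, "b": 0}              # chunks "a","b" live in old container 0
chunks = [("a", b"x" * 100), ("b", b"y" * 100), ("c", b"z" * 100)]
stored, rewritten = defrag_backup(chunks, index, new_container=1, threshold=0.4)
# "a" is re-stored (100 bytes, within budget), "b" stays a fragment, "c" is new
```

The threshold is the tradeoff knob: at 0 this degenerates to ordinary exact deduplication, while larger values spend more capacity on duplicates to keep each backup's chunks physically contiguous.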
