Research on Efficient Data De-duplication Technology
Abstract
Human society has entered an era of rapid informatization, and all kinds of digital data are growing explosively. Computer storage systems hold ever more data, much of it redundant, and this redundancy keeps accumulating over time. It not only wastes large amounts of storage space but also degrades storage system performance and raises data management costs. Research on data reduction techniques that eliminate redundant data in storage systems is therefore of significant theoretical and practical value for optimizing and improving storage system performance.
     Data de-duplication is a data reduction technique that can eliminate large amounts of redundant data in storage systems, improving storage space utilization and reducing data management costs; it has become a hot research topic in the field of computer storage.
     The main technical challenge currently facing data de-duplication is how to improve storage system performance by raising de-duplication efficiency. De-duplication efficiency is mainly reflected in the de-duplication strategy, the de-duplication ratio, and the duplicate detection speed, and it strongly affects storage space utilization and overall storage system performance. Taking de-duplication efficiency as its technical main line, this dissertation studies several key problems in depth: the de-duplication architecture, a global de-duplication strategy, a memory index technique that accelerates duplicate detection, and a pipeline-based duplicate detection method. The author's main research work and contributions cover the following four aspects:
     (1) To address the poor scalability of traditional de-duplication architectures, a Clustered Two-level Data De-duplication Architecture (CTDDA) is proposed. CTDDA consists of clients, a metadata server, and multiple storage nodes; new nodes can be added at any time, so system capacity is easy to expand. CTDDA supports two-level de-duplication at both the file level and the chunk level: the metadata server first performs file-level de-duplication, and the non-duplicate files are then distributed evenly across the storage cluster, where the nodes perform chunk-level de-duplication in parallel. Combining two-level de-duplication with parallel operation on all nodes improves the de-duplication efficiency of the storage system.
     (2) To eliminate redundant data across the nodes of the storage cluster, a Global Data De-duplication Strategy based on Bloom Filter (GDDSBF) is proposed. To prevent each node in CTDDA from de-duplicating only within its own local scope, GDDSBF uses Bloom filters to build a fingerprint summary vector for every node in the cluster and aggregates all the vectors into a global Fingerprint Summary Array (FSA). By querying the FSA, every node can de-duplicate over the whole cluster and thus achieve a higher de-duplication ratio. GDDSBF also accommodates system scalability: when a new storage node is added, inserting its fingerprint summary vector into the array extends the duplicate detection range to all nodes, including the new one. Experiments show that, compared with a local strategy, GDDSBF removes more redundant data and achieves a higher de-duplication ratio, thereby improving the space utilization of the storage system.
     (3) To speed up duplicate detection in storage systems, a Memory Index Method based on Hash Table (MIMHT) is proposed. During de-duplication, duplicates are generally detected by querying a chunk index table; as the volume of data grows, the in-memory chunk index also grows and can exceed the available memory, so it must be kept on disk, and querying it then incurs frequent disk I/O. The idea of MIMHT is to cache the "hot" part of the on-disk index in memory and to link the index entries belonging to the same container into a circular linked list, forming a hash-table-based memory index structure. Index entries are then prefetched and replaced in units of containers, which raises the hit rate of memory index queries and reduces the number of disk index accesses. Theoretical analysis and experimental results show that MIMHT achieves a higher memory hit rate and faster duplicate detection than DDFS (Data Domain File System) and the undirected-graph traversal grouping method, improving the I/O performance of the storage system.
     (4) Combining the global fingerprint summary array with the in-memory hash index, and based on a stage-by-stage analysis of the duplicate detection process, a Duplicate Data Detection Method based on Pipeline (DDDMP) is proposed. Its main idea is that, on top of the parallel duplicate detection performed by the storage nodes, each node internally uses pipelining to accelerate detection further. Double buffer queues are used between adjacent pipeline stages to synchronize threads, reducing the synchronization overhead of sharing a single buffer queue, and the memory index query stage, which can stall the pipeline, is optimized. Experiments show that DDDMP clearly outperforms sequential execution: it further accelerates duplicate detection and improves both de-duplication efficiency and the performance of the whole storage system.
Nowadays, as human society has entered the era of information technology, storage systems hold more and more redundant information, which keeps increasing over time with the explosion of digital data. This redundant information not only occupies more storage space, but also decreases the performance of storage systems and increases the cost of data management. Therefore, it is of great significance to research data reduction techniques that eliminate duplicate data in order to optimize and improve the performance of storage systems.
     Data de-duplication, a kind of data reduction technique, can increase storage space utilization and reduce the cost of data management by deleting large amounts of redundant data. It has become a hot research topic in the field of computer storage.
     Currently, the main technical challenge of data de-duplication is how to improve the performance of the storage system by enhancing the efficiency of de-duplication. This efficiency, a critical factor in improving storage space utilization and optimizing storage system performance, mainly comprises three factors: the de-duplication strategy, the de-duplication ratio, and the duplicate detection speed. This dissertation investigates approaches to enhance de-duplication efficiency, focusing on the de-duplication architecture, a global de-duplication strategy, a memory index method, and a pipeline-based duplicate data detection method. The main research work and innovations are as follows:
     (1) To overcome the poor scalability of traditional de-duplication architectures, a clustered two-level data de-duplication architecture (CTDDA) is proposed. CTDDA is composed of clients, a metadata server, and multiple storage nodes; new nodes can be added whenever needed to expand system capacity. CTDDA supports both file-level and chunk-level de-duplication: it first eliminates duplicate files at the metadata server, then distributes the non-duplicate files evenly to the storage nodes, which de-duplicate at the chunk level in parallel. The two-level architecture and the parallel operation of all nodes substantially enhance de-duplication efficiency.
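As an illustration of the two-level idea, the following sketch first filters whole files by fingerprint at a simulated metadata server and only chunks the files that survive file-level de-duplication. All names are hypothetical, and fixed-size chunking and hash-based routing are simplifying assumptions; the dissertation does not prescribe this implementation.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; content-defined chunking is also common

class TwoLevelDedup:
    """Sketch of two-level de-duplication: file level first, then chunk level per node."""
    def __init__(self, num_nodes=4):
        self.file_index = set()  # metadata server: fingerprints of stored files
        self.node_chunk_index = [set() for _ in range(num_nodes)]  # per-node chunk fingerprints
        self.num_nodes = num_nodes

    def store(self, data: bytes) -> int:
        """Store a file; return the number of new chunks actually written."""
        file_fp = hashlib.sha1(data).hexdigest()
        if file_fp in self.file_index:       # file-level hit: nothing new to store
            return 0
        self.file_index.add(file_fp)
        # route the non-duplicate file to one node, then de-duplicate its chunks there
        node = int(file_fp, 16) % self.num_nodes
        stored = 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk_fp = hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            if chunk_fp not in self.node_chunk_index[node]:
                self.node_chunk_index[node].add(chunk_fp)
                stored += 1                  # only unique chunks are written
        return stored
```

Storing the same file a second time costs nothing at the chunk level, because the metadata server already rejects it at the file level.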
     (2) In order to eliminate redundant data among storage nodes, a global data de-duplication strategy based on Bloom Filter (GDDSBF) is proposed. To prevent each node in CTDDA from eliminating duplicate data only locally, GDDSBF creates a fingerprint summary vector for each node using a Bloom filter, and all the vectors are gathered into a global fingerprint summary array (FSA). Each node can then detect duplicate data globally by searching the FSA, achieving a high de-duplication ratio. Furthermore, when a new node is added to the storage cluster, GDDSBF extends the detection range to all nodes, including the new one, by inserting the new node's fingerprint summary vector into the FSA. Theoretical analysis and experimental results show that GDDSBF deletes more redundant data and attains a higher de-duplication ratio than the local de-duplication strategy. Therefore, it improves the space utilization of storage systems.
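A minimal sketch of the fingerprint summary array idea follows: one simplified Bloom filter per node, with a global query over all vectors. The filter parameters and class names here are illustrative assumptions, not the dissertation's actual design.

```python
import hashlib

class BloomFilter:
    """Simplified Bloom filter: k hash positions over an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class FingerprintSummaryArray:
    """One summary vector per storage node; querying the array covers the whole cluster."""
    def __init__(self, num_nodes):
        self.vectors = [BloomFilter() for _ in range(num_nodes)]

    def add_node(self):
        # scalability: a new node just contributes one more vector to the array
        self.vectors.append(BloomFilter())

    def record(self, node_id: int, fingerprint: str):
        self.vectors[node_id].add(fingerprint)

    def probably_duplicate(self, fingerprint: str) -> bool:
        # a hit in any node's vector means the chunk may already exist somewhere in the cluster
        return any(v.might_contain(fingerprint) for v in self.vectors)
```

A chunk recorded on one node is then reported as a probable duplicate no matter which node performs the query, which is what turns local de-duplication into a global one.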
     (3) In order to accelerate duplicate data detection, a memory index method based on the hash table (MIMHT) is presented. In the process of de-duplication, a chunk index is generally used to detect duplicate data. As the data volume grows, the index becomes very large and may exceed the available memory, so it must be stored on disk. To alleviate the disk I/O bottleneck during duplicate detection, MIMHT reads the hot part of the index from disk into memory and creates a memory index based on a hash table, in which the index entries belonging to the same container are connected by a circular linked list. The reading and replacement of index entries in MIMHT is thus done in units of containers, which achieves a higher hit rate and reduces the frequency of disk index accesses. Experimental analysis shows that MIMHT has a higher hit rate and faster detection speed than DDFS (Data Domain File System) and the grouping prediction method based on the undirected graph. It improves the I/O performance of storage systems.
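The container-grouped caching idea can be sketched as follows. A hash table maps fingerprints to cached entries, the entries of one container are kept together (an `OrderedDict` stands in for the circular linked list), and prefetching and eviction operate on whole containers. The names, the LRU policy, and the toy on-disk index are illustrative assumptions.

```python
from collections import OrderedDict

class ContainerIndexCache:
    """Memory index over hot index entries, prefetched and evicted per container."""
    def __init__(self, disk_index, max_containers=2):
        self.disk_index = disk_index      # container_id -> {fingerprint: location} (simulated disk)
        self.max_containers = max_containers
        self.containers = OrderedDict()   # LRU order over cached containers
        self.table = {}                   # fingerprint -> (container_id, location)
        self.hits = self.misses = 0

    def _prefetch(self, cid):
        if len(self.containers) >= self.max_containers:
            _, entries = self.containers.popitem(last=False)  # evict a whole container at once
            for fp in entries:
                self.table.pop(fp, None)
        entries = self.disk_index[cid]
        self.containers[cid] = entries
        for fp, loc in entries.items():
            self.table[fp] = (cid, loc)

    def lookup(self, fingerprint):
        if fingerprint in self.table:      # memory hit: no disk access needed
            self.hits += 1
            cid, loc = self.table[fingerprint]
            self.containers.move_to_end(cid)
            return loc
        self.misses += 1                   # disk access, then prefetch that whole container
        for cid, entries in self.disk_index.items():
            if fingerprint in entries:
                self._prefetch(cid)
                return entries[fingerprint]
        return None
```

After one miss loads a container, subsequent lookups of fingerprints from the same container hit in memory, which is exactly the locality the container-based prefetching exploits.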
     (4) Combining the FSA and the memory index, and dividing the duplicate detection process into multiple stages, we propose a duplicate data detection method based on pipeline (DDDMP). DDDMP further accelerates duplicate detection inside each node by pipelining. Double buffer queues are utilized to synchronize the threads of adjacent pipeline stages, and the memory index querying stage, which may cause pipeline stalls, is optimized by using multiple threads. The experimental results show that DDDMP is significantly superior to the sequential method: it further accelerates duplicate detection and improves de-duplication efficiency, as well as the performance of the overall system.
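A simplified three-stage pipeline can be sketched as below. Python's `queue.Queue` stands in for the double-buffer queues of the actual design (whose point is to swap two buffers between producer and consumer to cut lock contention), and the in-memory set stands in for the index query stage; stage names are hypothetical.

```python
import hashlib
import queue
import threading

SENTINEL = None  # end-of-stream marker passed down the pipeline

def run_pipeline(chunks):
    """Three-stage duplicate detection: fingerprinting -> index lookup -> store."""
    q1, q2 = queue.Queue(maxsize=64), queue.Queue(maxsize=64)
    seen, unique = set(), []

    def fingerprint_stage():
        for chunk in chunks:
            q1.put((chunk, hashlib.sha1(chunk).hexdigest()))
        q1.put(SENTINEL)

    def lookup_stage():
        while (item := q1.get()) is not SENTINEL:
            chunk, fp = item
            if fp not in seen:      # the index query stage that DDDMP optimizes further
                seen.add(fp)
                q2.put(chunk)
        q2.put(SENTINEL)

    def store_stage():
        while (chunk := q2.get()) is not SENTINEL:
            unique.append(chunk)    # only non-duplicate chunks are written

    threads = [threading.Thread(target=s)
               for s in (fingerprint_stage, lookup_stage, store_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return unique
```

With bounded queues, the three stages overlap: while one chunk is being stored, the next is being looked up and a third is being fingerprinted, which is the speedup pipelining buys over sequential execution.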
