Research on Data Deduplication Technology in Information Storage Systems
Abstract
Data deduplication is a lossless data compression solution for networked storage systems that can effectively curb the rapid growth of data storage overhead and reduce the cost of building, operating, and managing storage systems. Against the background of explosive growth in data volume, deduplication has attracted wide attention from both academia and industry. Nevertheless, many technical problems remain in this field, such as improving the data compression ratio, reducing processing time, and optimizing data reliability. To address these problems, this dissertation conducts in-depth research from three aspects: deduplication processing methods, data reliability in deduplication processing, and data distribution strategies in the storage back end.
     Through a theoretical analysis model and measurements on real-world datasets, the factors that influence the effectiveness of deduplication processing are studied. Because the duplication characteristics of the target data have a significant effect on deduplication performance, a duplication-characteristics-based deduplication strategy is proposed to optimize the data compression ratio and the processing time overhead. The strategy mainly comprises a semantics-based data grouping scheme and a progressive method for determining the chunking granularity. The semantics-based grouping scheme uses semantic information to judge the duplication characteristics and similarity of the data and to group the target data accordingly. The progressive chunking-granularity method takes a data group as its unit of operation and configures an appropriate chunking strategy according to the group's duplication characteristics. Experiments show that, compared with other deduplication solutions, the proposed strategy achieves better overall performance in terms of data compression ratio and processing time overhead.
     To address the data reliability problem in deduplication processing, an optimal redundancy model is proposed that raises the reliability of the target data according to the reference popularity of each data unit. To apply this theoretical model to real storage systems, it is made practical by sampling the space of data units and computing empirical values, and a popularity-based data redundancy scheme is proposed on top of it. The scheme configures the optimal redundancy degree for each data unit according to its attributes (size and reference popularity), so that the target dataset obtains the best data reliability at the lowest storage overhead. Simulation experiments verify the feasibility and effectiveness of the popularity-based data redundancy scheme.
     To address the lack of flexibility in current data distribution strategies, a capacity-aware data distribution strategy is proposed to improve the balance of the storage load when the storage resources of the physical nodes are unequal. The strategy covers two cases. When data redundancy is not considered, a capacity-aware data distribution scheme is proposed: it builds on the consistent-hashing data distribution algorithm, introduces a virtualization design, allocates storage resources with a virtual-node allocation method, and uses a node-capacity-aware load balancing method to adjust the data load distribution across physical storage nodes. When data redundancy is considered, a data distribution scheme supporting multiple redundancy degrees is proposed, which provides flexible platform support for data redundancy schemes and further optimizes storage load balance. Simulation results show that both schemes help improve the balance of the storage data load in their respective application scenarios.
Data deduplication technology is a lossless data compression method for network storage systems, which can limit the excessive growth of data storage overhead and reduce the cost of system construction and operation. As the amount of digital data grows explosively, data deduplication has attracted great interest in both academia and industry. However, there are still many technical issues in this research area, such as improving the data compression ratio, reducing processing time, and enhancing data reliability. Hence, this dissertation studies data deduplication technology in network storage systems in detail from three aspects: processing efficiency, data reliability, and data distribution. The main contributions of this dissertation include:
     Based on theoretical analysis with a mathematical model and experimental results on real-world datasets, the factors that influence deduplication performance are discussed in detail, from the perspectives of both data duplication characteristics and the processing method. Based on the analysis results, a duplication-characteristics-aware deduplication framework is proposed. The framework employs two techniques: Semantic-aware Data Grouping and the Progressive Data Chunking Setting Method. Semantic-aware Data Grouping is a pre-processing step for deduplication that groups the data by semantic information and enables the framework to improve performance by exploiting the data duplication characteristics. The Progressive Data Chunking Setting Method obtains an optimal data chunking setting for a specific data group. Experiments show that the proposed deduplication framework improves performance in terms of data compression ratio and processing time.
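     The abstract itself contains no code, so the following minimal Python sketch only illustrates, under assumed names and thresholds, how semantic grouping and a progressive chunking setting could fit together: files are grouped by a simple semantic key (the file extension), and each group moves to a finer content-defined chunk size only while the measured deduplication ratio improves by at least a chosen margin. The toy rolling hash, the candidate chunk sizes, and the gain threshold are illustrative assumptions, not the dissertation's implementation.

import hashlib
from collections import defaultdict

def cdc_chunks(data: bytes, avg_size: int):
    """Content-defined chunking with a toy rolling hash; avg_size must be a power of two."""
    mask = avg_size - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF            # toy rolling-style hash (assumption)
        if (h & mask) == 0 and i + 1 - start >= avg_size // 4:
            chunks.append(data[start:i + 1])       # cut when low hash bits hit zero
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(files, avg_size):
    """Fraction of bytes removed when identical chunks are stored only once."""
    seen, total, stored = set(), 0, 0
    for data in files:
        for c in cdc_chunks(data, avg_size):
            total += len(c)
            fp = hashlib.sha1(c).digest()
            if fp not in seen:
                seen.add(fp)
                stored += len(c)
    return 1 - stored / total if total else 0.0

def progressive_setting(files, settings=(65536, 16384, 4096), gain=0.02):
    """Move to a finer chunk size only while it buys at least `gain` extra ratio."""
    best_size, best_ratio = settings[0], dedup_ratio(files, settings[0])
    for size in settings[1:]:
        r = dedup_ratio(files, size)
        if r - best_ratio < gain:
            break
        best_size, best_ratio = size, r
    return best_size, best_ratio

def group_and_tune(named_files):
    """Semantic-aware grouping by file extension, then a per-group chunking decision."""
    groups = defaultdict(list)
    for name, data in named_files:
        groups[name.rsplit('.', 1)[-1].lower()].append(data)
    return {ext: progressive_setting(datas) for ext, datas in groups.items()}

     With this kind of per-group tuning, groups whose duplication is coarse-grained can stay at a large chunk size and avoid index and metadata overhead, while highly redundant groups are chunked more finely, which matches the compression-ratio-versus-processing-time trade-off discussed above.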
     While data deduplication can achieve storage space savings effectively, it also reduces data reliability. An optimal redundancy model is proposed to obtain the optimal replication degree for each individual data object. Furthermore, the Popularity-based Data Redundancy Scheme is presented, which applies the proposed model to achieve an optimal level of data reliability at minimal storage space overhead. To improve feasibility, several acceleration techniques are incorporated into the scheme; they employ sampling and empirical intermediate data to reduce the computational cost of the model. The evaluation experiments demonstrate that the proposed scheme improves data reliability for deduplication solutions.
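     As a hedged illustration of the popularity-based redundancy idea (the independent-failure model, the loss probability, and the target value below are assumptions, not the dissertation's optimal-redundancy model), the following sketch assigns each deduplicated chunk the smallest replica count that pushes the expected impact of losing it, weighted by how many files reference it, below a target.

import math

def replica_degree(popularity: int, p_loss: float = 0.02,
                   target: float = 1e-3, r_min: int = 1, r_max: int = 6) -> int:
    """Smallest r with popularity * p_loss**r <= target, clamped to [r_min, r_max]."""
    if popularity <= 0:
        return r_min
    # popularity * p^r <= target  <=>  r >= log(target / popularity) / log(p), since log(p) < 0.
    r = math.ceil(math.log(target / popularity) / math.log(p_loss))
    return max(r_min, min(r_max, r))

def assign_redundancy(ref_counts: dict) -> dict:
    """Map each chunk fingerprint to its replica count based on reference popularity."""
    return {fp: replica_degree(refs) for fp, refs in ref_counts.items()}

# Example: a chunk shared by 10,000 files gets more replicas than a unique chunk.
print(replica_degree(1), replica_degree(10_000))

     Because a deduplicated chunk referenced by many files amplifies the damage of a single loss, its replica count grows with popularity, while unique chunks stay at the minimum, keeping the overall storage overhead low.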
     Considering that traditional data distribution schemes struggle to provide flexibility and load balancing in heterogeneous distributed storage environments, the Capacity-Aware Data Distribution Method is proposed. Firstly, a consistent-hashing-based capacity-aware data distribution scheme is proposed, which employs a virtual node allocation method and a capacity-aware allocation method to improve flexibility and load balancing. Secondly, a high-reliability capacity-aware data distribution scheme is proposed, which supports a multi-replica data redundancy scheme. Experiments show that the two proposed data distribution schemes improve load balancing in heterogeneous distributed storage environments.
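     The placement structure described above can be pictured with the following Python sketch of a capacity-aware consistent-hash ring (the node names, the vnodes_per_tb factor, and the hash choice are hypothetical, not the dissertation's code): each physical node receives virtual nodes in proportion to its capacity, and the replicas of an object are placed on the next distinct physical nodes clockwise on the ring, which is how a multi-replica redundancy scheme can reuse the same placement structure.

import bisect
import hashlib

def _h(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')

class CapacityAwareRing:
    def __init__(self, capacities: dict, vnodes_per_tb: int = 100):
        # capacities: physical node name -> capacity (any consistent unit, e.g. TB).
        self._ring = sorted(
            (_h(f"{node}#{i}"), node)
            for node, cap in capacities.items()
            for i in range(max(1, round(cap * vnodes_per_tb)))
        )
        self._keys = [k for k, _ in self._ring]

    def locate(self, obj_key: str, replicas: int = 1):
        """Return `replicas` distinct physical nodes for obj_key, walking clockwise."""
        idx = bisect.bisect(self._keys, _h(obj_key)) % len(self._ring)
        chosen = []
        for off in range(len(self._ring)):
            node = self._ring[(idx + off) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen

# A 4 TB node receives roughly twice the virtual nodes (and load) of a 2 TB node.
ring = CapacityAwareRing({"node-a": 4, "node-b": 2, "node-c": 2})
print(ring.locate("chunk:0xdeadbeef", replicas=3))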
