Research on Key Technologies of Information Retrieval in Content-Aware Storage Systems (内容感知存储系统中的信息检索关键技术研究)




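English title: Research on the Key Technology of Information Retrieval in Content Aware Network Storage System
Author: Liu Ke (刘科)
Degree level: Doctoral
Discipline: Computer System Architecture (计算机系统结构)
Chinese keywords: 内容感知存储系统 (content-aware storage system) ; 信息检索 (information retrieval) ; 索引分割 (index partitioning) ; 重复数据删除 (deduplication) ; 信息生命周期管理 (information lifecycle management) ; 相似性查询 (similarity query) ; 数据相关图 (data correlation graph)
English keywords: Content Aware Storage System ; Information Retrieval ; Indexing Partition ; Deduplication ; Information Lifecycle Management ; Similarity Retrieval ; Correlation Graph
Degree year: 2012
Advisor: Zhou Jingli (周敬利)
Discipline code: 081201
Degree-granting institution: Huazhong University of Science and Technology (华中科技大学)
Thesis submission date: 2012-01-01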

Abstract

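With the steady advance of information technology, information resources are growing explosively. As digital resources multiply, useful data is often buried in a sea of information, and manual effort or traditional query tools alone can rarely locate the required information quickly. Because information retrieval technology can obtain useful information from large-scale information systems quickly, accurately, and comprehensively, it is regarded as the best way to solve this problem. Existing research focuses on making storage systems more intelligent, strengthening retrieval over heterogeneous information, and improving the relevance of query results. However, because storage and query functions are relatively independent, a storage system can hardly understand the content it stores or optimize queries based on the information it perceives. To transplant techniques from the information retrieval field into the storage field, this thesis studies the architecture of content-aware storage systems and, on that basis, examines the information organization, indexing, and retrieval mechanisms in such a system. From a system architecture perspective, it provides an overall solution that effectively integrates storage and retrieval functions.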
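     To address the lack of content awareness in storage systems, an information extension and transfer mechanism that spans the storage stack is designed. According to the specific needs of the application layer, the mechanism extracts upper-layer semantic information and saves it as extended information. It then extends the traditional data I/O channel with a metadata I/O channel to transfer the extended information. The storage system parses this extended information to obtain the semantic content, and can thereby perceive and use upper-layer information internally to optimize overall system performance. Based on this mechanism, a content-aware network storage prototype system is designed and implemented.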
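     To make full use of the various kinds of information perceived by the storage system and provide users with efficient, convenient query services, a two-phase retrieval strategy for content-aware network storage systems is proposed. Query demands in a storage system come mainly from system administrators querying metadata and from ordinary users querying keyword content. Both kinds of query are accelerated by building separate indexes over metadata and keywords, but the characteristics of the storage system itself have not been used to optimize these query processes. The proposed two-phase retrieval strategy combines metadata- and keyword-based queries with block similarity queries in the underlying storage system, improving the overall query efficiency of the system.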
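     To measure effectively how index optimization affects system performance, an index partitioning mechanism based on tiered storage, together with a cost model, is proposed. As the amount of information in a storage system grows, the space consumed by indexes grows with it. Some indexes are almost never retrieved after they are generated, so not all indexes have the same access frequency. Accordingly, the index optimization algorithm partitions indexes by access frequency and stores them in tiers, placing infrequently used indexes on low-speed storage devices to save cost. The effects of index partitioning on query hit rate, index space overhead, and query time are also analyzed.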
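     To meet users' demand for similarity queries, a method for constructing a data correlation graph based on content hashing is proposed. Storage systems usually organize and manage data in a layered structure. This layered design passes specific information between layers through standard interfaces. It hides the information that each layer need not care about, but it also constrains the free flow of extended information across layers. The proposed construction method uses the duplicate data blocks in the storage system as a bridge, breaks through the layer barriers to establish links among information at multiple layers, and generates a complete data correlation graph with global characteristics. This lays the foundation for introducing theories from the information retrieval field into storage systems.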
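     To solve the problem that users cannot obtain the expected results when a query request is too broad or too narrow, the ranking algorithm in the two-phase query mechanism is improved with the help of the data correlation graph, and a block similarity measurement algorithm is proposed. The algorithm brings the core idea of web page ranking algorithms from information retrieval into the storage system. It uses the duplicate data blocks found by deduplication as the basis for generating the data correlation graph and measuring data relevance, improving on existing similarity query and relevance computation methods. The solution reflects the internal structural characteristics of the data, reduces the query failure rate, and improves recall.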
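     Through in-depth research on the aspects above, and through the steps of model building, algorithm design, theoretical analysis, and experimental verification, content-aware technology and key information retrieval technologies are introduced into storage systems, improving their intelligence and information retrieval capability.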
With the continuing development of information technology, heterogeneous resources have grown explosively in recent years, and valid data is often lost in the ocean of information. It is hard to locate the required information quickly by relying on traditional query tools alone. Because modern information retrieval can access information in large-scale systems accurately and efficiently, it is considered the best way to solve this problem. Current research focuses on improving the intelligence of storage systems, enhancing the ability to search heterogeneous information, and increasing query accuracy. Because the storage and retrieval components are relatively independent, it is hard for a storage system to understand data content and optimize data layout for fast retrieval. To bring information retrieval technology into storage research and maximize query efficiency, a new mechanism for information organization, indexing, and retrieval is considered, and an overall solution is provided to integrate retrieval capability into the storage system.
     To address the lack of content awareness in storage systems, an information extension mechanism is proposed to transmit information through the storage stack. Upper-layer semantic information is extracted and stored as extended information. A metadata I/O channel, built on the traditional data I/O channel, then transfers the extended data to the lower storage system. By analyzing this additional information, the storage system perceives and uses the upper-layer semantic information to optimize overall system performance. Based on the information extension mechanism, a content-aware network storage prototype system is implemented.
     To take advantage of semantic information and duplicate-block information and deliver an efficient query service to users, a two-phase retrieval strategy is introduced. Query requests in a storage system come from two sources: metadata retrieval issued by administrators and ordinary keyword queries issued by users. Index structures can enhance query performance efficiently, but the deduplication and block similarity detection functions of a content-aware storage system are not used to improve this query processing. The proposed strategy combines metadata/keyword queries with block similarity queries and uses a ranking coefficient to evaluate similarity among query results. The retrieval algorithm thereby improves retrieval recall.
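The two-phase idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the thesis's implementation: the function names, the fixed-size blocking, the plain substring match standing in for an index lookup, and the Jaccard ratio standing in for the ranking coefficient.

```python
from hashlib import sha1


def block_hashes(data: bytes, block_size: int = 4) -> set:
    """Split data into fixed-size blocks and return the set of block digests."""
    return {sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


def jaccard(a: set, b: set) -> float:
    """Stand-in ranking coefficient: shared blocks divided by all blocks."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_phase_search(keyword: str, files: dict) -> list:
    """files maps file name -> bytes. Returns (name, score) pairs, best first."""
    blocks = {name: block_hashes(data) for name, data in files.items()}
    # Phase 1: keyword match (a real system would consult an inverted index).
    seeds = [name for name, data in files.items() if keyword.encode() in data]
    scores = {name: 1.0 for name in seeds}
    # Phase 2: add files that share duplicate blocks with any phase-1 hit.
    for name in files:
        if name in scores:
            continue
        best = max((jaccard(blocks[name], blocks[s]) for s in seeds), default=0.0)
        if best > 0.0:
            scores[name] = best
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this sketch a file that never mentions the keyword can still be returned in the second phase if it shares blocks with a first-phase hit, which is how block similarity can raise recall.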
     An index partition mechanism and a query cost model based on tiered storage are proposed. Index space grows as the number of files in the storage system increases significantly. However, not all indexes have the same access frequency, and some will never be retrieved after being generated. The index is therefore segmented according to access frequency, and inactive indexes are stored on low-speed storage devices to save cost. The effects on index partition performance, index space cost, and query precision are also considered.
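A rough sketch of frequency-based index partitioning with a simple cost estimate follows. The function names, the `hot_fraction` cut-off, and the latency-weighted cost formula are illustrative assumptions and are not taken from the thesis's cost model.

```python
def partition_index(access_counts: dict, hot_fraction: float = 0.5) -> tuple:
    """Split index entries into (hot, cold) sets by access count.

    access_counts maps an index entry (e.g. a term) to how often it was looked up.
    """
    if not access_counts:
        return set(), set()
    # Most frequently accessed entries first; ties broken by name.
    ranked = sorted(access_counts, key=lambda t: (-access_counts[t], t))
    cut = max(1, round(len(ranked) * hot_fraction))
    return set(ranked[:cut]), set(ranked[cut:])


def expected_lookup_cost(access_counts: dict, hot: set,
                         fast_ms: float, slow_ms: float) -> float:
    """Average lookup latency when hot entries sit on the fast tier."""
    total = sum(access_counts.values())
    if total == 0:
        return 0.0
    hot_hits = sum(c for t, c in access_counts.items() if t in hot)
    hit_rate = hot_hits / total
    return hit_rate * fast_ms + (1.0 - hit_rate) * slow_ms
```

With a skewed access distribution, moving the cold half of the entries to the slow tier changes the average lookup cost only slightly, which is the trade-off a cost model of this kind is meant to quantify.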
     A correlation graph construction method based on content hashing is proposed to satisfy similarity query requests. It is well known that a hierarchical structure is typically used to organize and manage data in storage systems. In this architecture, specific information is passed from one layer to another through a standard interface. This hides from each layer the information that does not concern it, but it also constrains the free flow of information between layers. To establish a hyperlink-like data structure in the storage system and generate global features that meet users' complex query requests, the barriers between layers are broken down to establish a stable correlation graph.
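One way to picture a correlation graph built from content hashes is sketched below. It is a simplified illustration under stated assumptions: fixed-size blocks, SHA-1 digests, and an edge weight equal to the number of distinct shared blocks. The function name and the graph representation are hypothetical.

```python
from collections import defaultdict
from hashlib import sha1
from itertools import combinations


def build_correlation_graph(files: dict, block_size: int = 4) -> dict:
    """files maps file name -> bytes.

    Returns an undirected weighted graph as {name: {neighbor: shared_block_count}}.
    Files that share no block with any other file do not appear in the graph.
    """
    owners = defaultdict(set)  # block digest -> files containing that block
    for name, data in files.items():
        for i in range(0, len(data), block_size):
            digest = sha1(data[i:i + block_size]).hexdigest()
            owners[digest].add(name)

    graph = defaultdict(dict)
    for names in owners.values():
        # Every pair of files holding the same block gets one more unit of weight.
        for a, b in combinations(sorted(names), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return dict(graph)
```

Duplicate blocks act as the links here, in the same way hyperlinks connect web pages, so files with no shared keywords or metadata can still end up connected.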
     Users cannot get the desired results when the submitted query terms are too broad or too narrow, so the ranking algorithm in the two-phase query mechanism needs to be extended. The enhanced algorithm adapts an information retrieval method to measure the similarity of query results in the storage system. The correlation graph and a block similarity algorithm based on deduplication are used to sort the query results. This solution better reflects the internal structural characteristics of the data, reduces the query failure rate, and improves the recall rate.
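As a hedged illustration of ranking over a correlation graph, the sketch below runs a PageRank-style power iteration on a weighted graph. The abstract does not specify the thesis's algorithm, so the damping factor, the iteration count, the function name, and the graph format are all assumptions.

```python
def rank_nodes(graph: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """PageRank-style scores for a weighted graph.

    graph maps node -> {neighbor: weight}; every neighbor must also be a key.
    """
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out_weight = sum(graph[u].values())
            if out_weight == 0:
                # Node without links: spread its score evenly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                # Pass score to neighbors in proportion to edge weight.
                for v, w in graph[u].items():
                    new_rank[v] += damping * rank[u] * w / out_weight
        rank = new_rank
    return rank
```

Query results could then be sorted by these scores, so that items sharing many blocks with other items rank higher.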
     Following the above research approach, through prototype modeling, algorithm design, theoretical analysis, and experimental verification, content-aware and information retrieval technologies are integrated into the storage system. Experiments indicate that storage intelligence and information retrieval capability are enhanced in the content-aware network storage system.

