基于数据压缩的信息检索技术的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
文本信息数量的飞速增长给传统的信息检索技术带来了新的挑战。在目前有关信息检索技术的研究中,基于数据压缩的信息检索技术是一个新的研究领域。由于使用这种技术能够降低文本信息的空间需求并提高查找速度,所以该技术具有较高的理论研究意义和应用前景。
     近年来国内外有一些学者在做关于这种技术的研究,但研究成果不多,研究工作也比较粗散零略。
     本文在分析总结前人研究成果的基础上提出了一种压缩纯英文文本数据库正文的压缩方法以及和此压缩方法结合使用的一些查找算法和解压缩算法。
     使用本文设计出的压缩方法能够将典型的英文文本数据库压缩到其原始大小的35%左右,要优于winzip等目前流行的压缩软件。此外由于检索可在压缩后的文本数据库上直接进行,所以能够提高查找速度。
     此外,作者在本文中还提出了一个基于“块编址”的压缩倒排索引结构以及在此压缩索引结构上的一些查找算法来获得更好的查找性能。
Information Retrieval technique based on data compression is a new research field, and the research on it is at the beginning.
    This paper presents a compression method for the pure-english text database and its corresponding decompression method as well as some search algorithms on the compressed text database. These algorithms draw lessons from the knowledge about semi-static model technique and word-based byte-oriented encoding method.
    The compression method can compress general English text database to about 35% of the original database size, which precedes to many popular compress softwares such as Winzip. Furthermore, retrieval which can be directly executed on the compressed text database increases efficiently the search speed.
    In addition, a new index structure which can support efficiently search on large full-text database is proposed in this paper, and some search algorithms based on the index structure are designed. Experimental results show these algorithms are very efficient.
引文
[1] T.C.Bell 等 "Data Compreesion in Full-Text Retrieval Systems", Journal of the American Society for Information Science.44(9) , 1993 pages 508-531.
    [2] Riccardo A.Baeza-yates, Gaston H.Gonnet 等 A new Approach to Text Searching". Communication of the ACM 35, Oct 1992, pages 74-82.
    [3] Nivio Ziviani, Edleeno Silva de Moura 等"Compression:A Key for Next-Generation Text Retrieval Systems" .IEEE Computer, vol 33,No 11 2000. pages 37-44.
    [4] Wu and U.Manber "Fast Text Searching Allowing Errors" Comm.ACM, Oct. 1992, pages 83-91.
    [5] R Baeza-yates ,G.Navarro"Fast Approximate String Matching" Algorithmica. Vol.23, No,2,1999. pages127-158.
    [6] M,Burrows , D.J Wheeler "A Block-Sorting Lossless Data Compression Algorithm" Technique Report ,Digital Equipment Corporation.
    [7] Edleno Silva de Moura ,Gonzalo Navarro 等" Fast and Flexible Word Seraching On Compressed Text" ACM Transactions on Information Systems,Vol,18,No.2,April 2000, pages 113-139.
    [8] Moffat A, Katajanien J "In-palce calculation of minimum-redundancy codes" In S.Akl,F.DEHNE,AND J,-R.SACK Eds., Proc .Workshop on algorithms and data structures(Queen's Univercity, Kingston,Ontario,August 1995) ,pages 393-402. LNCS 955 Springer-Verlag. 1995
    [9] Sun Wu, Udi Manber "Fast Text Searching Allowing Errors" Communication of the ACM ,October 1992, vol 35, No.10
    
    
    [10] Udi Manber , Sun Wu " GLIMPSE:A Tool to Search Through Entire File Systems" In Proceedings of the USENIX Winter 1994 Technical Conferene. Pages 23-32,1994
    [11] Alistair Moffat, Andrew Turpin "On the Implementation of Minimum Redundancy Prefix Codes" IEEE Trans.comm.,vol.45,NolO , pages 1200-1207, October 1997.
    [12] Justin Zobel ,Alistair Moffat,Ron Sack-Davis "An Efficient Indexing Technique for Full-Text Database Systems" Proceedings of the 18~(th) VLDB Conference Vancouver,British Columbia,Canada 1992.
    [13] Hughe Williams, Justin Zobel "Compressing Integers for Fast File Access" THE COMPUTER JOURNAL, Vol 42,NO,3 ,1999
    [14] Yusuke Shibata, Teteuya Matsumoto "A Boyer-Moore type algorithm for compressed pattern matching" DOI Technical Report DOI-TR-170 1998
    [15] Justin Zobel, Alistair Moffat, Ron Sack-Davis "Searching Large Lexicons for Partially Terms using Compressed Inverted Files" Proceedings of the 19~(th) VLDB Conference Dublin,Ireland,1993.
    [16] Christof Monz , Maaten de Rijke "Inverted Index Construction" .Instrution to information Retrieval ,Spring 2001.
    [17] Justin Zobel , Alistair Moffat ,Kotagiri Ramamohanarao "Inverted Files Versus Signature Files for Text Indexing" ACM Transactions on Database systems,Vol.23,No 4,December 1998, pages 453-490.
    [18] Eric W.Brown , James P.Callan , W.Bruce Croft "Fast Incremental Indexing for Full-Text Information Retrieval" Proceeding of the 20~(th) VLDB Conference Santiago,Chile,1994.
    [19] Kunihiko 等 "Fast algorithms for k-word Proximity Search" IEICE TRANSFUNDAMENTALS,VOL.E84-A,No 9, September 2001
    
    
    [20] Shmuel T.Klein , Dana Shapira "A New Compression Method for Compressed Matching". Proc. Data Compression Conference DCC-2000, Snowbird, Utah (2000) . pages 400-409.
    [21] Jiri Dvorsky, Jaroslav Pokorny ,Vaclav Snasel "Word-based Compression Methods and Indexing for Text Retrieval Systems". Springer Verlag, Berlin Heidelberg, 1999, stat ve sborniku. pages 61-74.
    [22] Gonzalo Navarro "Regular Expression Searching over Ziv-Lempel Compression Text" In Proc. 12th Combinatorial Pattern Matching Conference (CPM'Ol), pages 1-17, 2001.
    [23] "Data Compression". http:// www.ics.uci.edu/~dan/pubs/
    [24] Ricardo Baeza-Yates ,Gonzalo Navarro "Block Addressing Indices for ApproximateText Retrieval" In Proc .ACM CIKM'97, pages 1-8,1997. Extend version appear in JASIS.
    [25] Edleno Silva de Moura , Gonzalo Navarro ,Nivio Ziviani "Indexing Compressed Text", Proceeding of the 4~(th) south American Workshop on String Proceeding.Caleton university Press 1997.
    [26] Gonzalo Navarro , Edleno Silva de Moura ,Nivio Ziviani 等 "Adding Compression to Block Addressing Inverted Indices "Information Retrieval Journal, vol .3 ,No 1, 2000 pp.49-77
    [27] Ricardo Baeza-Yates ,Gonzalo Navarro "Multiple Approximate String Matching ". Proceedings of the 5th Workshop on Algorithms and Data Structures 1997
    [28] Tim Bell ,Matt Powell ,Amar Mukherjee " Searching BWT compressed text with the Boyer-Moore Algorithm and binary search" Data Compression Conference (DCC) 2001.
    [29] Andrej Brodnik , Svante Carlsson "Sub-linear Decoding of Huffman
    
     Codes Almost In-Place" Information Processing Letters .No ,26 pages 237-241. 1998
    [30] Ruy Luiz Milidiu , Artur Alves Pessoa 等 "In-place Length-Restricted Prefix Coding" .Proceeding of the forth lartin American work shop on String Proceeding .Volume 8 of International Informatinal Series. 80-94 Chile 1997 Carlenton Univercity Press.
    [31] Edleno Silva de Moura , Nivio Ziviani 等"Fast Searching on Compressed Text Allowing Errors" In Proceedings of the 21~(st) Annual International ACM SIGIR Conference on Research and Development in information Retrieval. 1998
    [32] Chung-Hung Lai , Tien-Fu Chen "Compressing Inverted Files in Scalable Information Systems by Binary Decision Diagram Encoding". http://www.sc2001. org/papers/
    [33] Takuya Kida , Yusuke Shibata 等 "A Unifying Framework for Compressed Pattern Matvhing" . SPIRE/CRIWG 1999 pages 89-96
    [34] Ricardo Baeza-Yates ,Gonzalo Navarro "Fast Approximate String Matching in a Dictionary". In Proc. 5~(th) Suth American symposium on String Proceeding and Information Retrieval (SPIRE'98) Page 14-22 IEEE CS Press 1998.
    [35] Jiri Dvorsky "Text Compression with Random Access" . Proceedings of ISM 2000, Roznov p.R. 2000.
    [36] Kunihiko Sadakane "A Modified Burrows-Wheeler Transformation for case-insenditive Search with Application to Suffix Array Compression". In proceeding of IEEE Data Compreesion Conference (DCC'99) Page: 548-5. 60 1999 poster session.
    
    
    [37] Josha Bach, Ian h.Witten "Lexical Attraction for Text Compression". Proceedings of the Data Compression Conference 1999.
    [38] Ricardo Bzeza-Yates , Gonzalo Navarro, "Multiple Approximate String Matching". Proceedings of the 5th Workshop on Algorithms and Data Structures. 1997
    [39] Marrio Drumond Araujo , Gonzalo Navarro , Nivio Ziviani "Large Text Searching Allowing Errors". In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.
    [40] Zobel. J , Moffat. A "Adding Compression to a full-text retrieval system" Software Practice and Experience 25,8 page:891-903.
    [41] Moffa .A , Katajainen.J "In-Place calculation of minimum-redundancy codes" Proc .Workshop on Algorithms and Data Structures, pp 393-402 1995
    [42] Bernhard Balkenhol , Stefan Kurtz , Yuri M.Shtarkov " Modifications of the Burrows and Wheeler Data Compression Algorithm". Proc. IEEE Data Compression Conference March 1999. pages 188-197
    [43] 吴尔南.数据压缩.电子工业出版社.2000年. pages 3-10.
    [44] 袁玫,袁文.数据压缩技术及其应用. pages 1-17.
    [45] J.Ziv and A.Lempel "A universal algorithm for sequential data compression." IEEE Trans.Inf. Theory ,23: 337-343,1977.
    [46] J.Ziv and A.Lempel "Compression of individual sequences via variable length coding. IEEE Trans.Inf. Theory ,24: 530-536. 1978.
    [47] T.Welch. "A technique for high performance data compression". IEEE Computer Magazine, 17(6) : 8-19 June 1984.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700