编码单位可变的倒排索引压缩算法研究
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:Research on Inverted Index Compression Algorithm with Variable Coding Units
  • 作者:安兆翔 ; 瞿有利
  • 英文作者:AN Zhaoxiang;QU Youli;School of Computer and Information Technology, Beijing Jiaotong University;
  • 关键词:倒排索引 ; 索引压缩 ; 可变单位 ; 分区优化
  • 英文关键词:inverted index;;index compression;;variable unit;;partition optimization
  • 中文刊名:JSGG
  • 英文刊名:Computer Engineering and Applications
  • 机构:北京交通大学计算机与信息技术学院;
  • 出版日期:2018-09-15 09:09
  • 出版单位:计算机工程与应用
  • 年:2019
  • 期:v.55;No.934
  • 基金:中央高校基本科研业务费专项资金(No.2015JBM035)
  • 语种:中文;
  • 页:JSGG201915011
  • 页数:7
  • CN:15
  • 分类号:87-93
摘要
倒排索引是大多数大型文本搜索系统的核心数据结构,索引压缩可以有效地减少倒排索引的空间占用,提升检索效率。针对倒排索引压缩算法中的字节对齐编码进行研究,对于其压缩率不够优秀的问题,提出了分区可变单位编码(PVU编码)。算法以可变单位方式代替固定字节存储,使实际存储空间更加贴合原码长度,从而提高压缩效果。针对序列均匀分区并非最优分区的问题,提出将最优分区问题转化为图论中最短路径问题的方法,使用Dijkstra算法求解序列的最优编码分区。通过对比实验验证了改进优化的PVU编码相较于传统的字节对齐编码能够更好地压缩倒排索引序列。
        The inverted index is the data structure at the core of most large-scale search systems. Index compression can effectively reduce the space occupied by inverted index and improve retrieval efficiency. To solve the problem of poor compression of byte-aligned coding, Partitioned Variable Unit(PVU)coding is proposed. Instead of storing multiple bytes, PVU stores integers in multiple units. Units include multiple values. This method makes the storage space more suitable for the original code length and has a better compression effect. To solve the optimal partition problem, the partition problem is transformed into the shortest path problem in graph theory. Dijkstra's algorithm is designed to solve this problem. Experiments show that the optimized PVU can compress the inverted index sequence better than the traditional byte-aligned coding.
引文
[1] Manning C D,Raghavan P,Schütze H.An introduction to information retrieval[J].Journal of the American Society for Information Science&Technology,2008,61(4):852-853.
    [2] Sebastiano V.Quasi-succinct indices[C]//Proceedings of the Sixth ACM International Conference on Web Search and Data Mining,2013:83-92.
    [3] Elias P.Universal codeword sets and representations of the integers[J].IEEE Transactions on Information Theory,1975,21(2):194-203.
    [4] Golomb S W.Run-length encodings[J].IEEE Transactions on Information Theory,1966,12(3):399-401.
    [5] Thiel L H,Heaps H S.Program design for retrospective searches on large data bases[J].Information Storage&Retrieval,1972,8(1):1-20.
    [6] Cutting D,Pedersen J.Optimization for dynamic inverted index maintenance[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval,1989:405-411.
    [7] Dean J.Challenges in building large-scale information retrieval systems:invited talk[C]//ACM International Conference on Web Search and Data Mining,2009.
    [8] Oberoi P S,Oberoi P S,Oberoi P S,et al.SIMD-based decoding of posting lists[C]//ACM International Conference on Information and Knowledge Management,2011:317-326.
    [9] Lemire D,Kurz N,Rupp C.Stream VByte:faster byteoriented integer compression[J].Information Processing Letters,2018,130:1-6.
    [10] Anh V N,Moffat A.Index compression using 64 bit words[J].Software Practice&Experience,2010,40(2):131-147.
    [11] Anh V N,Moffat A.Inverted index compression using wordaligned binary codes[J].Information Retrieval,2005,8(1):151-166.
    [12] Yan H,Ding S,Suel T.Inverted index compression and query processing with optimized document ordering[C]//International Conference on World Wide Web,Madrid,Spain,April 2009:401-410.
    [13]张旭东,孙志明,刘亚宁,等.基于64位体系结构的倒排索引压缩算法[J].计算机工程,2014,40(2):71-76.
    [14] G?del K.Uber formal unentscheidbare s?tze der principia mathematica und verwandter system I[J].Monatshefte Für Mathematik Und Physik,1931,38(1):173-198.
    [15]白福均,高建瓴,李宛蓉,等.一种倒排索引压缩方法[J].计算机应用研究,2019(1).
    [16]史亮,张鸿,刘欣然,等.倒排索引中的文档序号重排技术综述[J].中文信息学报,2015,29(2):24-32.
    [17] Zukowski M,Heman S,Nes N,et al.Super-scalar RAM-CPU cache compression[C]//International Conference on Data Engineering,2006:59.
    [18] Lemire D,Boytsov L.Decoding billions of integers per second through vectorization[J].Software Practice&Experience,2012,45(1):1-29.
    [19]李俊廷.基于分区的倒排索引压缩算法研究[D].北京:北京交通大学,2017.
    [20] Floyd R W.Algorithm 97:shortest path[J].Communications of the ACM,1962,5(6):345.
    [21] Dijkstra E W.A note on two problems in connexion with graphs[J].Numerische Mathematik,1959,1(1):269-271.
    [22] Bellman R.On a routing problem[J].Quarterly Applied Mathematics,1958,16(1):87-90.
    [23]段凡丁.关于最短路径的SPFA快速算法[J].西南交通大学学报,1994,29(2):207-212.
    [24] Ottaviano G,Venturini R.Partitioned elias-fano indexes[C]//International ACM Sigir Conference on Research&Development in Information Retrieval,2014:273-282.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700