一种倒排索引压缩方法
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:Method of inverted index compression
  • 作者:白福均 ; 高建瓴 ; 李宛蓉 ; 贺思云 ; 肖绍武
  • 英文作者:Bai Fujun;Gao Jianling;Li Wanrong;He Siyun;Xiao Shaowu;College of Big Data & Information Engineering Guizhou University;Archives Guizhou University;
  • 关键词:搜索引擎 ; 倒排索引 ; 索引压缩 ; 人工蜂群算法 ; ASCS算法
  • 英文关键词:search engine;;inverted index;;index compression;;artificial bee colony;;ASCS algorithm
  • 中文刊名:JSYJ
  • 英文刊名:Application Research of Computers
  • 机构:贵州大学大数据与信息工程学院;贵州大学档案馆;
  • 出版日期:2018-02-08 17:13
  • 出版单位:计算机应用研究
  • 年:2019
  • 期:v.36;No.327
  • 基金:贵州省档案局科研资助项目(2015D001);; 贵州省科学技术基金资助项目(黔科合J字[2015]2045);; 贵州大学研究生创新基金资助项目(研理工2017014,研理工2017016)
  • 语种:中文;
  • 页:JSYJ201901025
  • 页数:4
  • CN:01
  • ISSN:51-1196/TP
  • 分类号:112-115
摘要
针对自适应分段压缩ASCS算法进行了研究,对于ASCS算法中采用的均匀分段方式并非最优分段问题,提出以人工蜂群算法优化ASCS算法中的分段方式;对于ASCS算法考虑序列占用空间的影响因素过于单一问题,提出多因素下的改进算法;对于分布不均的长序列在ASCS算法下压缩率不理想的问题,提出先排序后差分编码操作再以ASCS算法压缩。通过对比实验证明,优化改进后的算法可以较显著地压缩倒排索引。
        This paper proposed segmentation method optimized by ABC algorithm in ASCS algorithm for the problem of ASCS algorithm,which adopted uniform segmentation instead of optimal segmentation. The ASCS algorithm only considered an influencing factor and ignored the influence of other factors. The ratio of compressing long sequence of uneven distribution was unsatisfactory with ASCS algorithm,it adopted the process integer sequence with sorting and differential encoding before ASCS algorithm. Simulation experiments show that the improved algorithm has significantly increased compression ratio of inverted index file comparing with ASCS algorithm.
引文
[1] ManningC D,Raghavan P,Schütze H. Introduction to information retreival[M]. Cambridge:Cambridge University Press,2009:67-68.
    [2] Witten Z H,Moffat A,Bell T C. Managing gigabytes:compressing and indexing documents and images[M]. 2nd ed.[S. l.]:Morgan Kaufmann,2001:79-80.
    [3] Ounis I,Amati G,Plachouras V,et al. Terrier:a high performance and scalable information retrieval platform[EB/OL].(2006-06-26). https://www. researchgate. net/publication/250803036_Terrier_A_High_Performance_and_Scalable_Information_Retrieval_Platform.
    [4] Elias P. Universal codeword sets and representations of the integers[J]. IEEE Trans on Information Theory,1975,21(2):194-203.
    [5] Manber U,Myers G. Suffix arrays:a new method for on-line string searches[J]. SIAM Journal on Computing,1990,22(5):935-948.
    [6] Golomb S W. Run-length encodings[J]. IEEE Trans on Information Theory,1966,12(3):399-401.
    [7] Rice R,Plaunt J. Adaptive variable-length coding for efficient compression of spacecraft television data[J]. IEEE Trans on Communication Technology,1971,19(6):889-897.
    [8] Somasundaram K,Domnic S. Extended GOLOMB code for integer representation[J]. IEEE Trans on Multimedia,2007,9(2):239-246.
    [9] Domnic S,Glory V. Inverted file compression using EGC and FEGC[J]. Procedia Technology,2012,6(12):493-500.
    [10]Glory V,Domnic S. Re-ordered FEGC and block based FEGC for inverted file compression[J]. International Journal of Information Retrieval Research,2013,3(1):71-88.
    [11]毛福林.倒排索引压缩算法研究[D].北京:北京交通大学,2015.(Mao Fulin. Research on inverted index compression algorithm[D]. Beijing:Beijing Jiaotong University,2015.)
    [12]闫宏飞,张旭东,单栋栋,等.基于指令级并行的倒排索引压缩算法[J].计算机研究与发展,2015,52(5):995-1004.(Yan Hongfei,Zhang Xudong,Shan Dongdong,et al. Inverted index compression algorithm based on instruction level parallelism[J]. Journal of Computer Research and Development,2015,52(5):995-1004.)
    [13]Williams H E,Zobel J. Compressing integers for fast file access[J].Computer Journal,1999,42(3):193-201.
    [14]Anh V N,Moffat A. Inverted index compression using word-aligned binary codes[J]. Information Retrieval,2005,8(1):151-166.
    [15] Goldstein J,Ramakrishnan R,Shaft U. Compressing relations and indexes[C]//Proc of the 14th International Conference on Data Engineering. Piscataway,NJ:IEEE Press,1998:370-379.
    [16]Zukowski M,Heman S,Nes N,et al. Super-scalar RAM-CPU cache compression[C]//Proc of the 22nd International Conference on Data Engineering. Piscataway,NJ:IEEE Press,2006:59.
    [17]Yan Hao,Ding Shuai,Suel T. Inverted index compression and query processing with optimized document ordering[C]//Proc of the 18th International Conference on World Wide Web. New York:ACM Press,2009:401-410.
    [18]Lemire D,Boytsov L. Decoding billions of integers per second through vectorization[J]. Journal of Software:Practice and Experience,2015,45(1):1-29.
    [19]特日跟,江晟,李雄飞,等.基于整数数据的文档压缩编码方案[J].吉林大学学报:工学版,2016,46(1):228-234.(Te Rigen,Jiang Sheng, Li Xiongfei, et al. Document compression scheme based on integer data[J]. Journal of Jilin University:Eng and Technol Ed,2016,46(1):228-234.)
    [20]G9del K.Uber formal unentscheidbare satze der principia mathematica und verwandter system,I[J]. Monatshefte Für Mathematik,1931,38(1):173-198.
    [21]秦全德,程适,李丽,等.人工蜂群算法研究综述[J].智能系统学报,2014,9(2):127-135.(Qin Quande,Cheng Shi,Li li,et al. Artificial bee colony algorithm:a survey[J]. CAAL Trans on Intelligent Systems,2014,9(2):127-135.)

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700