Fragmented BWT: An Extended BWT for Full-Text Indexing
详细信息    查看全文
  • 关键词:Suffix array ; Burrows ; Wheeler transform ; Full ; text indexing
  • 刊名:Lecture Notes in Computer Science
  • 出版年:2016
  • 出版时间:2016
  • 年:2016
  • 卷:9954
  • 期:1
  • 页码:97-109
  • 全文大小:592 KB
  • 参考文献:1.Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Algorithm Data Compression (124), p. 18 (1994)
    2.Claude, F., Navarro, G.: The wavelet matrix. In: SPIRE, pp. 167–179 (2012)
    3.Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2000)MathSciNet CrossRef MATH
    4.Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)MathSciNet CrossRef MATH
    5.Grossi, R., Gupta, A., Vitter, S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–850 (2003)
    6.Hayashi, S., Taura, K.: Parallel and memory-efficient Burrows-Wheeler transform. In: Proceedings - 2013 IEEE International Conference on Big Data, pp. 43–50 (2013)
    7.Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Colloquium on Automata, Languages and Programming, pp. 943–955 (2003)
    8.Kärkkäinen, J., K.D., S., P.: Parallel external memory suffix sorting. In: CPM 2015, pp. 329–342 (2015)
    9.Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), 1 (2009)CrossRef
    10.Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)CrossRef
    11.Li, R., Yu, C., Li, Y., Lam, W., Yiu, M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009)CrossRef
    12.Manber, U., Myers, G.: Suffix string arrays: a new searches method for on-line. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 319–327 (1990)
    13.Nong, G., Zhang, S., Chan, H.: Linear suffix array construction by almost pure induced-sorting. In: 2009 Data Compression Conference, pp. 193–202 (2009)
    14.Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)MathSciNet CrossRef MATH
  • 作者单位:Masaru Ito (16)
    Hiroshi Inoue (17)
    Kenjiro Taura (16)

    16. Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
    17. IBM Research, Tokyo, Japan
  • 丛书名:String Processing and Information Retrieval
  • ISBN:978-3-319-46049-9
  • 刊物类别:Computer Science
  • 刊物主题:Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • 出版者:Springer Berlin / Heidelberg
  • ISSN:1611-3349
  • 卷排序:9954
文摘
This paper proposes Fragmented Burrows Wheeler Transform (FBWT), an extension to the well-known BWT structure for full-text indexing and searching. A FBWT consists of a number of BWT fragments each covering only a subset of all the suffixes of the original string. As constructing FBWT does not entail building the BWT of the whole string, it is faster than constructing BWT. On the other hand, searching with FBWT can be more costly than that with BWT, since searching the former requires searching all fragments; its amount of work is \(O(dp + {\textit{occ}}\log ^{1+\epsilon }n)\) as opposed to \(O(p + {\textit{occ}}\log ^{1+\epsilon }n)\) of regular BWT, where p is the length of the query string, n the length of the original text, occ the occurrences of the query string, and d the number of fragments. To compensate the search cost, searching with FBWT can be accelerated with SIMD instructions by searching multiple fragments in parallel. Experiments show that building FBWT is about twice as fast as building BWT via a state of the art algorithm (SA-IS); and that FBWT’s search performance compared to BWT’s depends on the number of occurrences, ranging from four times slower than BWT (when there are few occurrences), to twice as fast as BWT (when there are many).

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700