Mining approximate patterns with frequent locally optimal occurrences

Authors: Atsuyoshi Nakamura^a (corresponding author, atsu@main.ist.hokudai.ac.jp); Ichigaku Takigawa^a; Hisashi Tosaka^b; Mineichi Kudo^a; Hiroshi Mamitsuka^c
Keywords: Alignment; Frequent pattern mining; String; Ordered tree; DNA
Journal: Discrete Applied Mathematics
Year: 2016
Publication date: 19 February 2016
Volume: 200
Issue: Complete
Pages: 123-152
Full text size: 2505 K

Abstract

We consider a frequent approximate pattern mining problem, in which interspersed repetitive regions are extracted from a given string. That is, we enumerate substrings that frequently match substrings of a given string locally and optimally. For this problem, we propose a new algorithm, in which candidate patterns are generated without duplication using the suffix tree of a given string. We further define a k-gap-constrained setting, in which the number of gaps in the alignment between a pattern and an occurrence is limited to at most k. Under this setting, we present memory-efficient algorithms, particularly a candidate-based version, which runs fast enough even over human chromosome sequences with more than 10 million nucleotides. We note that our problem and algorithms for strings can be directly extended to ordered labeled trees. In our experiments, we used both randomly synthesized strings, in which corrupted similar substrings are embedded, and real human chromosome data. The synthetic data experiments show that our proposed approach extracted embedded patterns correctly and time-efficiently. In the real data experiments, we examined the centers of 100 clusters computed after grouping the patterns obtained by our k-gap-constrained versions (k = 0, 1 and 2), and the results revealed that the regions of their occurrences coincided with about half of the regions automatically annotated as Alu sequences by a manually curated repeat sequence database.
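
The k-gap-constrained setting described in the abstract can be illustrated with a small dynamic program. The sketch below is not the authors' algorithm and does not reproduce their memory-efficient or candidate-based versions; it only computes the best local alignment score between a pattern and a text when at most k gap symbols are allowed. The function name `k_gap_local_score`, the scoring values (match +1, mismatch -1, gap -1), and the convention that each inserted or deleted symbol counts as one gap are all assumptions made for illustration.

```python
def k_gap_local_score(pattern, text, k, match=1, mismatch=-1, gap=-1):
    """Best local alignment score using at most k gap symbols.

    Hypothetical helper: the scoring scheme and the gap-counting
    convention (one gap per inserted or deleted symbol) are assumptions,
    not definitions taken from the paper.
    """
    m, n = len(pattern), len(text)
    # H[g][i][j]: best score of a local alignment that ends at
    # pattern[i-1] and text[j-1] and uses at most g gaps.
    # The value 0 stands for the empty alignment.
    H = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(k + 1)]
    best = 0
    for g in range(k + 1):
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                s = match if pattern[i - 1] == text[j - 1] else mismatch
                # Extend along the diagonal without spending a gap.
                cand = max(0, H[g][i - 1][j - 1] + s)
                if g > 0:
                    # Spend one gap: skip a pattern symbol or a text symbol.
                    cand = max(cand,
                               H[g - 1][i - 1][j] + gap,
                               H[g - 1][i][j - 1] + gap)
                H[g][i][j] = cand
                best = max(best, cand)
    return best


if __name__ == "__main__":
    pattern = "AACCGGTT"
    text = "AACCAGGTT"  # the pattern with one extra symbol inserted
    for k in range(3):
        print(k, k_gap_local_score(pattern, text, k))
```

With k = 0 the pattern can only match one half of the text at a time, while allowing a single gap lets the alignment bridge the inserted symbol. The table uses O(k·m·n) memory, which is acceptable for short examples but far from what chromosome-scale input would require.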