Research on Non-Coding RNA Mining Methods Based on Secondary Structure
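Abstract
Non-coding RNA research is currently one of the most important topics in bioinformatics. Since the start of the 21st century, work on non-coding RNA has repeatedly been named among Science's top ten scientific breakthroughs of the year, and in 2006 it was recognized with the Nobel Prize in Physiology or Medicine. A growing number of bioinformatics researchers are working to mine non-coding RNA from existing sequencing data and to analyze its function. Current mining methods, however, still suffer from low efficiency, high false-positive rates, and an inability to discover new families. Starting from an analysis of RNA structure, this thesis therefore combines and improves classification learning methods to study several key problems in non-coding RNA mining in depth.
     The main contributions of this thesis are: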
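     (1) A classification method is proposed for the imbalanced training samples that are common in bioinformatics. Bioinformatics contains many learning problems in which positive and negative examples are imbalanced, partly because of the real-world distribution and partly because obtaining positive examples costs far more than obtaining negative ones. This thesis proposes a classification method for such imbalance, aimed at problems such as snoRNA identification, microRNA precursor discrimination, and distinguishing true from false SNP sites. The method follows the idea of ensemble learning: the negative set is split evenly and each part is combined in turn with the positive set, giving a group of class-balanced training sets; each training set is then used to train a classifier based on a different principle; finally, the classifiers vote on each test sample. To keep weak classifiers from degrading the vote, the method borrows from AdaBoost: the samples that each classifier misclassifies during training are added to the training sets of the next two classifiers. This avoids AdaBoost's repeated training while using the voting mechanism to curb the influence of weak classifiers. Experiments on five UCI data sets and three bioinformatics tasks demonstrate the advantage of the method on class-imbalanced classification problems. In addition, a software package based on the method, libID, was developed for use by other researchers.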
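     (2) A “centroid” representation of RNA secondary structure is proposed, together with secondary structure prediction algorithms based on it.
     None of the current representations of RNA secondary structure allows a fast measure of how similar the secondary structures of two RNA molecules are. To address this, the thesis introduces the concept of a “centroid” to describe the position of each stem in an RNA molecule, and derives further concepts such as “centroid distance” and the “D function” to characterize the similarity between stems and between secondary structures. Building on this fast similarity measure, the thesis improves both comparative sequence analysis and the minimum free energy method. For comparative sequence analysis, it proposes a prediction algorithm that is independent of multiple sequence alignment; for the minimum free energy method, it incorporates RNA class information to further improve prediction.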
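     (3) The two current approaches to mining microRNA are studied, and some of their key problems are analyzed and discussed in depth.
     Homology search and ab initio prediction are the two current approaches to mining microRNA. Homology search is the dominant approach today; this thesis proposes an alignment search algorithm based on a keyword tree that improves search accuracy while reducing expected running time. Applied to soybean and silkworm, the method gave good results in both cases. Ab initio prediction builds on machine learning and is the direction of future development. It is well suited to discovering new families, but locating the mature microRNA has long been its bottleneck. The thesis examines this problem from two angles and obtains fairly accurate results. Although the bottleneck is not fully resolved, this work lays a foundation for further study.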
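     (4) snoRNA is mined by combining the proposed secondary structure prediction algorithm with the proposed classification algorithm for class-imbalanced samples.
     Most current snoRNA mining methods rely on target information. With the discovery of “orphan” snoRNAs and other new functional snoRNAs, mining methods that are independent of target information have drawn increasing attention. Compared with existing methods, this thesis introduces exon sequences into the training set, extracts more discriminative secondary structure features, and applies the proposed classifier for class-imbalanced data, yielding a more effective and accurate snoRNA mining method. In particular, the thesis also proposes an effective secondary structure prediction algorithm tailored to the special secondary structure of snoRNA and applies it to feature extraction during mining, which is stated to be the first time this has been done internationally. Cross-validation and mining experiments on genome fragments demonstrate the effectiveness of the method.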
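The ensemble scheme described in item (1) can be illustrated with a minimal Python sketch. It is not the libID implementation: the two toy classifiers, the function names `train_ensemble` and `predict`, the label convention (1 for positive, 0 for negative), and the demo data are all assumptions made for illustration. Only the overall flow (split the negatives evenly, pair each part with the positives, train classifiers of different kinds, pass misclassified samples to the next two training sets, then vote) follows the abstract.

```python
import random
from collections import Counter


def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


class NearestCentroid:
    """Toy classifier (assumption): assign the label of the closest class mean."""

    def fit(self, X, y):
        self.means = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.means[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.means, key=lambda lab: _dist2(x, self.means[lab]))


class ThreeNearestNeighbours:
    """Toy classifier (assumption): majority label among the 3 closest points."""

    def fit(self, X, y):
        self.points = list(zip(X, y))
        return self

    def predict(self, x):
        nearest = sorted(self.points, key=lambda p: _dist2(x, p[0]))[:3]
        return Counter(t for _, t in nearest).most_common(1)[0][0]


def train_ensemble(pos, neg, n_parts,
                   factories=(NearestCentroid, ThreeNearestNeighbours), seed=0):
    """Train one classifier per balanced subset of the data."""
    neg = list(neg)
    random.Random(seed).shuffle(neg)
    # Split the negative set into n_parts parts of (nearly) equal size.
    chunks = [neg[i::n_parts] for i in range(n_parts)]
    # carried[i] holds samples misclassified by earlier classifiers.
    carried = [[] for _ in range(n_parts + 2)]
    models = []
    for i, chunk in enumerate(chunks):
        samples = [(x, 1) for x in pos] + [(x, 0) for x in chunk] + carried[i]
        X = [x for x, _ in samples]
        y = [t for _, t in samples]
        # Alternate between classifier types so neighbours differ in principle.
        clf = factories[i % len(factories)]().fit(X, y)
        # Pass this classifier's training errors to the next two training sets.
        wrong = [(x, t) for x, t in samples if clf.predict(x) != t]
        carried[i + 1].extend(wrong)
        carried[i + 2].extend(wrong)
        models.append(clf)
    return models


def predict(models, x):
    """Majority vote over all trained classifiers."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]


# Made-up demo data: 6 positives, 30 negatives, split into 5 balanced sets.
pos = [(3.0 + 0.1 * i, 3.0 - 0.1 * i) for i in range(6)]
neg = [(0.05 * i, 0.03 * i) for i in range(30)]
models = train_ensemble(pos, neg, n_parts=5)
print(predict(models, (3.2, 2.8)), predict(models, (0.2, 0.1)))
```

Using an odd number of parts avoids tied votes in the two-class case. Unlike AdaBoost, each classifier here is trained only once, which matches the abstract's remark that the scheme avoids repeated training.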
Non-coding RNA is one of the most important topics in bioinformatics. In recent years, research on non-coding RNA has repeatedly been named among the top ten scientific breakthroughs of the year, and it was recognized with a Nobel Prize in 2006. More and more bioinformatics researchers devote themselves to mining non-coding RNA and analyzing its function. However, current mining methods have low efficiency and high false-positive rates. In this thesis, I therefore develop secondary structure prediction algorithms, improve machine learning methods for imbalanced data, and study the mining of non-coding RNA in depth.
     The contributions of the dissertation are as follows:
     (1) A classification method is proposed for class-imbalance learning problems in bioinformatics.
     There are many class-imbalance learning problems in bioinformatics. This is due partly to the natural distribution of the data and partly to the fact that positive samples cost far more to obtain than negative ones. A novel classification method is proposed for training on class-imbalanced data, for tasks such as identifying snoRNA, distinguishing microRNA precursors from pseudo ones, and mining SNPs from EST sequences. The method is based on the main idea of ensemble learning. First, the negative set (the large class) is randomly divided into several equal-sized subsets. Each subset together with the positive set forms a class-balanced training set. Then several different classifiers are selected and trained on these balanced training sets. Once the classifiers are built, they vote on the final prediction for each new sample. In the training phase, a strategy similar to AdaBoost is used: the samples misclassified by each classifier are added to the training sets of the next two classifiers. Combined with voting, this strategy limits the influence of weak classifiers. Experiments on five UCI data sets and three bioinformatics tasks demonstrate the performance of the method. Furthermore, a software program named libID has been developed.
     (2) The “centroid of a stem” is proposed as a novel concept in this thesis, and two novel algorithms are developed based on it.
     Current representations do not allow RNA secondary structures to be compared quickly. In this thesis, the novel concept “centroid of a stem” is proposed for describing the position of a stem, and further concepts, such as the “distance between centroids” and the “D function”, are derived for measuring the difference between secondary structures. Both the comparative sequence analysis method and the minimum free energy method are improved on the basis of these concepts. For comparative sequence analysis, a novel prediction algorithm is proposed that is independent of multiple sequence alignment; for the minimum free energy method, prediction performance is improved by incorporating class information.
     (3) Approaches to mining microRNA and their key problems are discussed in depth.
     Homology search and ab initio prediction are the two approaches to mining microRNA. Homology search is currently the main approach. In this thesis, a novel search method based on a keyword tree is proposed, which reduces running time while maintaining sensitivity. Applications to soybean and silkworm demonstrate the performance of the method. Ab initio prediction is based on machine learning and is expected to become the main mining approach in the future. It can find new microRNA families; however, locating the mature part is the bottleneck. In this thesis, I examine this problem from two points of view. Although I have not solved the problem completely, this work provides a basis for further research.
     (4) An algorithm for mining snoRNA is developed based on the secondary structure prediction and class-imbalance learning methods described above.
     At present, snoRNAs are mined mainly on the basis of target information. As more snoRNA functions are uncovered, and especially since the discovery of “orphan” snoRNAs, ab initio mining methods have attracted attention because they are independent of target information. In this thesis, a novel ab initio snoRNA gene mining algorithm is proposed, based on ensemble learning and a specialized secondary structure prediction algorithm. Three contributions improve on current mining methods: enriching the negative training set, using ensemble classifiers for class-imbalanced data, and developing a specialized secondary structure prediction algorithm for extracting high-quality features, which to our knowledge has not been done before. The performance of the learning method is demonstrated by cross-validation, and that of the mining method by experiments on genome data.
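The keyword tree mentioned in item (3) is commonly built as a trie with failure links, as in Aho-Corasick multi-pattern matching. The abstract does not specify how the thesis uses the tree for homology search, so the sketch below is only an assumption of the basic building block: exact matching of several short seed strings against a sequence in a single pass. The function names and the example sequences are made up for illustration.

```python
from collections import deque


def build_keyword_tree(patterns):
    """Build a keyword tree (trie) with Aho-Corasick failure links."""
    goto, fail, out = [{}], [0], [[]]   # node 0 is the root
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    # Breadth-first pass: children of the root keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, tree):
    """Return (start_index, pattern) for every exact occurrence in text."""
    goto, fail, out = tree
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))
    return hits


# Made-up seeds and sequence, for illustration only.
tree = build_keyword_tree(["GAGG", "AGGU", "GGUAG"])
hits = search("UGAGGUAGUAGG", tree)
print(hits)
```

The scan visits each character of the sequence once, regardless of how many seeds are in the tree, which is the property that makes a keyword tree attractive when many known microRNA fragments must be searched at the same time.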
