A Multi-Pattern Recognition Method Based on YK Coding and Its Application to DNA Classification
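Abstract
A problem arises when classifying research objects. Because an object may be multifunctional, it can belong to several categories at once. How can we decide whether an object with a given set of features belongs to several categories simultaneously? This is the so-called "many-to-many associative pattern recognition" problem, or "multi-pattern recognition" for short. Multi-pattern recognition is a technical problem in information recognition that urgently needs a solution; for objects that are multifunctional, or that are information-rich but whose function is unclear, it should greatly improve the final accuracy of classification results. As research material, this study uses rice DNA sequences from three gene families taken from GenBank at NCBI (the US National Center for Biotechnology Information): indole-3-acetic acid-amido synthetase, F-box protein, and RING-H2 finger protein.
     The main content of this thesis covers the following three points.
     1. For the objects to be classified, the YK coding method from information theory is used to build a dictionary of feature strings from each object. Feature data are extracted from this dictionary: the number of times each feature string recurs is counted, and its recurrence frequency is expressed as a percentage. These percentages are the main data of this study.
     2. The implementation of multi-pattern recognition is studied, and two methods are designed. One is a distance criterion; the other uses a neural network.
     3. The above work is applied to DNA sequence classification. The main content is as follows.
     (1) The YK coding method is used to build a dictionary of feature strings for each DNA sequence. The percentage recurrence frequencies of the feature strings serve as the main data, and DNA sequences are discriminated by distance to achieve multi-pattern recognition of DNA sequences.
     (2) The percentages of the ten most frequently recurring feature strings in the YK-coding dictionary of each DNA sequence are taken as the input vector of a neural network. Combined with the known target outputs, a supervised BP (back-propagation) neural network is built to achieve multi-pattern recognition of DNA sequences.
     (3) From a biological point of view, the DNA sequences are also transcribed and translated into protein sequences. According to the polarity of the amino acid side chains, the 20 amino acids and the stop signal in the protein sequence are divided into 5 classes. The percentage content of each of the 5 classes is taken as the input vector. Combined with the known target outputs, a supervised BP neural network is built to achieve multi-pattern recognition of DNA sequences.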
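     As a rough illustration of items (1) and (3), the sketch below computes a 5-class side-chain composition vector from a coding DNA sequence and then applies a simple distance rule that can return more than one family. It is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not list which amino acids fall into which of the 5 classes, so a common textbook grouping is assumed; the Euclidean metric, the `threshold` parameter, and the family centroids are hypothetical; and neither the YK dictionary construction nor the BP network is shown.

```python
from itertools import product
from math import dist

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# ASSUMPTION: a common textbook grouping by side-chain polarity
# (nonpolar, polar uncharged, acidic, basic, stop signal).
# The abstract does not state the exact 5 classes it used.
CLASSES = ["AVLIPFWMG", "STCYNQ", "DE", "KRH", "*"]


def translate(dna):
    """Translate a coding DNA sequence codon by codon; '*' marks a stop codon.

    Assumes the sequence contains only A, C, G, T and starts in frame.
    """
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def polarity_features(dna):
    """Percentage of translated residues falling in each of the 5 classes."""
    protein = translate(dna)
    return [
        100.0 * sum(protein.count(a) for a in group) / len(protein)
        for group in CLASSES
    ]


def assign_families(features, centroids, threshold):
    """Return every family whose centroid lies within `threshold` of `features`.

    HYPOTHETICAL decision rule: Euclidean distance with a fixed cutoff, so a
    sequence may be assigned to zero, one, or several families.
    """
    return sorted(
        name for name, c in centroids.items() if dist(features, c) <= threshold
    )
```

     For example, `polarity_features("ATGGCCGATAAATAA")` translates the sequence to `MADK*` and yields one percentage per class. Passing that vector to `assign_families` together with one centroid per gene family returns all families within the cutoff, which is what makes the rule many-to-many rather than a single nearest-class decision.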
