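Research on Protein Purification Methods Based on an Improved ID3 Algorithm

English title: The Methods Research of Protein Purification Based on the Improved ID3
Author: Zhao Tongxin (赵桐锌)
Degree level: Master's
Discipline: Measurement Technology and Instruments
Keywords: Data mining; ID3; Protein purification; Discretization; Decision tree
Degree year: 2011
Supervisor: Liu Wenqi (刘文琦)
Discipline code: 080402
Degree-granting institution: Dalian University of Technology
Submission date: 2011-11-09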

Abstract

Biotechnology is developing rapidly, and the design of protein production processes is both a popular and an important research topic in the life sciences. Purification is a key step in protein production, and protein separation and purification techniques are widely used in production and related research. Traditionally, a purification method is settled through repeated trial-and-error experiments guided by operator experience, which is costly and slow.
Because a protein's properties are related to the purification methods that suit it, this thesis introduces data mining into the selection of purification methods. Decision trees reflect the characteristics of the data directly, are easy to interpret, classify and predict well, yield decision rules readily, and handle non-numeric data well. The ID3 decision tree algorithm is used to classify a historical protein data set and uncover hidden relationships between protein properties and purification methods. ID3 is grounded in information theory and uses information entropy and information gain as its criteria for inductive classification. However, ID3 cannot handle continuous-valued attributes directly and is biased toward attributes with many values, so it cannot be applied as-is to choosing purification methods. This thesis proposes an improved ID3 algorithm (RS-ID3) that discretizes the data using rough set theory and measures attribute importance with the information gain ratio, overcoming these limitations of traditional ID3. In classification experiments on data sets from the UCI Machine Learning Repository, RS-ID3 is compared with C4.5, another improved version of ID3, and shows better classification performance. Finally, RS-ID3 is applied to the exploration of protein purification processes, where a case study also gives good results, supporting the selection of purification methods.
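The multi-value bias mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's RS-ID3 implementation. It only computes the standard information entropy, information gain, and information gain ratio on a tiny made-up data set. The attribute names (a sample ID and a coarse "pI range"), the values, and the class labels are all hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(rows, labels, attr):
    """Group the class labels by the value each row takes for attribute index attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = len(labels)
    groups = partition(rows, labels, attr)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """Information gain divided by the split information of attr."""
    total = len(labels)
    groups = partition(rows, labels, attr)
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups.values())
    if split_info == 0:  # attribute takes a single value, so it cannot split the data
        return 0.0
    return information_gain(rows, labels, attr) / split_info


# Hypothetical toy data: column 0 is a unique sample ID, column 1 a discretized property.
rows = [("s1", "low"), ("s2", "low"), ("s3", "low"),
        ("s4", "high"), ("s5", "high"), ("s6", "high")]
# Hypothetical class labels standing in for purification methods.
labels = ["ion_exchange", "ion_exchange", "ion_exchange",
          "affinity", "affinity", "ion_exchange"]

for attr, name in enumerate(["sample_id", "pI_range"]):
    print(name, information_gain(rows, labels, attr), gain_ratio(rows, labels, attr))
```

On this toy data, plain information gain ranks the unique sample ID above the property column, even though an ID carries no predictive value for new samples. The gain ratio reverses the ranking, because the ID's large split information penalizes it. This is the general motivation for using the gain ratio. How RS-ID3 combines it with rough set discretization is not specified in the abstract and is not modeled here.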

