Protein Relation Extraction System Based on SVM and Link Analysis

English title: Protein-Protein Interaction Extraction System Based on SVM and Link Parse
Author: Wu Baodong (吴宝栋)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Interaction Extraction; Support Vector Machine (SVM); Link Grammar Parse; Anaphora Resolution; Entity Recognition
Degree year: 2007
Advisor: Lin Hongfei (林鸿飞)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology (大连理工大学)
Thesis submission date: 2007-12-01

Abstract

The biomedical literature is growing rapidly, and researchers face an ever larger and more varied body of biomedical information. This places a heavy burden on them and makes it hard to locate the information they need quickly. To improve their efficiency, automated tools are urgently needed that can find the required information in this mass of literature quickly and accurately. Research on extracting protein (gene) interactions from the biomedical literature arose against this background. Such extraction also has high practical value: it supports building protein knowledge networks, predicting protein relations, and developing new drugs.
     This thesis builds a system for extracting protein-protein interactions from the biomedical literature. The system combines a Support Vector Machine (SVM) with link parsing. It first applies anaphora resolution to replace the third-person pronouns in the text, then uses a Conditional Random Fields model to recognize protein names, and then uses a Link Grammar parser to analyze the link paths of the sentences. Finally, it extracts four categories of features (term features, keyword features, link features, and word-pair features) to construct feature vectors, and an SVM classifier extracts the protein (gene) interactions.
     The thesis first introduces the background and related work on protein-protein interaction extraction, then presents in detail the core method of the experimental system: Statistical Learning Theory and the SVM. It next describes the other methods the system uses, namely anaphora resolution, named entity recognition, Link Grammar and the Link Grammar parser, link path extraction, and feature selection for relation extraction. The final part presents the implementation and performance evaluation of the system.
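As a rough illustration of the final step described in the abstract, the sketch below builds a feature dictionary from the four feature categories for one candidate protein pair and scores it with a linear SVM decision function. It is a minimal sketch under stated assumptions, not the thesis's implementation: the `Candidate` type, the `INTERACTION_KEYWORDS` list, the feature names, the definition of the word-pair feature as the two protein names, the example link labels, and the weights are all hypothetical, and the link path is assumed to be supplied by an external Link Grammar parser.

```python
from dataclasses import dataclass

# Hypothetical keyword lexicon; the abstract does not list the actual keywords.
INTERACTION_KEYWORDS = {"binds", "interacts", "activates", "inhibits"}


@dataclass
class Candidate:
    tokens: list[str]     # tokenized sentence, protein names already recognized
    p1: int               # token index of the first protein
    p2: int               # token index of the second protein
    link_path: list[str]  # link labels between the proteins (assumed parser output)


def extract_features(c: Candidate) -> dict[str, float]:
    """Build a sparse feature dictionary from four feature categories."""
    feats: dict[str, float] = {}
    lo, hi = sorted((c.p1, c.p2))
    between = [t.lower() for t in c.tokens[lo + 1:hi]]

    # 1. Term features: words between the two protein mentions.
    for t in between:
        feats[f"term={t}"] = 1.0

    # 2. Keyword features: interaction words from the (hypothetical) lexicon.
    for t in between:
        if t in INTERACTION_KEYWORDS:
            feats[f"keyword={t}"] = 1.0

    # 3. Link features: labels on the link path, plus the path length.
    for label in c.link_path:
        feats[f"link={label}"] = 1.0
    feats["link_path_len"] = float(len(c.link_path))

    # 4. Word-pair feature: assumed here to be the pair of protein names.
    feats[f"pair={c.tokens[lo].lower()}|{c.tokens[hi].lower()}"] = 1.0
    return feats


def linear_svm_decision(feats: dict[str, float],
                        weights: dict[str, float],
                        bias: float) -> tuple[float, bool]:
    """Return the linear SVM score w.x + b and whether it is positive."""
    score = bias + sum(weights.get(name, 0.0) * v for name, v in feats.items())
    return score, score > 0


# Illustrative sentence and link labels; not taken from the thesis.
candidate = Candidate(
    tokens=["RAD51", "interacts", "with", "BRCA2"],
    p1=0,
    p2=3,
    link_path=["S", "MV", "J"],
)
features = extract_features(candidate)

# Hypothetical weights standing in for a trained model.
weights = {"keyword=interacts": 1.0, "link=S": 0.5}
score, is_interaction = linear_svm_decision(features, weights, bias=-1.0)
```

In a real system the weights and bias would come from training an SVM on labeled protein pairs, and a non-linear kernel could replace the plain dot product used here.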
