Research on Text Mining Techniques for the Biomedical Domain
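Abstract
The volume of published biomedical literature is growing exponentially and has become a vast repository of knowledge. Because the great majority of this literature exists as text, there is an urgent need to mine it quickly and effectively in order to extract biomedical knowledge. Biomedical text mining relies mainly on natural language processing techniques and machine learning methods to find the required information in the massive biomedical literature and to discover hidden biomedical knowledge.
     This dissertation first introduces text mining techniques and their applications, then reviews the current state of research on text mining in the biomedical domain and the author's own work in this field.
     This dissertation proposes a bio-entity recognition method based on an improved edit distance algorithm. It is a dictionary-based method: the improved edit distance algorithm raises recall, and performance is further improved by linguistic knowledge such as POS expansion and the use of contextual cues, and by rules such as prefix and suffix word expansion and the merging of adjacent entities. Experiments on the JNLPBA2004 evaluation corpus show that it far outperforms the dictionary method based on exact string matching (F-scores of 68.48% and 47.7%, respectively).
     The bio-entity recognition performance of currently popular machine learning methods still has considerable room for improvement. This dissertation proposes a bio-entity recognition method based on conditional random fields (CRFs) and contextual cues. The method first selects suitable features and uses CRFs to recognize bio-entities. It also makes full use of linguistic knowledge by exploiting three kinds of heuristic syntactic structures present in the context (contextual cues): bracket pairs, heuristic syntactic structures and interaction-word cues, using the entity and entity-class information they provide to further improve recognition performance. Experimental results on the JNLPBA2004 and BioCreative2004 task 1A evaluation corpora show that introducing contextual cues improves performance by about three percentage points.
     Extracting protein-protein interaction relations from the biomedical literature is of great significance for building protein knowledge networks, predicting protein relations and developing new drugs. Systems based on natural language processing extract relations by analyzing syntactic structure and can achieve relatively high precision. This dissertation proposes a method for extracting protein (gene) interaction relations based on Link Grammar parsing. The method uses the bio-entity recognition method that combines CRFs with contextual cues, then divides sentences into syntactic constituents by Link Grammar parsing and extracts protein (gene) interaction relations from the constituents and their valid combinations. Experimental results show that both the recall and the F-score of the method are higher than those of other systems evaluated on the same test corpus.
     Methods based on machine learning and statistics can achieve relatively high recall. This dissertation proposes a protein-protein interaction extraction method based on support vector machines (SVM). In addition to word features, keyword features, entity distance features and link features, the method exploits the relatively high precision of Link Grammar parsing by introducing the Link Grammar extraction result as a feature. At the cost of a small loss in recall, this raises precision considerably and thus improves the overall F-score. Experimental results show that the method has a clear advantage in recall over other systems evaluated on the same test corpus, and that its F-score is also higher.
     The massive biomedical literature offers an unprecedented opportunity to apply text mining techniques to the discovery of implicit medical knowledge. This dissertation proposes a hypothesis generation method for the biomedical domain. The method extracts related concepts from both the MeSH medical subject headings of literature records and the medical concepts in their free text, remedying the limitation of current research, which uses only one of the two. It also performs concept-based query expansion on the basis of the UMLS Knowledge Sources, which raises the recall of relevant records, and applies semantic filtering to reduce the search space. An experiment verifying the association between fish oil and Raynaud's disease shows that the method improves the retrieval of related concepts.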
The biomedical literature is growing at an astounding pace, and these vast collections of publications offer an excellent opportunity to discover hidden biomedical knowledge by applying text mining technologies. Text mining of the biomedical literature, which relies mainly on natural language processing and machine learning, helps biomedical researchers efficiently find the information they need and uncover hidden biomedical knowledge in the huge body of literature.
     This dissertation first introduces text mining technologies and their applications in the biomedical field. It then presents the author's work in this field.
     A dictionary-based bio-entity name recognition approach using an improved edit distance algorithm is presented. The approach expands the dictionary with an algorithm that identifies abbreviation definitions and improves recall through the improved edit distance algorithm. Language knowledge-based methods, including POS (part-of-speech) expansion and the exploitation of contextual cues, and rule-based methods, including first-keyword and post-keyword expansion and the merging of adjacent entity names, are then applied to further improve performance. Experimental results on JNLPBA2004 show that the method achieves a much higher F-score (68.48%) than the exact-matching baseline (47.7%).
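     To make the idea of approximate dictionary matching concrete, the following is a minimal sketch that uses the standard Levenshtein edit distance, not the improved variant developed in the dissertation, whose details the abstract does not give. The function names, the toy dictionary and the `max_ratio` threshold are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(candidate, dictionary, max_ratio=0.2):
    """Return the closest dictionary entry whose length-normalized
    distance is at most max_ratio (an assumed threshold), else None."""
    best, best_ratio = None, None
    cand = candidate.lower()
    for entry in dictionary:
        d = edit_distance(cand, entry.lower())
        ratio = d / max(len(cand), len(entry), 1)
        if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
            best, best_ratio = entry, ratio
    return best


# Toy dictionary; a real system would load a large protein/gene lexicon.
names = ["NF-kappaB", "interleukin-2"]
print(fuzzy_lookup("NF-kappa B", names))  # spelling variant still matches
print(fuzzy_lookup("insulin", names))     # no entry is close enough
```

     An exact-matching dictionary would miss the variant "NF-kappa B", which is the kind of recall loss that approximate matching is meant to recover.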
     Machine learning techniques are currently the most popular methods, but their performance still leaves much room for improvement. This dissertation presents a conditional random field-based bio-entity name recognition approach and studies how to improve its performance by exploiting contextual cues, including bracket pairs, heuristic syntactic structures and interaction-word cues. Experimental results on both the JNLPBA2004 and BioCreative2004 task 1A datasets show that these cues improve the recognition performance of conditional random fields by about 3 percentage points in F-score.
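     One plausible reading of the bracket pair cue is that a name and the expression in brackets right after it usually denote the same entity, so a label found on one side can be copied to the other. The sketch below post-processes tagger output under that assumption. The rule, the function name and the label scheme ("O" for non-entity) are hypothetical and are not taken from the dissertation.

```python
def apply_bracket_cue(tokens, labels):
    """Copy the label of the token before '(' onto the tokens inside the
    bracket when none of them has been labelled yet (assumed rule)."""
    out = list(labels)
    i = 0
    while i < len(tokens):
        if tokens[i] == "(" and i > 0 and out[i - 1] != "O":
            try:
                close = tokens.index(")", i + 1)
            except ValueError:
                break  # unbalanced bracket: leave remaining labels untouched
            inner = range(i + 1, close)
            if inner and all(out[k] == "O" for k in inner):
                for k in inner:
                    out[k] = out[i - 1]
            i = close
        i += 1
    return out


tokens = ["interleukin", "2", "(", "IL-2", ")", "binds", "(", "weakly", ")"]
labels = ["protein", "protein", "O", "O", "O", "O", "O", "O", "O"]
print(apply_bracket_cue(tokens, labels))
```

     Here "IL-2" inherits the label "protein" from "interleukin 2", while "weakly" stays unlabelled because the token before its bracket is not an entity.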
     Automatically extracting protein-protein interaction information from the biomedical literature can help to build protein relation networks, predict protein function and design new drugs. Protein-protein interaction extraction methods based on natural language processing can usually achieve relatively good precision. This dissertation presents a Link Grammar-based protein-protein interaction extraction approach. The approach applies a conditional random fields model to tag protein names in biomedical text, then uses a Link Grammar parser to identify the syntactic roles in sentences, and finally extracts complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations. Experimental comparison with two other state-of-the-art extraction systems indicates that this approach achieves better performance.
     Machine learning and statistical methods can usually achieve higher recall. This dissertation also presents an SVM-based protein-protein interaction extraction approach. The approach uses four kinds of features: word features, keyword features, an entity distance feature and a link path feature. In addition, a Link Grammar extraction result feature is introduced to improve precision. This feature raises precision considerably with little loss of recall. Experimental comparison with other systems indicates that this approach achieves much better recall and that its F-score is also higher than theirs.
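     The feature set described above can be pictured as a sparse feature dictionary built for each candidate protein pair in a sentence. The sketch below is an illustration under stated assumptions: the keyword list, the feature names and the function `pair_features` are hypothetical, the link path feature is omitted because it requires a Link Grammar parse, and the parser's verdict is passed in as a plain boolean rather than computed.

```python
# Hypothetical interaction keyword list; the dissertation's list is not given.
INTERACTION_KEYWORDS = {"interacts", "binds", "activates", "inhibits",
                        "phosphorylates"}


def pair_features(tokens, i, j, parser_says_interaction=False):
    """Build a sparse feature dict for the candidate pair at positions i < j."""
    if not 0 <= i < j < len(tokens):
        raise ValueError("expected 0 <= i < j < len(tokens)")
    between = [t.lower() for t in tokens[i + 1:j]]
    feats = {}
    for w in between:
        feats["word=" + w] = 1.0              # word feature
        if w in INTERACTION_KEYWORDS:
            feats["keyword=" + w] = 1.0       # keyword feature
    feats["entity_distance"] = float(j - i - 1)  # tokens between the entities
    # Result of the rule-based Link Grammar extractor used as one more feature.
    feats["lg_extracted"] = 1.0 if parser_says_interaction else 0.0
    return feats


tokens = "RAD51 binds directly to BRCA2".split()
print(pair_features(tokens, 0, 4, parser_says_interaction=True))
```

     A dictionary of this shape can be vectorized and passed to any SVM implementation. Feeding the parser's output in as a feature lets the classifier weigh a high-precision signal against the broader lexical evidence.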
     Vast collections of biomedical publications offer an excellent opportunity for the automatic discovery of hidden knowledge. This dissertation describes the content and development of research on hidden knowledge discovery in the biomedical literature and presents a biomedical hidden knowledge discovery approach. The approach extracts related biomedical concepts from both MeSH (Medical Subject Headings) terms and free text (title and abstract), and extracts them more effectively than when only one of the two sources is used. In addition, using UMLS biomedical resources, the approach performs query expansion and thereby improves the recall of relevant records. It also greatly reduces the search space through a semantic filter. An experiment on fish oils and Raynaud's disease shows the effectiveness of this approach.
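     Hidden knowledge discovery of this kind is commonly framed as finding concepts that are linked to a starting concept only through shared intermediate concepts. The sketch below illustrates that general idea together with a semantic filter on a toy set of records. It is not the dissertation's algorithm: the scoring, the function name and the semantic type labels are assumptions, and the records are invented for illustration, not real MEDLINE data.

```python
from collections import Counter


def open_discovery(records, start, semantic_type, allowed_types):
    """Rank concepts that never co-occur with `start` but share
    intermediate concepts with it. Each record is a set of concepts."""
    a_records = [r for r in records if start in r]
    # Intermediate concepts: everything that co-occurs with the start concept.
    b_concepts = set().union(*a_records) - {start} if a_records else set()
    direct = b_concepts | {start}
    scores = Counter()
    for r in records:
        if start in r:
            continue  # only records that do not mention the start concept
        links = r & b_concepts
        if not links:
            continue
        for c in r - direct:
            if semantic_type.get(c) in allowed_types:  # semantic filter
                scores[c] += len(links)
    return scores.most_common()


# Toy records standing in for the concepts extracted from each citation.
records = [
    {"Raynaud disease", "blood viscosity"},
    {"Raynaud disease", "platelet aggregation"},
    {"fish oils", "blood viscosity"},
    {"fish oils", "platelet aggregation"},
    {"aspirin", "platelet aggregation"},
    {"questionnaire", "blood viscosity"},
]
types = {"fish oils": "substance", "aspirin": "substance",
         "questionnaire": "method"}
print(open_discovery(records, "Raynaud disease", types, {"substance"}))
```

     The filter drops "questionnaire" because its assumed type is not among the allowed ones, which shows how a semantic filter shrinks the candidate space.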
