A Method for Measuring Semantic Similarity Between GO Terms
Abstract
Research on similarity plays a key role in many fields. It covers mainly structural similarity and semantic similarity. Structural similarity has received more attention in the past, but in recent years semantic similarity has attracted growing interest.
     For historical reasons, the sources of biological data are highly heterogeneous. To reduce or eliminate confusion among concepts and terms, the Gene Ontology Consortium developed a large semantic dictionary for biological data, the Gene Ontology (GO). One important application of GO is measuring the semantic similarity between GO terms. It is generally assumed that if two gene products have similar functions, their gene expression is similar and the GO terms that annotate them are similar as well. Thus, once the similarity of pairs of GO terms is known, the similarity of the expression of two genes can be estimated approximately, and from it the degree of functional similarity between the two gene products. Measuring semantic similarity between GO terms is therefore an important approach to resolving semantic heterogeneity in biological data integration.
     This paper first presents background on GO and on research into semantic similarity. It then analyzes several commonly used methods for measuring semantic similarity between GO terms and proposes an improvement to the most widely used of them: a method that computes the semantic similarity between GO terms from semantic subgraphs. The algorithm is studied on a small part of the GO graph as an example. Finally, the method is summarized and its broader applications are discussed.
     The proposed method combines information content-based and concept distance-based approaches, which can further improve the accuracy of semantic similarity measurement. If it can be applied to the full GO database, it should allow functionally similar or related proteins to be found more accurately, laying a good foundation for related research and applications in bioinformatics.
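The two families of measures named in the abstract can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions: the term names, the graph, and the annotation counts are all hypothetical, and the functions show a standard information content measure (Resnik's, the information content of the most informative common ancestor) and a plain edge-counting distance. It is not the subgraph-based method proposed in the paper, whose details the abstract does not give.

```python
import math
from collections import deque

# Toy DAG: child -> list of parents (hypothetical names, not real GO IDs).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 1, "B": 2, "C": 3, "D": 2}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    queue = deque([term])
    while queue:
        for parent in PARENTS[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def term_probability(term):
    """Share of annotations on the term or on any of its descendants."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return covered / total


def information_content(term):
    """Information content: rarer terms carry more information."""
    return -math.log(term_probability(term))


def resnik_similarity(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def path_distance(t1, t2):
    """Smallest number of edges between two terms, ignoring direction."""
    neighbours = {t: set(ps) for t, ps in PARENTS.items()}
    for child, ps in PARENTS.items():
        for p in ps:
            neighbours[p].add(child)
    dist = {t1: 0}
    queue = deque([t1])
    while queue:
        cur = queue.popleft()
        if cur == t2:
            return dist[cur]
        for nxt in neighbours[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("terms are not connected")
```

In this toy graph, terms C and D share the ancestor A, so they score higher on the information content measure and are closer in edge count than C and B, whose only common ancestor is the root. A combined measure would need some way to weigh the two signals against each other; how the paper does this is not stated in the abstract.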
