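A Research for Semantic Relation Automatic Extraction among Named Entities in Chinese Professional Domain

Chinese title: 汉语专业领域命名实体语义关系自动抽取研究
Author: Zhao Junzhe (赵君喆)
Thesis level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 信息抽取 ; 命名实体对 ; 语义关系
English keywords: Information extraction ; Named entity pair ; Semantic relation
Degree year: 2007
Advisor: He Tingting (何婷婷)
Discipline code: 081202
Degree-granting institution: Central China Normal University (华中师范大学)
Thesis submission date: 2007-05-01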

Abstract

We live in an era of information explosion, and the volume of Chinese-language information on the Internet is growing rapidly. Using information extraction technology to automatically find the information users need in this vast body of Chinese text has therefore become crucial. Semantic relation extraction among named entities is one of the main tasks in information extraction, and in recent years it has become an active topic in natural language processing research in China.
     Current research on Chinese named entity relation extraction relies mainly on supervised or weakly supervised methods, and most of it targets general-domain corpora. These methods are time-consuming and laborious: they require annotating training corpora, writing relation extraction rules, and selecting initial relation seeds. Moreover, relation extraction methods suited to general-domain corpora have difficulty meeting the needs of some professional domains. This thesis therefore proposes an unsupervised scheme for extracting semantic relations among named entities in professional-domain corpora and implements it as a system. It also explores using the system's extraction results to construct relation templates and relation seeds.
     To address the characteristics of professional-domain corpora, the vector space model (VSM) is improved and optimized with language resource tools, which resolves the problem of blurred features in such corpora. Based on the distribution of latent relation information, a method is designed for constructing an entity-relation network from a professional-domain corpus. The community property of networks, from complex network theory, is then used to discover relation categories automatically. Finally, the importance of each word in its context is analyzed, and the word with the highest importance weight is taken as the descriptor of the relation.
     The system was tested on a professional-domain corpus and evaluated with a standard, authoritative evaluation method; a supervised relation extraction system was also built to verify the relations obtained by the experimental system. The results indicate that, on the professional-domain corpus, the system discovers almost all of the known relation types as well as some previously unknown ones, and that without supervision it obtains descriptions of the relations between named entities quickly and fairly accurately.
     The experiments confirm that the system performs well on professional-domain corpora without supervision, and that the results of unsupervised relation extraction can assist a supervised relation extraction system. In addition, the relation descriptions extracted by the system can provide a basis for building relation ontologies in professional domains.
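
The pipeline outlined in the abstract (weighted context vectors, an entity-relation network, community discovery, and description by the highest-weight word) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and everything specific in it is an assumption: network nodes are taken to be entity pairs, plain smoothed TF-IDF stands in for the improved VSM, edges come from a cosine-similarity threshold, communities are found by Girvan-Newman-style removal of high-betweenness edges, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict, deque

def tfidf_vectors(contexts):
    """Assumed weighting: smoothed TF-IDF over each entity pair's context words."""
    n = len(contexts)
    df = Counter(w for words in contexts.values() for w in set(words))
    return {pair: {w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0)
                   for w, tf in Counter(words).items()}
            for pair, words in contexts.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def build_network(vectors, threshold):
    """Link two entity pairs when their context vectors are similar enough."""
    adj = {p: set() for p in vectors}
    nodes = list(vectors)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def components(adj):
    """Connected components, found by breadth-first search."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v] - comp:
                comp.add(w)
                queue.append(w)
        seen |= comp
        found.append(comp)
    return found

def edge_betweenness(adj):
    """Brandes-style shortest-path betweenness for unweighted, undirected edges."""
    score = defaultdict(float)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        sigma[s], dist, queue = 1.0, {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score

def split_communities(adj, k):
    """Remove the highest-betweenness edge until at least k groups exist."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        scores = edge_betweenness(adj)
        if not scores:  # no edges left to remove
            break
        a, b = max(scores, key=scores.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)

def describe(community, vectors):
    """Pick the context word with the highest summed weight as the description."""
    total = Counter()
    for pair in community:
        total.update(vectors[pair])
    return max(total, key=total.get)

# Toy, invented data: entity pair -> words observed between the two entities.
contexts = {
    ("aspirin", "headache"): ["treats", "relieves", "treats"],
    ("ibuprofen", "fever"): ["treats", "reduces"],
    ("insulin", "diabetes"): ["treats", "controls"],
    ("virus", "flu"): ["causes", "triggers"],
    ("smoking", "cancer"): ["causes", "induces"],
    ("bacteria", "infection"): ["causes", "leads"],
}
vectors = tfidf_vectors(contexts)
network = build_network(vectors, threshold=0.2)
for group in split_communities(network, k=2):
    print(describe(group, vectors), sorted(group))
```

On this toy input the two communities are described by "treats" and "causes". The threshold and the target number of communities `k` are illustrative knobs only; the abstract does not say how the thesis sets the granularity of relation categories.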
