Comparative Study of Automatic Entity Relation Extraction

Original title: 实体关系自动抽取技术的比较研究
Author: Ning Haiyan (宁海燕)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: entity relation extraction ; domain-term extraction ; Bootstrapping ; clustering ; DCM-combination
Degree year: 2010
Supervisor: Wang Xiaolong (王晓龙)
Discipline code: 081201
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2010-06-01

Abstract

With the continuing development of computer and network technology, vast amounts of information have become available as electronic documents. Extracting useful information from this natural-language text is attracting growing attention. Information extraction technology emerged in response, and relation extraction is one of its subtasks.
     Specific factual information in text is referred to as an entity, and determining the relationships between entities is called entity relation extraction. Entity relation extraction plays an important role in building ontologies and improving information retrieval. This thesis studies and addresses several problems in entity relation extraction:
     First, the thesis extracts domain-specific terms, that is, words that carry important semantic relations but fall outside traditional named entities. Because evaluation data for domain terms are not standardized and terms are hard to judge, several mainstream statistical methods for automatic Chinese term extraction are compared and analyzed in detail. Both an objective evaluation based on a professional computer dictionary and a subjective evaluation based on human judgment are used, with precision, recall, and F-measure as metrics. The thesis also proposes a term extraction method based on the weights of a linear support vector machine; experimental results show that it extracts domain-specific terms effectively.
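The abstract does not spell out how the SVM weights are used, so the following is only a minimal sketch of one plausible reading: train a linear SVM to separate domain text from general text, then rank words by their learned weights. The trainer (plain batch subgradient descent on the hinge loss), the function name `train_linear_svm`, the hyperparameters, and the toy documents are all assumptions made for illustration, not the thesis's setup.

```python
from collections import defaultdict


def train_linear_svm(docs, labels, epochs=50, lr=0.1, lam=0.01):
    """Batch subgradient descent on the L2-regularised hinge loss.

    docs: list of token lists; labels: +1 (domain text) or -1 (general text).
    Returns a dict mapping each word to its learned weight.
    """
    w = defaultdict(float)
    n = len(docs)
    for _ in range(epochs):
        grad = defaultdict(float)
        for tokens, y in zip(docs, labels):
            score = sum(w[t] for t in tokens)
            if y * score < 1:  # margin violated, so the hinge loss is active
                for t in tokens:
                    grad[t] -= y / n
        for t in set(w) | set(grad):
            w[t] -= lr * (lam * w[t] + grad[t])
    return dict(w)


# Hypothetical toy corpus: two computing documents, two general documents.
docs = [
    ["kernel", "schedules", "process"],
    ["compiler", "optimizes", "kernel", "code"],
    ["weather", "is", "sunny"],
    ["price", "of", "weather", "forecast"],
]
labels = [1, 1, -1, -1]

weights = train_linear_svm(docs, labels)
# Words with the largest positive weights are the term candidates.
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this toy run, words that occur only in domain documents receive positive weights and words that occur only in general documents receive negative ones, so the head of `ranked` serves as the candidate term list. A real system would first generate multi-word candidates and would use a tested SVM library rather than this hand-written trainer.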
     Second, to meet the needs of different applications, the thesis uses a unified corpus to compare feature-based supervised, semi-supervised, and unsupervised entity relation extraction methods.
     Previous work on supervised entity relation extraction did not consider how individual features affect no-relation, the case in which two entities are unrelated. The thesis therefore compares in detail how general features affect both real relations and no-relation: the words around an entity, entity type and subtype, entity position, and dependency parsing of the entity head word and content. It also proposes a new feature, the position of a characteristic word, and experiments show that this feature effectively improves the precision of entity relation extraction.
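As a rough illustration of what such a feature set might look like, the sketch below builds a feature dictionary for one candidate entity pair. The feature names, the context window, the entity representation (token offsets with an exclusive end), and the encoding of the characteristic word's position as before, between, or after the two entities are all assumptions; the abstract does not define them. Entity subtypes and dependency features are omitted.

```python
def extract_features(tokens, e1, e2, trigger_words, window=2):
    """Build features for an entity pair.

    e1, e2: dicts with 'start', 'end' (token offsets, end exclusive) and 'type'.
    trigger_words: set of characteristic words (hypothetical word list).
    """
    first, second = sorted((e1, e2), key=lambda e: e["start"])
    feats = {
        "e1_type": e1["type"],
        "e2_type": e2["type"],
        "order": "e1_first" if first is e1 else "e2_first",
        "distance": second["start"] - first["end"],
    }
    # Context words before, between, and after the two entities.
    for i in range(max(0, first["start"] - window), first["start"]):
        feats["before_" + tokens[i]] = 1
    for i in range(first["end"], second["start"]):
        feats["between_" + tokens[i]] = 1
    for i in range(second["end"], min(len(tokens), second["end"] + window)):
        feats["after_" + tokens[i]] = 1
    # Position of each characteristic word relative to the entity pair.
    for i, tok in enumerate(tokens):
        if tok not in trigger_words:
            continue
        if i < first["start"]:
            pos = "before"
        elif first["end"] <= i < second["start"]:
            pos = "between"
        elif i >= second["end"]:
            pos = "after"
        else:
            continue  # the word lies inside an entity mention
        feats["trigger_" + tok + "_" + pos] = 1
    return feats


# Hypothetical example sentence and annotations.
tokens = ["Wang", "works", "for", "HIT", "in", "Harbin"]
person = {"start": 0, "end": 1, "type": "PER"}
org = {"start": 3, "end": 4, "type": "ORG"}
feats = extract_features(tokens, person, org, trigger_words={"works"})
```

The resulting dictionary can be fed to any feature-based classifier, with no-relation treated as one more class alongside the real relation types.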
     For semi-supervised extraction, the thesis runs comparison experiments with Bootstrapping: the effect of entity features and seed-set size on extraction performance, and the performance of semi-supervised versus supervised extraction under the same conditions. The results indicate that semi-supervised entity relation extraction can improve extraction precision.
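The general shape of a Bootstrapping loop can be sketched as self-training: start from a small labeled seed set, label the unlabeled instances the current model is confident about, add them to the labeled set, and repeat. The version below is an assumption-laden toy: it uses a nearest-neighbour rule over Jaccard similarity in place of whatever classifier the thesis trains, and the threshold, feature sets, and relation labels are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap(seeds, unlabeled, threshold=0.5, max_rounds=10):
    """Self-training: repeatedly absorb confidently labeled instances.

    seeds: list of (feature_set, label); unlabeled: list of feature sets.
    Returns (labeled instances, instances that were never labeled).
    """
    labeled = list(seeds)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for feats in pool:
            # Label by the most similar labeled instance.
            sim, label = max((jaccard(feats, f), lab) for f, lab in labeled)
            if sim >= threshold:
                accepted.append((feats, label))
            else:
                remaining.append(feats)
        if not accepted:
            break  # nothing new was learned in this round
        labeled.extend(accepted)
        pool = remaining
    return labeled, pool


# Hypothetical seed set and unlabeled instances.
seeds = [
    (frozenset({"PER", "ORG", "between_works", "between_for"}), "EMPLOYMENT"),
    (frozenset({"ORG", "LOC", "between_located", "between_in"}), "LOCATED"),
]
unlabeled = [
    frozenset({"PER", "ORG", "between_works", "between_at"}),
    frozenset({"PER", "ORG", "between_at", "between_employed"}),
    frozenset({"LOC", "between_near"}),
]
labeled, leftover = bootstrap(seeds, unlabeled)
labels = [lab for _, lab in labeled]
```

The second unlabeled instance is too far from both seeds to be labeled in the first round, but becomes reachable once the first instance has been absorbed. This is the effect that makes seed-set size matter, and also how early labeling errors can propagate.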
     Unsupervised entity relation extraction relies mainly on clustering, so the thesis studies how clustering algorithms and merging strategies affect extraction. Three clustering algorithms, namely K-means, Self-Organizing Map (SOM), and Affinity Propagation, and two merging strategies (DCM and Cosine) are compared. In the experiments, Affinity Propagation achieves the best precision, and SOM has the advantage in running time.
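To make the idea of a merging strategy concrete, here is a minimal sketch of a cosine-based merge: after an initial clustering, clusters whose centroids are sufficiently similar are greedily merged. It assumes the initial clusters already exist (in the thesis they would come from K-means, SOM, or Affinity Propagation), and the threshold, the sparse-vector representation, and the greedy order are illustrative assumptions. The DCM strategy is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for k, x in vec.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}


def merge_clusters(clusters, threshold=0.8):
    """Greedily merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        sim, i, j = max(
            ((cosine(cents[i], cents[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if sim < threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical initial clusters of context vectors for entity pairs.
initial = [
    [{"works": 1.0, "for": 1.0}],
    [{"works": 1.0, "for": 1.0, "at": 1.0}],
    [{"located": 1.0, "in": 1.0}],
]
merged = merge_clusters(initial)
```

Here the first two clusters share most of their context words and are merged, while the third stays separate, leaving two relation clusters.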

