Research on Bootstrapping-Based Weakly Supervised Chinese Semantic Relation Extraction
Abstract
Semantic relation extraction between named entities is an important subtask of information extraction. Although supervised learning approaches have achieved considerable success in this area, they rely on large-scale annotated corpora, whose manual construction is time-consuming and labor-intensive.
     This paper proposes a bootstrapping-based approach to weakly supervised Chinese semantic relation extraction. Given a small annotated dataset (the initial seed set) and a large unannotated dataset, the labeled training set is expanded iteratively from the seed set itself, so that good extraction results can be obtained even from a small amount of annotation. In particular, a stratified seed selection strategy based on hierarchical clustering is proposed: relation instances are first clustered into different groups, seeds are then drawn from each cluster to form the initial seed set, and a weakly supervised Chinese semantic relation extraction system is finally bootstrapped from that seed set.
     Experiments on major-type relation classification over the ACE RDC 2005 Chinese benchmark corpus show that, with the hierarchical-clustering-based stratified selection strategy, weakly supervised Chinese semantic relation extraction achieves an F-measure of 63.4, outperforming random selection (57.9) and sequential selection (52.4) by 5.5 and 11.0 points, respectively. This demonstrates that the proposed method significantly improves the performance of weakly supervised Chinese semantic relation extraction.
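To make the two components of the method concrete, the following is a minimal Python sketch of (a) stratified seed selection over a hierarchical clustering of relation instances and (b) a generic self-training-style bootstrapping loop. It assumes relation instances are already encoded as fixed-length feature vectors; the scikit-learn classifier, cluster count, batch size, and helper names (select_stratified_seeds, bootstrap) are illustrative assumptions, not the authors' actual implementation over the ACE RDC 2005 Chinese corpus.

```python
# Illustrative sketch only, not the authors' implementation.
# Assumes X_labeled / X_unlabeled are NumPy feature matrices for relation instances.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import SVC


def select_stratified_seeds(X, n_seeds, n_clusters=8, rng=np.random):
    """Cluster all candidate instances hierarchically, then draw seeds from
    every cluster roughly in proportion to its size (stratified selection)."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    seed_idx = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        k = max(1, int(round(n_seeds * len(members) / len(X))))
        seed_idx.extend(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.array(seed_idx)


def bootstrap(X_seed, y_seed, X_unlabeled, rounds=10, batch=50):
    """Iteratively enlarge the labeled set with the classifier's most
    confident predictions on the unlabeled pool (self-training bootstrap)."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool = X_unlabeled.copy()
    clf = SVC(probability=True)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(pool)
        conf = proba.max(axis=1)                 # confidence of the best label
        top = np.argsort(-conf)[:batch]          # most confident pool instances
        X_train = np.vstack([X_train, pool[top]])
        y_train = np.concatenate([y_train, clf.classes_[proba[top].argmax(axis=1)]])
        pool = np.delete(pool, top, axis=0)      # remove them from the pool
    return clf


# Hypothetical usage: pick seeds from the small annotated set, then bootstrap.
# seeds = select_stratified_seeds(X_labeled, n_seeds=100)
# model = bootstrap(X_labeled[seeds], y_labeled[seeds], X_unlabeled)
```

The key design point mirrored here is that seeds are spread across clusters of relation instances rather than drawn randomly or sequentially, so the initial labeled set covers the instance space more evenly before the bootstrap loop begins.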
