Chinese Entity Relation Extraction Based on SVM and TSVM
Abstract
Information extraction technology automatically converts unstructured text into structured text. It can stand alone as a system meeting a strong practical demand, and it also serves as an essential foundation for other applications such as information retrieval, text classification, and question answering. Entity relation extraction is a key step in information extraction and is becoming an increasingly active research topic. Work on Chinese entity relation extraction is still at an early stage, and a great deal remains to be done.
     Targeting the characteristics of Chinese entity relations, this thesis designs a set of features, including words, part-of-speech tags, entity attributes and mention information, positional overlap between entities, and concept information provided by HowNet. These features form a context feature vector for each entity pair, and an SVM classifier is then applied to extract Chinese entity relations. Using the ACE 2004 training corpus as experimental data, the system achieves good recognition performance. Based on graded experiments, the thesis also examines in detail how different feature sets and different numbers of training examples affect performance. The results show that tasks of different granularity should use feature-set combinations at different levels of abstraction: the part-of-speech feature set suits the relation detection task, the HowNet concept feature set suits the recognition of relation types and subtypes, the word feature set is the most basic one, and the entity-overlap feature set contributes most to extraction performance. Enlarging the training corpus improves recognition performance, so building a fairly large training corpus is necessary when using an SVM classifier; once the corpus reaches a certain size, however, further growth yields diminishing returns, and attention should then shift to feature-set construction.
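As an illustration of the feature design described above, the sketch below assembles symbolic context features for one candidate entity pair and maps them into a sparse one-hot index vector, the usual input representation for an SVM classifier. All function names and feature labels here are hypothetical; the thesis's actual feature extraction is not shown in this abstract.

```python
# Minimal sketch (hypothetical feature names and extractors): turn one
# candidate entity pair into symbolic features, then into the sorted index
# list of a sparse one-hot vector.

def extract_features(pair):
    """Collect symbolic context features for a candidate entity pair."""
    feats = []
    feats += ["W=" + w for w in pair["words"]]         # word features
    feats += ["POS=" + t for t in pair["pos_tags"]]    # part-of-speech features
    feats.append("E1TYPE=" + pair["e1_type"])          # entity attributes
    feats.append("E2TYPE=" + pair["e2_type"])
    feats.append("OVERLAP=" + pair["overlap"])         # positional overlap of the two mentions
    feats += ["HN=" + c for c in pair["hownet"]]       # HowNet concept features
    return feats

def vectorize(feats, index):
    """Map symbolic features to sorted indices of a sparse one-hot vector,
    growing the shared feature index as unseen features appear."""
    for f in feats:
        if f not in index:
            index[f] = len(index)
    return sorted({index[f] for f in feats})

pair = {"words": ["中国", "北京"], "pos_tags": ["ns", "ns"],
        "e1_type": "GPE", "e2_type": "GPE",
        "overlap": "E1_CONTAINS_E2", "hownet": ["place|地方"]}
index = {}
print(vectorize(extract_features(pair), index))  # → [0, 1, 2, 3, 4, 5, 6]
```

In a real system the index would be built once over the training corpus and frozen before testing; here it simply grows on first sight of each feature.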
     Building on these results, and to reduce SVM's dependence on a large training corpus, the semi-supervised learning method TSVM is introduced into Chinese entity relation extraction. Experiments show that TSVM outperforms SVM by a wide margin when the number of training vectors is very small, but falls behind SVM once the number of labeled examples grows large. On a relatively simple problem such as relation detection, a TSVM classifier achieves good performance with only a small amount of labeled data plus a large amount of unlabeled data, lowering the cost of the extraction system and improving its portability; on the harder problem of relation type classification, however, TSVM performance remains unsatisfactory, and other semi-supervised methods should be considered. The thesis also studies and implements a multi-class TSVM construction.
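The multi-class construction mentioned above can be illustrated with the standard one-vs-one voting scheme, in which one binary classifier is trained per pair of classes and the class collecting the most votes wins. The binary "trainer" below is a deliberately trivial nearest-mean rule on 1-D numbers, a stand-in for a real (T)SVM; only the pairwise training and voting structure is the point.

```python
# One-vs-one multi-class construction from binary classifiers: a common way
# a binary (T)SVM is extended to K relation classes. train_binary is a toy
# stand-in for a real (T)SVM trainer.

from itertools import combinations
from collections import Counter

def train_binary(X, y, a, b):
    """Toy binary 'classifier': assign x to the class with the nearer mean."""
    mean_a = sum(x for x, lab in zip(X, y) if lab == a) / y.count(a)
    mean_b = sum(x for x, lab in zip(X, y) if lab == b) / y.count(b)
    return lambda x: a if abs(x - mean_a) <= abs(x - mean_b) else b

def train_one_vs_one(X, y):
    """Train one binary classifier per unordered pair of classes."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        Xab = [x for x, lab in zip(X, y) if lab in (a, b)]
        yab = [lab for lab in y if lab in (a, b)]
        models[(a, b)] = train_binary(Xab, yab, a, b)
    return models

def predict_vote(models, x):
    """Each pairwise classifier votes; the majority class wins."""
    votes = Counter(clf(x) for clf in models.values())
    return votes.most_common(1)[0][0]

X = [1, 2, 9, 10, 20, 21]
y = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_one(X, y)
print(predict_vote(models, 1.5))   # → A
print(predict_vote(models, 25.0))  # → C
```

For K classes this trains K(K-1)/2 binary models, which stays manageable for the handful of relation types and subtypes in ACE-style tasks.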
     Future work falls into two directions: first, improving the current feature sets, for example by adding features such as chunking information and HowNet concept structure to raise extraction performance and by performing more precise parameter selection; second, quantitatively studying how the choice of labeled data affects performance and how much labeled data SVM and TSVM respectively require.
