Research on the Information Extraction System in the Sports Domain

Original title (Chinese): 体育领域信息抽取系统的研究
Author: Gao Guoyang (高国洋)
Thesis level: Master's
Discipline: Communication and Information Systems
Keywords: information extraction ; named entity recognition ; entity relation extraction ; conditional random fields
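Degree year: 2010
Advisors: Qi Yincheng (戚银城) ; Zhang Suxiang (张素香)
Subject code: 081001
Degree-granting institution: North China Electric Power University (Hebei)
Thesis submission date: 2009-12-10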

Abstract

Information extraction, an automated information processing technology, has become an active research topic in natural language processing. This thesis first studies two key techniques of information extraction: named entity recognition and automatic entity relation extraction. It proposes a Chinese named entity recognition method based on conditional random fields that fuses multiple knowledge sources, and a method based on conditional random fields for automatically extracting entity relations in sports news. Building on these, it then designs and implements a sports event information extraction system that combines statistical and rule-based methods. The experimental corpus comes from Sina (www.sina.com) and Sohu (www.sohu.com). Experiments show that the proposed approach is effective: the system reaches a precision of 95.70%, a recall of 93.00%, and an F-measure of 94.33%.
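The three reported figures can be checked against one another. The sketch below assumes the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall. The abstract does not state the weighting, so `beta = 1.0` is an assumption, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported_precision = 95.70
reported_recall = 93.00

# Assumption: balanced weighting (beta = 1), which the abstract does not state.
f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")
```

Under that assumption the result rounds to 94.33%, which agrees with the reported F-measure.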

