Information Extraction for Breaking Events Based on Event Frames

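English title: Breaking Events' Information Extraction Based on Event Frame
Author: Feng Li (冯礼)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 信息抽取; 事件框架; 新侧面探测; 词对特征
English keywords: information extraction; event frame; new event detection; word pair feature
Degree year: 2008
Advisor: Sheng Huanye (盛焕烨)
Discipline code: 081203
Degree-granting institution: Shanghai Jiao Tong University
Submission date: 2008-01-01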

Abstract

In an age of information explosion, news information extraction based on event frames can better meet people's need to obtain useful information from the Internet. By analyzing a news corpus, we predefine frame structures for three types of breaking events, so that each aspect (flank) of an event can be handled with a customized method. Using part-of-speech (POS) tagging of news reports, lookups in a location database, and extraction rules derived from corpus study, the system effectively extracts aspects of a news event such as its time, location, and results.
     Because news events are complex and evolve dynamically, frame-based information extraction has a limitation: a static frame structure restricts which aspects can be extracted. To address this, this thesis introduces new aspect detection for events, which automatically discovers aspects that are not predefined in the frame. To make full use of POS, word order, and the relations between words in a sentence, it uses a word pair feature model for feature extraction and a paragraph-based LSA clustering algorithm to perform new aspect detection.
     Tests of a prototype system on a breaking-event corpus show that the proposed methods are feasible: the extraction of breaking-news elements achieves high precision and recall. The results of new aspect detection capture both the characteristics of individual events and some commonalities among events of the same type that are not covered by the frame. The experimental results demonstrate the application prospects of this research.
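
The word pair features mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the model, so everything here is an assumption made for illustration: the tag set `CONTENT_TAGS`, the `max_gap` window, and the function names `word_pairs` and `paragraph_features` are hypothetical and are not taken from the thesis. The sketch only shows how ordered pairs of content words can encode POS filtering, word order, and co-occurrence within a sentence.

```python
from collections import Counter
from itertools import combinations

# Assumption: only nouns ("n") and verbs ("v") count as content words.
CONTENT_TAGS = {"n", "v"}


def word_pairs(tagged_sentence, max_gap=3):
    """Return ordered (left, right) pairs of content words.

    tagged_sentence is a list of (word, pos_tag) tuples. Two content
    words form a pair when they are at most max_gap positions apart in
    the filtered content-word sequence; the left word always precedes
    the right word in the sentence, so word order is preserved.
    """
    content = [word for word, tag in tagged_sentence if tag in CONTENT_TAGS]
    pairs = []
    for i, j in combinations(range(len(content)), 2):
        if j - i <= max_gap:
            pairs.append((content[i], content[j]))
    return pairs


def paragraph_features(tagged_sentences, max_gap=3):
    """Count word pairs over all sentences of one paragraph."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(word_pairs(sentence, max_gap))
    return counts


if __name__ == "__main__":
    sentence = [("earthquake", "n"), ("hit", "v"), ("the", "u"), ("city", "n")]
    print(word_pairs(sentence))
    print(paragraph_features([sentence, sentence]))
```

Under these assumptions, the per-paragraph pair counts could serve as the rows of a paragraph-by-feature matrix, which is the kind of input an LSA-style clustering step would decompose.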
