Research on Information Extraction from Follow-up Reports of Emergency Events
Abstract
As information and communication technology advances, the ways in which information spreads keep changing. With the growth and spread of the Internet as a new medium in particular, people's lives have become part of a society flooded with information. Coping with this information explosion calls for further development of natural language processing technology. In recent years, emergencies of many kinds have occurred frequently and have had a great impact on people's lives, so public attention to them keeps rising. Each emergency is followed by a series of related reports from authoritative media. Collecting, analyzing, and processing this information helps with the in-depth study and prevention of emergencies. It also helps the public grasp the whole course of an event, from onset through development to conclusion, and helps dispel people's fear of emergencies.
     Information extraction means accurately and efficiently extracting the information of interest from relevant documents. After reviewing the research background and significance of information extraction, its state of development, and related techniques, this paper addresses a shortcoming of current Chinese information extraction work: most Chinese event extraction operates on a single document. Taking multiple texts about the same event as the object of study, it proposes a method that combines pattern matching with statistical learning to extract information accurately.
     This paper studies three categories of tracked emergencies. It analyzes textual characteristics of the related reports, such as their continuity and their multiple perspectives, defines an information extraction template for each category, and applies the combined pattern-matching and statistical-learning method to the documents about an emergency. Simple semantic inference is then used to merge information about the same aspect of the emergency, and the extraction results are finally displayed in chronological order. The main work is as follows:
     1. Through statistics on and observation of a dataset containing multiple texts about the same emergency, the characteristics of emergency news reports and their relationship to subsequent follow-up reports are analyzed in depth. Effective features that support event information are identified, and corresponding event extraction rules are built to extract information on the relevant aspects of an emergency and write it to a database.
     2. For information about the same aspect of the same emergency extracted from follow-up reports, corresponding semantic inference rules are constructed to merge the information.
     3. For data in the extraction results that follow irregular patterns, annotations are provided to explain why they occur. An information development chain for the emergency is formed and displayed as a chronological trace.
     Experimental results show that the proposed method is a preliminary exploration of extraction from multiple texts about the same event and that it achieves a certain degree of effectiveness. The work is not yet comprehensive enough for efficient extraction from follow-up reports, and further research could bring in related Chinese information processing techniques.
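The workflow described in the abstract (template-based extraction, merging of same-aspect information across follow-up reports, chronological display with annotations) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the slot names, the regular expressions, the sample reports, and the single inference rule (a cumulative count that drops is annotated as a possible correction) are hypothetical and are not the rules used in the work itself. The sketch covers only the pattern-matching side; the statistical-learning component is omitted.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical extraction template for one event category:
# each slot (aspect of the event) maps to a hand-written pattern.
TEMPLATE = {
    "deaths": re.compile(r"(\d+)\s+(?:people\s+)?killed"),
    "injured": re.compile(r"(\d+)\s+(?:people\s+)?injured"),
}


@dataclass
class Entry:
    day: date
    slot: str
    value: int
    note: str = ""  # annotation for values that break the expected pattern


def extract(text):
    """Fill template slots from one report using pattern matching."""
    found = {}
    for slot, pattern in TEMPLATE.items():
        match = pattern.search(text)
        if match:
            found[slot] = int(match.group(1))
    return found


def build_timeline(reports):
    """Merge same-slot values across reports in chronological order.

    Assumed inference rule: cumulative counts normally do not decrease,
    so a lower later value is kept but annotated. The merged value for
    each slot is simply the most recent one.
    """
    timeline = []
    merged = {}
    for day, text in sorted(reports, key=lambda r: r[0]):
        for slot, value in extract(text).items():
            note = ""
            previous = merged.get(slot)
            if previous is not None and value < previous:
                note = f"lower than earlier figure {previous}; possibly a correction"
            merged[slot] = value
            timeline.append(Entry(day, slot, value, note))
    return timeline, merged


# Made-up follow-up reports about one event, deliberately out of order.
reports = [
    (date(2020, 1, 2), "Rescue continues; 30 people killed and 45 injured so far."),
    (date(2020, 1, 1), "A landslide hit the village. 12 killed, 40 injured."),
    (date(2020, 1, 3), "Officials revised the toll: 28 killed. 47 people injured."),
]

timeline, merged = build_timeline(reports)
for entry in timeline:
    suffix = f"  [{entry.note}]" if entry.note else ""
    print(f"{entry.day} {entry.slot}: {entry.value}{suffix}")
```

Keeping every extracted value in the timeline, rather than overwriting earlier ones, is what allows an irregular figure to be shown together with its explanatory note.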
