Research on Multi-document Automatic Summarization of Online News

Original title (Chinese): 网络新闻多文档自动摘要技术研究
Author: Xu Xuyang (许旭阳)
Degree level: Master's
Discipline: Signal and Information Processing
Keywords: Online News; Conditional Random Fields; User-Defined Rules; Time Expression; Classification; Event Instance; Clustering; Event Extraction; Multi-Document Summarization
Degree year: 2011
Advisor: Li Bicheng (李弼程)
Discipline code: 081002
Degree-granting institution: PLA Information Engineering University (解放军信息工程大学)
Submission date: 2011-04-15

Abstract

The growing popularity of the Internet and the steady development of computer technology have made it far easier for people to obtain information. Faced with massive volumes of online data, however, finding the knowledge that is interesting and useful remains an urgent open problem. Among the many approaches studied, multi-document automatic summarization is regarded as one of the effective tools for addressing it. It is a natural language processing technique in which a computer condenses the main content of multiple documents on the same topic into a short text by means of information compression, and it is of great practical significance in both military and civilian applications. This thesis studies multi-document automatic summarization for online news: relevant events are first extracted from an online news topic, the events are then organized using different techniques, and a summary is finally generated. The main contributions are as follows:
     (1) Time expression recognition is studied, and a recognition method based on conditional random fields (CRFs) combined with user-defined rules is proposed. To address the shortcomings of traditional approaches, which rely on a single technique and are limited to narrow application domains, CRFs are first used for preliminary recognition of time expressions; user-defined rules are then applied to correct misrecognized and missed time expressions. Experimental results show that the method effectively improves the precision and recall of time expression recognition and provides a flexible analysis model for the task.
     (2) Event extraction is studied, and an event extraction method for news text driven by event instances is proposed. To address the imbalance between positive and negative examples and the data sparseness that affect trigger-driven and argument-driven event extraction, the method is driven by event instances; clustering is then introduced to extract the events in a collection of news documents, which removes the restriction of traditional methods to predefined event categories. Experimental results show that the method significantly improves event extraction performance on news document collections.
     (3) Multi-document automatic summarization is studied, and a summarization method based on event extraction is proposed. To address the redundancy of existing methods that cluster paragraphs or sentences, event extraction is used to convert the original documents into a logical partition of content in units of events; the summary is then generated by extracting, ordering, and polishing the main events. Experimental results show that the generated summaries are closer to human understanding and thus help users grasp the full course of events in a timely, accurate, and convenient manner.
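
The two-stage idea in contribution (1) can be illustrated with a minimal sketch. It assumes that a first-pass statistical tagger (for example a CRF) has already produced candidate character spans, and it shows only how a rule stage might then drop false hits and recover missed or truncated ones. The rule pattern, the stoplist, and the function name below are hypothetical; the abstract does not list the thesis's actual rules or CRF features.

```python
import re

# Hypothetical date rule: an optional year followed by month and day,
# e.g. "2011年4月15日" or "4月15日". Not the thesis's actual rule set.
TIME_RULE = re.compile(r"(?:\d{4}年)?\d{1,2}月\d{1,2}日")

# Hypothetical stoplist of strings a first pass may wrongly tag as time.
NON_TIME = {"同时", "一点点"}


def correct_time_spans(text, first_pass_spans):
    """Apply rule-based corrections to first-pass (start, end) spans."""
    # Step 1: drop spans whose surface string is a known false positive.
    kept = [(s, e) for s, e in first_pass_spans if text[s:e] not in NON_TIME]

    # Step 2: add every rule match, which recovers expressions the first
    # pass missed or recognized only in part.
    candidates = sorted(kept + [m.span() for m in TIME_RULE.finditer(text)])

    # Step 3: merge overlapping spans so each expression appears once.
    merged = []
    for start, end in candidates:
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "会议于2011年4月15日召开,同时发布报告。"
# Assumed first-pass output: a truncated date and one false positive.
first_pass = [(8, 13), (16, 18)]
spans = correct_time_spans(text, first_pass)
print([text[s:e] for s, e in spans])
```

In this sketch the rule stage both widens the truncated span to the full date and removes the false hit, which is the kind of precision and recall correction the abstract attributes to the user-defined rules.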

