Research of Web Multi-document Automatic Summarization

Original title: Web多文档自动文摘研究
Author: Fu Hongyan (付红艳)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords: multi-document automatic summarization; sentence similarity; partial subject (local topic); summarization sentence
Degree year: 2010
Advisor: Zhang Wenyi (张文燚)
Discipline code: 081202
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Submission date: 2009-12-01

Abstract

The Political Party Diplomacy Auxiliary Decision Support System is an intelligent clustering search system: given input topic words, it retrieves a large set of documents on the same topic and presents automatic summaries of them, so that users can browse the information quickly and make correct decisions promptly and accurately. Automatic summarization is one component of this system, and the research in this thesis was proposed to further optimize the system.
     Web multi-document automatic summarization aims to present comprehensive, concise information to users and to save them browsing time. At present, there are two main classes of multi-document summarization methods. The first ranks all sentences in the document set together by weight and selects summary sentences in rank order according to the compression ratio. The second divides the document set into several local topics (partial subjects) and then selects summary sentences from the different local topics. Because users require summaries that are both comprehensive and concise, this thesis focuses on the second class.
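As a rough illustration of the two classes of methods described above, the sketch below selects summary sentences from a toy set of weighted sentences. The sentence weights, the local topic (partial subject) labels, and the function names `summary_size`, `select_global`, and `select_by_topic` are hypothetical and are not taken from the thesis. The compression ratio is interpreted here, as an assumption, as the fraction of sentences kept.

```python
import math
from collections import defaultdict

def summary_size(n_sentences, compression_ratio):
    """Number of sentences to keep (at least one). Assumed interpretation."""
    return max(1, math.floor(n_sentences * compression_ratio))

def select_global(sentences, compression_ratio):
    """Class 1: rank all sentences together by weight and take the top ones."""
    k = summary_size(len(sentences), compression_ratio)
    ranked = sorted(sentences, key=lambda s: s["weight"], reverse=True)
    return [s["text"] for s in ranked[:k]]

def select_by_topic(sentences, compression_ratio):
    """Class 2: group sentences by local topic, then take the best remaining
    sentence from each topic in round-robin order until k are chosen."""
    k = summary_size(len(sentences), compression_ratio)
    groups = defaultdict(list)
    for s in sentences:
        groups[s["topic"]].append(s)
    # One queue per topic, highest weight first.
    queues = [sorted(g, key=lambda s: s["weight"], reverse=True)
              for g in groups.values()]
    chosen = []
    while len(chosen) < k and any(queues):
        for q in queues:
            if q and len(chosen) < k:
                chosen.append(q.pop(0)["text"])
    return chosen

# Toy data: weights and topic labels are made up for illustration.
sentences = [
    {"text": "A1", "topic": "A", "weight": 0.9},
    {"text": "A2", "topic": "A", "weight": 0.8},
    {"text": "A3", "topic": "A", "weight": 0.7},
    {"text": "B1", "topic": "B", "weight": 0.4},
    {"text": "C1", "topic": "C", "weight": 0.3},
    {"text": "C2", "topic": "C", "weight": 0.2},
]
global_summary = select_global(sentences, 0.5)   # ['A1', 'A2', 'A3']
topic_summary = select_by_topic(sentences, 0.5)  # ['A1', 'B1', 'C1']
```

On this toy input the global ranking draws all three sentences from topic A, while per-topic selection takes one sentence from each topic, which illustrates the kind of coverage that topic-based selection is meant to provide.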
     This thesis concentrates on four aspects of multi-document automatic summarization: similarity computation, local topic division, optimal selection of summary sentences, and ordering of summary sentences.
     Through in-depth study and analysis of these aspects, this thesis improves the methods for optimal selection and ordering of summary sentences based on local topic division. The main contributions are as follows. It improves the computation of semantic distance between words and proposes a sentence similarity measure that fuses Euclidean distance with semantic distance. It optimizes the k-medoids (k-central-point) algorithm so that seed points and the number of clusters are discovered automatically from sentence density. It improves the local topic scoring method and the method for judging sentence information coverage, and thereby optimizes the iterative strategy for selecting summary sentences. It proposes an improved three-layer ordering method that builds on the two-layer ordering method. Finally, the algorithms are applied in a Web multi-document automatic summarization system, and experiments and an analysis of the results are reported.
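The abstract does not give the formulas behind the fused sentence similarity or the density-based seed discovery, so the following is only a minimal sketch of how such ideas could fit together, not the thesis's method. Everything specific in it is an assumption made for illustration: the term-count vectors, the 1/(1+d) mapping from Euclidean distance to similarity, the toy `SEMANTIC_DISTANCE` table, the mixing weight `alpha`, and the `threshold` and `min_density` parameters. Only seed discovery is shown; the k-medoids iteration itself is omitted.

```python
import math
from collections import Counter

# Hypothetical word-to-word semantic distances in [0, 1]. Unlisted pairs
# default to 1.0 (unrelated); identical words have distance 0.0.
SEMANTIC_DISTANCE = {
    frozenset(["car", "automobile"]): 0.1,
    frozenset(["quick", "fast"]): 0.2,
}

def word_distance(w1, w2):
    if w1 == w2:
        return 0.0
    return SEMANTIC_DISTANCE.get(frozenset([w1, w2]), 1.0)

def euclidean_similarity(s1, s2):
    """Map the Euclidean distance between term-count vectors into (0, 1]."""
    c1, c2 = Counter(s1), Counter(s2)
    vocab = set(c1) | set(c2)
    dist = math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in vocab))
    return 1.0 / (1.0 + dist)

def semantic_similarity(s1, s2):
    """Average, over both directions, of each word's closeness to its
    nearest word in the other sentence. Sentences must be non-empty."""
    def directed(a, b):
        return sum(1.0 - min(word_distance(w, v) for v in b) for w in a) / len(a)
    return 0.5 * (directed(s1, s2) + directed(s2, s1))

def fused_similarity(s1, s2, alpha=0.5):
    """Weighted combination of the two measures; alpha is an assumed weight."""
    return (alpha * euclidean_similarity(s1, s2)
            + (1 - alpha) * semantic_similarity(s1, s2))

def discover_seeds(sentences, threshold=0.5, min_density=1):
    """Assumed rule: a sentence's density is the number of other sentences
    at least `threshold` similar to it. Dense sentences become seeds unless
    they are too similar to an earlier seed; the seed count then serves as
    the number of clusters."""
    n = len(sentences)
    sim = [[fused_similarity(sentences[i], sentences[j]) for j in range(n)]
           for i in range(n)]
    density = [sum(1 for j in range(n) if j != i and sim[i][j] >= threshold)
               for i in range(n)]
    seeds = []
    for i in sorted(range(n), key=lambda i: density[i], reverse=True):
        if density[i] < min_density:
            break
        if all(sim[i][s] < threshold for s in seeds):
            seeds.append(i)
    return seeds

# Toy tokenized sentences, made up for illustration.
a = ["the", "car", "is", "fast"]
b = ["the", "automobile", "is", "quick"]
c = ["stocks", "fell", "sharply", "today"]
d = ["stocks", "fell", "today"]
sim_ab = fused_similarity(a, b)
sim_ac = fused_similarity(a, c)
seeds = discover_seeds([a, b, c, d])
```

Here `a` and `b` share few surface words, so the Euclidean term alone rates them as fairly distant, but the semantic term raises their fused similarity well above that of the unrelated pair `a` and `c`. With the default parameters, `discover_seeds` returns one seed for each of the two topics in the toy data.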

