Research on Topic-Oriented Document Summarization Techniques (面向主题的文档摘要技术研究)


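Author: Liu Zhihua (刘治华)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 文档摘要 (document summarization); 主题 (topic); 关键词提取 (keyword extraction); 句子包含关系 (inclusion relationship between sentences); 垂直搜索 (vertical search)
English keywords: Summarization; Query; Keywords Extraction; Inclusion Relationship between Sentences; Vertical Search
Degree year: 2011
Advisors: Wang Jingzhong (王景中); Zhang Huaping (张华平)
Discipline code: 081202
Degree-granting institution: North China University of Technology (北方工业大学)
Thesis submission date: 2011-05-09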

Abstract

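With the rapid development of the economy and society, and especially of the Internet, the volume of online information is growing explosively. How to quickly obtain useful information from this mass of data has become a pressing problem. Today's flourishing search engine technology is used mainly to retrieve general-purpose information; for specific domains whose content is not suitable for public release, no mature system yet exists.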
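     This thesis targets an important stage of the standardization process, namely the effective mining and retrieval of standards information. It implements a vertical search engine for massive information and concentrates on the summarization technology within the search engine (that is, topic-oriented summarization). The emphasis is on balancing three goals: summary extraction efficiency, relevance to the query topic, and coverage of the document's main content, so as to meet users' information needs.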
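     The work concentrates on three aspects: improving the processing efficiency of summarization, extracting summaries based on keywords, and removing redundancy among summary sentences. For efficiency, the inverted list structure of search engines is introduced to compute statistics on word and sentence features, and a double-array trie stores the word segmentation dictionary and the user topic vocabulary to speed up word lookup. Keyword extraction is combined with summary extraction; within keyword extraction, word adjacency categories (accessor variety) and the positional locality of words are introduced to improve the quality of high-frequency and low-frequency words, respectively. For redundancy removal, the thesis proposes the concept of the degree of inclusion between sentences, which effectively deduplicates sentences in a document that stand in an inclusion relationship, lowering the redundancy of the summary and improving its quality.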
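The abstract does not define the degree of inclusion between sentences, so the following is only a minimal sketch under an assumed definition: the inclusion of sentence A in sentence B is the fraction of A's distinct words that also occur in B. The function names, the pre-tokenized input, and the 0.8 threshold are illustrative assumptions, not the thesis's actual formula or parameters.

```python
def inclusion_degree(a_tokens, b_tokens):
    """Fraction of A's distinct tokens that also appear in B (assumed definition)."""
    a_set, b_set = set(a_tokens), set(b_tokens)
    if not a_set:
        return 0.0
    return len(a_set & b_set) / len(a_set)


def remove_included(ranked_sentences, threshold=0.8):
    """Greedy filter over tokenized sentences ordered best-first.

    A candidate is dropped when it is largely contained in an already
    selected sentence, or when it largely contains one.
    """
    selected = []
    for tokens in ranked_sentences:
        redundant = any(
            max(inclusion_degree(tokens, kept),
                inclusion_degree(kept, tokens)) >= threshold
            for kept in selected
        )
        if not redundant:
            selected.append(tokens)
    return selected


if __name__ == "__main__":
    ranked = [
        "the standard defines test methods for steel pipes".split(),
        "the standard defines test methods".split(),
        "vertical search engines index a single domain".split(),
    ]
    for tokens in remove_included(ranked):
        print(" ".join(tokens))
```

Because the measure is asymmetric, a short sentence fully covered by a longer one is detected even when the two would score only moderately under a symmetric similarity such as cosine.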
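     In addition, the thesis implements a vertical search prototype system and uses it as an application scenario for topic-oriented summarization. Techniques such as coding compression, memory swapping, and buffer pools were successfully applied in the working system, which has been applied to standards retrieval and organization search and is now online at the Hebei Institute of Standardization (河北省标准化研究院) and the Name and Address Center of China Post Group (中国邮政集团名址中心). Automatic summarization, vertical search, and database connectivity are linked into a complete, unified solution for managing standards information, bringing summarization and search technology to end users.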
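The abstract does not say which coding scheme the system uses. As one widely used example of compression in search engine indexes, the sketch below applies variable-byte encoding to the gaps between sorted document IDs in a posting list. All function names are hypothetical, and the byte layout (low-order 7-bit groups first, high bit set on the last byte of each number) is one of several conventions.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order group first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers can be encoded")
        while n >= 128:
            out.append(n & 0x7F)   # continuation byte: high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte of this number: high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store a sorted posting list as variable-byte encoded gaps."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


if __name__ == "__main__":
    ids = [1000000, 1000001, 1000005]
    packed = compress_postings(ids)
    print(len(packed), decompress_postings(packed) == ids)
```

Gap encoding pays off because consecutive IDs in a long posting list tend to be close together, so most gaps fit in a single byte.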

English abstract

With the rapid development of the economy, society, and the Internet, the volume of network information is growing explosively. How to quickly obtain useful information from massive amounts of data has become a problem that urgently needs to be solved. At present, search engine technology is used mainly for general information processing, but for specific fields there is no mature system.
     This thesis studies the summarization technology used in search engines, focusing on efficiency, relevance to the query, and reflection of the document's main content.
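One common way to trade off relevance to the query against coverage of the document's main content is a linear interpolation of two similarity scores. The sketch below assumes bag-of-words cosine similarity, pre-tokenized sentences, and a hypothetical weight `lam`; the abstract does not state the scoring function the thesis actually uses.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score_sentences(sentences, query, lam=0.5):
    """Rank sentences by lam * sim(sentence, query) + (1 - lam) * sim(sentence, document).

    Returns (score, sentence_index) pairs, best first; ties keep document order.
    """
    doc_vec = Counter(t for s in sentences for t in s)
    query_vec = Counter(query)
    scored = []
    for idx, tokens in enumerate(sentences):
        vec = Counter(tokens)
        score = lam * cosine(vec, query_vec) + (1 - lam) * cosine(vec, doc_vec)
        scored.append((score, idx))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    sentences = [
        ["steel", "pipe", "test"],
        ["steel", "pipe", "standard"],
        ["weather", "report"],
    ]
    for score, idx in score_sentences(sentences, ["test"], lam=0.5):
        print(round(score, 3), " ".join(sentences[idx]))
```

Setting `lam` to 1.0 ranks purely by similarity to the query, while 0.0 ranks purely by similarity to the document as a whole.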
     Automatic summarization is the core of this thesis. The work focuses on three areas: improving summarization efficiency, introducing keyword extraction into summarization, and removing redundancy between summary sentences. To improve summarization efficiency, we introduce the inverted list structure to calculate the features of words and sentences, and use a Double-Array Trie to store the segmentation dictionary and user thesauri, so as to improve the efficiency of word lookup. Keyword extraction is combined with summary extraction, and accessor variety and the position locality of words are introduced to improve the extraction of high-frequency and low-frequency words, respectively. To lessen the redundancy between sentences, this thesis proposes the concept of inclusion between sentences. Using sentence inclusion reduces the probability that two sentences, one of which includes the other, are both extracted into the summary, thereby improving summary quality.
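Accessor variety is usually computed from the number of distinct items that appear immediately before and after a candidate string. The sketch below is a simplified token-level version, taking the smaller of the two neighbor counts and ignoring sentence boundaries. That simplification and the function name are assumptions; how the thesis combines this value with other keyword features is not stated in the abstract.

```python
from collections import defaultdict


def accessor_variety(sentences):
    """Map each token to min(#distinct left neighbors, #distinct right neighbors).

    `sentences` is a list of token lists. Sentence boundaries are not counted
    as neighbors in this simplified version.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if i > 0:
                left[tok].add(tokens[i - 1])
            if i + 1 < len(tokens):
                right[tok].add(tokens[i + 1])
    vocab = {tok for tokens in sentences for tok in tokens}
    return {tok: min(len(left[tok]), len(right[tok])) for tok in vocab}


if __name__ == "__main__":
    corpus = [
        ["a", "search", "engine"],
        ["the", "search", "index"],
        ["a", "search", "engine"],
    ]
    av = accessor_variety(corpus)
    for tok in sorted(av, key=lambda t: (-av[t], t)):
        print(tok, av[tok])
```

A token that occurs in many different contexts receives a high value, whereas one that only ever appears inside a single fixed phrase receives a low value.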
     In addition, the thesis implements a vertical search prototype system and applies query-focused summarization in vertical search. Coding compression, memory exchange, and memory cache technology were successfully applied in the word segmentation system, and the system, applied to standards retrieval and organization search, has gone online at the Institute of Standardization of Hebei Province and the Name Address Center of China Post. During testing, it integrated automatic summarization, vertical search, and database connection, and provided a unified solution for standards management. The solution brings summarization and search technology to end users.

