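Research on an Automatic Summarization System Based on RSS Source Text

English title: Research on Automatic Summarization System Based on RSS
Author: Liu Qiyuan (刘启元)
Thesis level: Master's
Discipline: Information Resource Management
Keywords: Automatic Summarization ; Machine Learning ; Automatic Classification ; Regression Analysis ; Automatic Summarization Evaluation
Degree year: 2012
Advisor: Ye Ying (叶鹰)
Discipline code: 081203
Degree-granting institution: Zhejiang University
Submission date: 2012-06-01
Chair of the defense committee: Liu Xiaoqing (刘晓清)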

Abstract

As the volume of online information grows exponentially, retrieving information from massive data and grasping its gist has become a problem worth studying. Search engines and RSS push technology address the "source" of information but do not adequately address its "quantity". Automatic summarization is one effective way to compress and refine information and to cope with information overload: it uses computers to automatically extract or generate, from the original document, a short and coherent text that reflects its central content, helping users obtain the gist quickly, accurately, and comprehensively.
     This thesis argues that news on different topics calls for different feature combination models of summary sentences, so automatic text classification should be a prerequisite for automatic summarization. A classification corpus was built through web crawling, web page cleaning, and data storage, and on this basis four feature selection algorithms were used together with the Simple Vector Distance classification algorithm to perform automatic classification. Two measures of summary sentences are proposed: Probability and Possibility. Based on a purpose-built summary corpus, supervised machine learning algorithms based on regression analysis (linear regression and logistic regression) are trained to determine the optimal parameters of the feature combination model. For Chinese text, an improved family of evaluation algorithms, ROUGE-CN, is proposed and used both to measure the Probability of summary sentences and to evaluate machine-generated summaries.
     Comparative experiments between summaries produced by the machine learning method, baseline summaries, and Word summaries show that, with automatic classification as a prerequisite, supervised machine learning based on regression analysis effectively improves the quality of machine-generated Chinese news summaries.
     The innovation of this thesis lies in combining online RSS feeds with regression-based machine learning for automatic summarization, leading to the design and implementation of an automatic summarization system based on RSS source text. The system takes online RSS feeds as its data source, extracts metadata from the original text with regular expression matching, offers a choice of feature selection algorithms, classification algorithms, machine learning algorithms, and compression rates, and combines automatic classification with automatic summarization to produce a class label and a machine summary, presenting the news summary online alongside the original text.
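
The abstract does not spell out how ROUGE-CN is computed, so the following is only a rough illustration of the general idea of scoring a Chinese machine summary against a reference by n-gram overlap. It assumes standard ROUGE-N recall applied to character n-grams (which sidesteps word segmentation) and a simple top-k sentence selection under a compression rate. The function names `char_ngrams`, `rouge_n_recall`, and `select_summary` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a multiset of character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: plain ROUGE-N recall over characters, not the thesis's ROUGE-CN.
    """
    ref = char_ngrams(reference, n)
    cand = char_ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def select_summary(sentences, scores, compression_rate):
    """Keep the highest-scoring sentences, returned in their original order."""
    keep = max(1, round(len(sentences) * compression_rate))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

In a pipeline like the one described, the per-sentence scores passed to `select_summary` would come from the trained regression model, and `rouge_n_recall` would compare the resulting summary with a human-written reference.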

