Research on Key Technologies of Topic-Oriented Multi-Document Automatic Summarization (面向主题的多文档自动文摘关键技术研究)
Abstract
With the rapid rise of the mobile Internet, users routinely rely on mobile devices to locate and retrieve useful information from very large collections of text. Information service providers therefore need the ability to mine large volumes of text quickly and effectively, and to present the important information to users in a concise, well-organized form. A user can, for example, subscribe to an automatic summarization service on a mobile phone: the service extracts the important information from multiple documents and presents it to the user organized by topic. A high-quality automatic summary has a clear structure and good readability; it presents the background and development of an event from multiple angles, saves the user's browsing time, and relieves the user of the burden of piecing together complete information from many sources. Following this technical trend, this thesis presents an exploratory study of the key technologies of topic-oriented multi-document automatic summarization.
     This thesis proposes the following novel theories and methods:
     1. A novel LDA-based modeling approach for capturing the topics in a document collection. To quantitatively evaluate the effectiveness of this approach, it is used to generate topic-oriented summary description templates from a large collection of texts of the same type. First, an LDA-based entity-topic model is proposed that simultaneously performs semantic labeling and clustering of sentences and of the words within them. Then, a frequent-subtree pattern mining algorithm is applied to the dependency parse trees of the clustered and labeled sentences to construct topic-oriented summary templates. To further verify the usefulness of the generated templates, a template-based topic-oriented summarization method is implemented.
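     To make the sentence-clustering step above concrete, the following minimal sketch uses a plain gensim LDA model as a stand-in for the entity-topic model (whose exact formulation is not given in this abstract): each sentence is treated as a short document and assigned to its most probable topic. The example sentences and the number of topics are illustrative assumptions, and the frequent-subtree template-mining stage is omitted.

```python
# Minimal sketch: plain gensim LDA as a stand-in for the entity-topic model.
# Sentences are treated as short documents and grouped by dominant topic;
# the example sentences and num_topics are invented for illustration.
from collections import defaultdict
from gensim import corpora, models

sentences = [
    "the earthquake struck the coastal city at dawn",
    "rescue teams searched the collapsed buildings for survivors",
    "the company reported strong quarterly earnings on monday",
    "investors welcomed the earnings announcement",
]
tokenized = [s.split() for s in sentences]

dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train a small LDA model over the sentence "documents".
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=50, random_state=0)

# Assign each sentence to its most probable topic; the resulting clusters
# are what a template-mining stage would consume.
clusters = defaultdict(list)
for sentence, bow in zip(sentences, corpus):
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
    clusters[topic_id].append(sentence)

for topic_id, members in sorted(clusters.items()):
    print(topic_id, members)
```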
     2. An unsupervised machine learning method for generating topic-oriented multi-document summaries. Within this method, an LDA-based event-topic model is proposed that extends the traditional LDA model: by modeling the probability distributions of words both at the domain level and within the documents of a specific event, it improves the quality of sentence clustering. An extended LexRank algorithm is then used to rank the sentences in each cluster, and integer linear programming is used to select representative, topic-bearing sentences from each cluster to form the summary. The main advantage of this method is that it organically chains sentence clustering, ranking, and selection together. In addition, an improved dependency-tree-based sentence compression algorithm significantly raises compression quality.
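     A minimal sketch of the rank-then-select stage described above follows, assuming standard LexRank (not the extended variant) and a simple length-budget ILP solved with PuLP. The example sentences, similarity threshold, and word budget are invented for illustration; the event-topic clustering and sentence compression steps are not reproduced.

```python
# Minimal sketch: LexRank-style power iteration over a cosine-similarity
# graph, followed by a length-budgeted ILP selection solved with PuLP.
import numpy as np
import pulp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The storm made landfall near the coast late on Monday.",
    "Thousands of residents were evacuated before the storm hit.",
    "Power outages affected large parts of the region.",
    "Officials said repairs to the power grid could take a week.",
    "The storm weakened to a tropical depression by Wednesday.",
]

# --- LexRank-style ranking: damped power iteration on the row-normalized,
# --- thresholded cosine-similarity graph.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)
sim[sim < 0.1] = 0.0                                   # sparsify the graph
row_sums = sim.sum(axis=1, keepdims=True)
transition = sim / np.where(row_sums == 0, 1.0, row_sums)

n = len(sentences)
scores = np.ones(n) / n
damping = 0.85
for _ in range(100):
    scores = (1 - damping) / n + damping * transition.T @ scores

# --- ILP selection: maximize total score under a word-length budget.
lengths = [len(s.split()) for s in sentences]
budget = 25
prob = pulp.LpProblem("summary_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat=pulp.LpBinary) for i in range(n)]
prob += pulp.lpSum(scores[i] * x[i] for i in range(n))
prob += pulp.lpSum(lengths[i] * x[i] for i in range(n)) <= budget
prob.solve(pulp.PULP_CBC_CMD(msg=False))

summary = [sentences[i] for i in range(n) if x[i].value() == 1]
print(" ".join(summary))
```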
     3. A novel summarization method based on a natural language generation model. The method first extracts important information items from the dependency parse tree of each sentence, and then reconstructs the original sentence from these items combined with English syntactic knowledge. Sentence reconstruction is realized by using English syntactic structure to translate the information items into the input of the language generation model, which then generates simple sentences containing those items. Finally, integer linear programming is used to select, from the set of reconstructed sentences, the subset most relevant to the topic.
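     The following sketch illustrates the idea of extracting information items from a dependency parse and re-rendering them as simple sentences, using spaCy for parsing and a naive string template in place of the surface-realisation engine used in the thesis. The example sentence and the restriction to (subject, verb, object) items are simplifying assumptions.

```python
# Minimal sketch: pull (subject, verb, object) information items from a
# dependency parse and re-render them as short declarative sentences.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed


def information_items(text):
    """Yield (subject, verb, object) items from each parsed sentence."""
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children
                            if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children
                           if c.dep_ in ("dobj", "obj")]
                for subj in subjects:
                    for obj in objects:
                        yield (subj.text, token.lemma_, obj.text)


def realise(item):
    """Very rough surface realisation of one information item."""
    subj, verb, obj = item
    return f"{subj.capitalize()} {verb} {obj}."


text = ("The committee, after weeks of heated debate and several postponed "
        "sessions, finally approved the revised budget on Friday.")
for item in information_items(text):
    print(realise(item))   # e.g. "Committee approve budget."
```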
     4. A novel ccTAM (cross-collection topic aspect model) for jointly modeling the topics and aspects of document collections. Using the output of this model, iterative mutual reinforcement on a bipartite graph is applied to extract complementary summaries.
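     A rough sketch of iterative mutual reinforcement on a bipartite sentence-aspect graph, in the spirit of co-HITS, is shown below. The association matrix W, which in the thesis would be derived from the ccTAM output, is filled with made-up values here, and the subsequent selection of complementary sentences is not shown.

```python
# Rough sketch: co-HITS-style mutual reinforcement on a bipartite
# sentence-aspect graph. W[i, j] is the association strength between
# sentence i and aspect j (made-up values for illustration).
import numpy as np

W = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
])


def mutual_reinforcement(W, alpha=0.2, iters=50):
    n_sent, n_aspect = W.shape
    s = np.ones(n_sent) / n_sent          # sentence scores
    a = np.ones(n_aspect) / n_aspect      # aspect scores
    s0, a0 = s.copy(), a.copy()
    # Normalized transition matrices for the two propagation directions.
    P_sa = W / W.sum(axis=1, keepdims=True)        # sentence -> aspect
    P_as = (W / W.sum(axis=0, keepdims=True)).T    # aspect -> sentence
    for _ in range(iters):
        s = alpha * s0 + (1 - alpha) * P_as.T @ a  # aspects vote for sentences
        a = alpha * a0 + (1 - alpha) * P_sa.T @ s  # sentences vote for aspects
        s /= s.sum()
        a /= a.sum()
    return s, a


sentence_scores, aspect_scores = mutual_reinforcement(W)
print(sentence_scores, aspect_scores)
```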
     Based on the theories and methods above, this thesis implements a topic-oriented multi-document automatic summarization system. The system participated in the automatic summarization evaluations organized by the authoritative international TAC conference for two consecutive years and achieved good results on all metrics.
With the rapid rise of the mobile Internet, users frequently need to retrieve useful information from huge data sets via mobile devices. This motivates information service providers to develop the capability to mine huge text collections quickly and deeply, and to present the useful information to users in a concise way. A user can subscribe to a summarization service on a mobile device; the service then extracts the important information from multiple documents and presents it to the user organized by topic. A high-quality automatically generated summary has a well-defined structure and good readability, and can present the important context of a particular event by topic. This saves the user's browsing time and reduces the heavy burden of reading multiple documents in order to digest the complete information. Following this trend, we explore topic-oriented multi-document summarization.
     This thesis proposes a number of novel theories and approaches, including:
     1. We propose a novel LDA-based modeling process for capturing the topics in multiple documents. To quantitatively evaluate the effectiveness of the model, we implement a novel approach that generates templates for topic-oriented summarization from it. We first develop an entity-topic LDA model that simultaneously clusters both sentences and words into topics. We then apply frequent-subtree pattern mining to the dependency parse trees of the clustered and labeled sentences to discover sentence patterns that represent the topics well. To quantitatively evaluate the quality of the automatically generated templates, we use them to construct summaries for new Wikipedia entities.
     2. We propose an unsupervised approach to the automatic generation of topic-oriented summaries from multiple documents. In this method, we propose an event-topic model based on the traditional LDA model, which improves sentence clustering by computing the probability distributions of words both in the domain and in the documents of a specific news event. We then use an extended LexRank algorithm to rank the sentences in each cluster and select representative sentences with Integer Linear Programming. The advantage of our approach is that it unifies the clustering, ranking, and selection components. We also propose a new rule-based sentence compression algorithm over dependency trees that reduces redundancy effectively.
     3. We propose a novel approach to the automatic generation of topic-oriented summaries with a natural language generation model. We first extract important information items from the dependency parse tree of each sentence, and then generate new sentences from these items using English grammatical knowledge. Guided by the grammatical relations in the dependency parse tree, we translate each information item into the input format of the natural language generation engine. Finally, we select topic-oriented sentences from the generated sentence list with Integer Linear Programming.
     4. We propose a cross-collection topic aspect model (ccTAM) to jointly model topics and aspects, and then generate complementary summaries by a random walk on a bipartite graph with iterative mutual reinforcement.
     Based on the theories and methodologies proposed above, we implement a topic-oriented summarization system. The system participated in the TAC guided summarization task in two consecutive years and achieved good performance.
