A Survey of Commonly Used Datasets and Methods for Text Summarization (文本摘要常用数据集和方法研究综述)
  • English title: A Survey to Text Summarization: Popular Datasets and Methods
  • Authors: HOU Shengluan (侯圣峦); ZHANG Shuhan (张书涵); FEI Chaoqun (费超群)
  • Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Keywords: text summarization; natural language processing; machine learning; artificial intelligence
  • Journal: Journal of Chinese Information Processing (中文信息学报), journal code MESS
  • Publication date: 2019-05-15
  • Year: 2019; Volume: 33; Issue: 05; Pages: 6-21 (16 pages)
  • Funding: National Key Research and Development Program of China (2016YFB1000902); National Natural Science Foundation of China (61232015, 21472412, 61621003)
  • Language: Chinese
  • CN: 11-2325/N
  • Record number: MESS201905001
Abstract
Text summarization has become an essential means of acquiring knowledge from the massive amount of text on the Internet. Existing methods are trained and evaluated on specific datasets, including public benchmarks and datasets built by the authors themselves. Previous surveys have summarized existing methods in detail, but they focus almost entirely on the methods and rarely describe the datasets. Starting from the datasets instead, this survey reviews the datasets commonly used for text summarization together with the classical and most recent methods evaluated on them. Public datasets are described in terms of data source, language, and means of access; self-built datasets are summarized in terms of scale and the way they were collected and annotated. For each public dataset, a formal definition of the corresponding text summarization problem is given, and the experimental results of classical and state-of-the-art methods on that dataset are analyzed. Finally, the current state of commonly used datasets and methods is summarized and remaining open problems are pointed out.
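The method comparisons that the survey analyzes rely on automatic evaluation against reference summaries, for which the reference list cites ROUGE [74]. As a minimal illustrative sketch (not code from the paper; the function names and toy sentences are our own), the Python snippet below computes ROUGE-N recall, i.e. the fraction of reference n-grams that also appear in a candidate summary:

    # Minimal ROUGE-N recall sketch (illustrative only; results in the surveyed
    # papers are computed with standard ROUGE implementations).
    from collections import Counter

    def ngrams(tokens, n):
        """Return a Counter (multiset) of n-grams over a list of tokens."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate, reference, n=1):
        """ROUGE-N recall: clipped n-gram overlap divided by reference n-gram count."""
        cand = ngrams(candidate.split(), n)
        ref = ngrams(reference.split(), n)
        if not ref:
            return 0.0
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values())

    if __name__ == "__main__":
        reference = "the cat sat on the mat"
        candidate = "the cat was on the mat"
        print(round(rouge_n_recall(candidate, reference, n=1), 3))  # 0.833
        print(round(rouge_n_recall(candidate, reference, n=2), 3))  # 0.6

In practice, published scores are reported with Lin's original ROUGE toolkit or a faithful reimplementation rather than an ad-hoc script like this one.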
References
[1]Erkan G,Radev D R.Lexrank:Graph-based lexical centrality as salience in text summarization[J].Journal of Artificial Intelligence Research,2004,22:457-479.
    [2]Gambhir M,Gupta V.Recent automatic text summarization techniques:a survey[J].Artificial Intelligence Review,2017,47(1):1-66.
    [3]Nenkova A,McKeown K.Automatic summarization[J].Foundations and Trends in Information Retrieval,2011,5(2-3):103-233.
    [4]Nenkova A,McKeown K.A survey of text summarization techniques[M].Mining Text Data.Boston:Springer,2012:43-76.
    [5]Baralis E,Cagliero L,Fiori A,et al.Mwi-sum:A multilingual summarizer based on frequent weighted itemsets[J].ACM Transactions on Information Systems(TOIS),2015,34(1):5.
    [6]Cheng J,Lapata M.Neural summarization by extracting sentences and words[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).Association for Computational Linguistics,2016:484-494.
    [7]Mihalcea R,Tarau P.Textrank:Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.2004.
    [8]Page L,Brin S,Motwani R,et al.The PageRank citation ranking:Bringing order to the web[R].Stanford InfoLab,1999.
    [9]Baralis E,Cagliero L,Mahoto N,et al.GRAPHSUM:Discovering correlations among multiple terms for graph-based summarization[J].Information Sciences,2013,249:96-109.
    [10]Gillick D,Favre B.A scalable global model for summarization[C]//Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing.Association for Computational Linguistics,2009:10-18.
    [11]Fattah M A.A hybrid machine learning model for multi-document summarization[J].Applied Intelligence,2014,40(4):592-600.
    [12]Rush A M,Chopra S,Weston J.A neural attention model for abstractive sentence summarization[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.2015:379-389.
    [13]Chopra S,Auli M,Rush A M.Abstractive sentence summarization with attentive recurrent neural networks[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2016:93-98.
    [14]Nallapati R,Zhou B,dos Santos C,et al.Abstractive text summarization using sequence-to-sequence RNNs and beyond[C]//Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning.2016:280-290.
    [15]Zhou Q,Yang N,Wei F,et al.Selective Encoding for Abstractive Sentence Summarization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2017:1095-1104.
    [16]Cao Z,Wei F,Li W,et al.Faithful to the original:Fact aware neural abstractive summarization[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.2018.
    [17]Manning C,Surdeanu M,Bauer J,et al.The Stanford CoreNLP natural language processing toolkit[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics:System Demonstrations.2014:55-60.
    [18]Cao Z,Li W,Li S,et al.Retrieve,rerank and rewrite:Soft template based neural summarization[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:152-161.
    [19]Hermann K M,Kocisky T,Grefenstette E,et al.Teaching machines to read and comprehend[C]//Proceedings of the 29th Annual Conference on Neural Information Processing Systems.2015:1693-1701.
    [20]See A,Liu P J,Manning C D.Get to the point:Summarization with pointer-generator networks[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2017:1073-1083.
    [21]Durrett G,Berg-Kirkpatrick T,Klein D.Learning-based single-document summarization with compression and anaphoricity constraints[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.2016:1998-2008.
    [22]Ma S,Sun X,Lin J,et al.A hierarchical End-to-End model for jointly improving text summarization and sentiment classification[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence,2018.
    [23]Hu B,Chen Q,Zhu F.LCSTS:A large scale Chinese short text summarization dataset[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.2015:1967-1972.
    [24]Ma S,Sun X,Xu J,et al.Improving semantic relevance for Sequence-to-Sequence learning of Chinese social media text summarization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.2017:635-640.
    [25]Mo P,Hu P,Huang X,et al.Hypergraph-based joint extraction of text summaries and keywords[J].Journal of Chinese Information Processing,2015,29(06):135-140.(in Chinese)
    [26]Xu H,Cao Y,Shang Y,et al.Adversarial reinforcement learning for Chinese text summarization[C]//Proceedings of the 18th International Conference on Computational Science.2018:519-532.
    [27]Ko Y,Seo J.An effective sentence-extraction technique using contextual information and statistical approaches for text summarization[J].Pattern Recognition Letters,2008,29(9):1366-1371.
    [28]Hu M,Sun A,Lim E P.Comments-oriented document summarization:understanding documents with readers' feedback[C]//Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2008:291-298.
    [29]Lin L,Wang Z,Li S,et al.PageRank-based sentiment summarization for Chinese multi-document text[J].Journal of Chinese Information Processing,2014,28(2):85-90.(in Chinese)
    [30]Barzilay R,Elhadad M.Using lexical chains for text summarization[J].Advances in automatic text summarization,1999:111-121.
    [31]Chen Y,Wang X,Guan Y.Automatic text summarization based on lexical chains[C]//Proceedings of the 1st International Conference on Natural Computation.Springer,2005:947-951.
    [32]Yu L,Ma J,Ren F,et al.Automatic text summarization based on lexical chains and structural features[C]//Proceedings of the 8th ACIS International Conference on Software Engineering,Artificial Intelligence,Networking,and Parallel/Distributed Computing,2007,2:574-578.
    [33]Wu X,Xie F,Wu G,et al.PNFS:personalized web news filtering and summarization[J].International Journal on Artificial Intelligence Tools,2013,22(05):1360007.
    [34]Ercan G,Cicekli I.Using lexical chains for keyword extraction[J].Information Processing & Management,2007,43(6):1705-1714.
    [35]Hou S,Huang Y,Fei C,et al.Holographic Lexical Chain and Its Application in Chinese Text Summarization[C]//Proceedings of the 2nd Asia-Pacific Web(APWeb)and Web-Age Information Management(WAIM)Joint Conference on Web and Big Data.Springer,2017:266-281.
    [36]Wang J,Wu G,Zhou Y,et al.A discourse-structure-guided automatic summarization method for Chinese Web documents[J].Journal of Computer Research and Development,2003,3:398-405.(in Chinese)
    [37]Hu P,He T,Ji D.Chinese text summarization based on thematic area detection[C]//Proceedings of the ACL-04 Workshop:Text Summarization Branches Out Text Summarization Branches Out,2004:112-119.
    [38]Baumel T,Cohen R,Elhadad M.Query-chain focused summarization[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2014:913-922.
    [39]Blei D M.Probabilistic topic models[J].Communications of the ACM,2012,55(4):77-84.
    [40]Pang C,Yin C.Classification-based Chinese text summarization method[J].Computer Science,2018,45(01):144-147,178.(in Chinese)
    [41]Hsu W T,Lin C K,Lee M Y,et al.A unified model for extractive and abstractive summarization using inconsistency loss[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:132-141.
    [42]Jadhav A,Rajan V.Extractive summarization with SWAP-NET:Sentences and words from alternating pointer networks[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:142-151.
    [43]Lin J,Sun X,Ma S,et al.Global encoding for abstractive summarization[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:163-169.
    [44]Ma S,Sun X,Lin J,et al.Autoencoder as assistant supervisor:improving text representation for Chinese social media text summarization[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:725-731.
    [45]Zhou Q,Yang N,Wei F,et al.Neural document summarization by jointly learning to score and select sentences[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:654-663.
    [46]Wu Y,Hu B.Learning to extract coherent summary via deep reinforcement learning[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.2018.
    [47]Zhou Q,Yang N,Wei F,et al.Sequential copying networks[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.2018.
    [48]Liu L,Lu Y,Yang M,et al.Generative adversarial network for abstractive text summarization[C]//Proceedings of 32nd AAAI Conference on Artificial Intelligence.2018.
    [49]Singh A K,Gupta M,Varma V.Unity in Diversity:Learning distributed heterogeneous sentence representation for extractive summarization[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.2018.
    [50]Peyrard M,Eckle-Kohler J.Supervised learning of automatic pyramid for optimization-based multi-document summarization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.2017:1084-1094.
    [51]Hirao T,Nishino M,Nagata M.Oracle summaries of compressive summarization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.2017:275-280.
    [52]Nayeem M T,Chali Y.Extract with order for coherent multi-document summarization[C]//Proceedings of TextGraphs-11:the Workshop on Graph-based Methods for Natural Language Processing.2017:51-56.
    [53]Ghalandari D G.Revisiting the centroid-based method:A strong baseline for multi-document summarization[C]//Proceedings of the EMNLP 2017 Workshop on New Frontiers in Summarization.2017:85-90.
    [54]Nallapati R,Zhai F,Zhou B.SummaRuNNer:A recurrent neural network based sequence model for extractive summarization of documents[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence.2017:3075-3081.
    [55]Cao Z,Li W,Li S,et al.Improving Multi-document summarization via text classification[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence.2017:3053-3059.
    [56]Wan X,Yang J,Xiao J.Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction[C]//Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.2007:552-559.
    [57]Wang K,Liu T,Sui Z,et al.Affinity preserving random walk for multi-document summarization[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:210-220.
    [58]Li P,Lam W,Bing L,et al.Cascaded attention based unsupervised information distillation for compressive summarization[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:2081-2090.
    [59]Li P,Lam W,Bing L,et al.Deep recurrent generative decoder for abstractive text summarization[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:2091-2100.
    [60]Isonuma M,Fujino T,Mori J,et al.Extractive summarization using multi-task learning with document classification[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:2101-2110.
    [61]Parveen D,Mesgar M,Strube M.Generating coherent summaries of scientific articles using coherence patterns[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.2016:772-783.
    [62]Filippova K,Mieskes M,Nastase V,et al.Cascaded filtering for topic-driven multi-document summarization[C]//Proceedings of the 7th Document Understanding Conference.2007:26-27.
    [63]Kurisinkel L J,Zhang Y,Varma V.Abstractive Multi-document summarization by partial tree extraction,recombination and linearization[C]//Proceedings of the 8th International Joint Conference on Natural Language Processing.2017:812-821.
    [64]Chali Y,Tanvee M,Nayeem M T.Towards abstractive Multi-document summarization using submodular function-based framework,sentence compression and merging[C]//Proceedings of the 8th International Joint Conference on Natural Language Processing.2017:418-424.
    [65]Peyrard M,Eckle-Kohler J.A general optimization framework for Multi-document summarization using genetic algorithms and swarm intelligence[C]//Proceedings of the 26th International Conference on Computational Linguistics:Technical Papers.2016:247-257.
    [66]Wang X,Nishino M,Hirao T,et al.Exploring text links for coherent multi-document summarization[C]//Proceedings of the 26th International Conference on Computational Linguistics:Technical Papers.2016:213-223.
    [67]Li W,He L,Zhuge H.Abstractive news summarization based on event semantic link network[C]//Proceedings of the 26th International Conference on Computational Linguistics:Technical Papers.2016:236-246.
    [68]Wong K F,Wu M,Li W.Extractive summarization using supervised and semi-supervised learning[C]//Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1.Association for Computational Linguistics,2008:985-992.
    [69]Zhang R,Li W,Liu N,et al.Coherent narrative summarization with a cognitive model[J].Computer Speech & Language,2016,35:134-160.
    [70]Filatova E,Hatzivassiloglou V.Event-based extractive summarization[C]//Proceedings of Text Summarization Branches Out,2004.
    [71]Parveen D,Strube M.Multi-document summarization using bipartite graphs[C]//Proceedings of TextGraphs-9:the workshop on Graph-based Methods for Natural Language Processing.2014:15-24.
    [72]McDonald R.A study of global inference algorithms in multi-document summarization[C]//Proceedings of the 29th European Conference on Information Retrieval.Berlin:Springer,Heidelberg,2007:557-564.
    [73]Tang J,Yao L,Chen D.Multi-topic based query-oriented summarization[C]//Proceedings of the 2009 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2009:1148-1159.
    [74]Lin C Y.Rouge:A package for automatic evaluation of summaries[C]//Proceedings of the ACL-04 Workshop:Text Summarization Branches Out,2004.
    [75]Yang Y S,Zhang M,Chen W,et al.Adversarial Learning for Chinese NER from Crowd Annotations[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.2018.
    (1)http://duc.nist.gov/
    (2)http://www.nist.gov/tac/
    (1)https://catalog.ldc.upenn.edu/ldc2003t05
    (1)https://edition.cnn.com/
    (2)http://www.dailymail.co.uk/home/index.html
    (3)https://github.com/deepmind/rc-data
    (1)https://catalog.ldc.upenn.edu/LDC2008T19/
    (2)RST-DT is a manually annotated corpus of discourse structure trees, comprising 385 news articles from the Wall Street Journal (WSJ); the data are available at https://catalog.ldc.upenn.edu/LDC2002T07.
    (3)http://snap.stanford.edu/data/web-Amazon.html
    (4)http://icrc.hitsz.edu.cn/Article/show/139.html
    (5)http://weibo.com/
    (1)http://tcci.ccf.org.cn/conference/2015/pages/page05_evadata.html
    (2)http://tcci.ccf.org.cn/conference/2017/taskdata.php
    (3)http://tcci.ccf.org.cn/conference/2018/taskdata.php
    (1)http://blogs.discovermagazine.com/cosmicvariance#.WyyfqadLjIU/
    (2)https://blogs.msdn.microsoft.com/ie/
    (3)https://www.amazon.cn
    (4)http://news.163.com
    (5)http://ictclas.nlpir.org
    (1)http://www.sina.com.cn
    (2)http://www.ccw.com.cn
    (3)A retrieval database of research literature in medicine and the life sciences, https://www.ncbi.nlm.nih.gov/pmc/
    (4)https://en.wikipedia.org/wiki/Wiki
    (5)https://www.webmd.com
    (6)http://www.chinanews.com
    (1)A Japanese corpus of financial reports from listed Japanese companies; the Nikkei site (https://www.nikkei.com) provides 3,911 summaries of these reports.
    (2)Contains 50 scientific articles, each accompanied by a human summary written by editors.
    (3)An English product review dataset crawled from www.epinions.com, covering reviews of 44 different products.
