Research and Implementation of Key Technologies for Chinese Automatic Abstracting

Original English title: Research and Implementation of Chinese Automatic Abstracting
Author: 乔小斐
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Automatic Abstracting ; Text Representation ; Sentence Evaluation ; MMR
Degree year: 2010
Advisor: 陈平
Discipline code: 081202
Degree-granting institution: 西安电子科技大学 (Xidian University)
Thesis submission date: 2010-01-01

Abstract

Chinese automatic abstracting has developed rapidly over the last 20 years, yet existing techniques still produce abstracts that cover the source text incompletely and contain redundant information. This thesis addresses both problems.
Building on the existing "statistical full-segmentation Chinese word segmentation system", the thesis first proposes a reverse matching algorithm for the longest combination pattern, based on a general-purpose segmentation dictionary, to correct the overly fine segmentation granularity of such dictionaries. Features are then computed and filtered on the segmented text, so that each document is represented by its feature words. Next, drawing on earlier research, a sentence weighting function based on formal text features is designed and applied in the sentence segmentation process. Following the idea of Maximal Marginal Relevance (MMR), an MMR formula adapted to automatic abstracting is proposed to reduce redundancy in the abstract. This formula serves as the sentence evaluation criterion, and a new algorithm for selecting abstract sentences is built on it. Finally, the thesis describes the design and implementation of a Chinese automatic abstracting system. Experiments indicate that the abstracts extracted by this system have good completeness and low redundancy.
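To illustrate the MMR idea named in the abstract, here is a minimal Python sketch of greedy sentence selection using the standard MMR criterion of Carbonell and Goldstein. It is not the thesis's own adapted formula, which this record does not give. The sketch assumes that sentences are already segmented into feature words, that similarity is cosine similarity over term counts, and that the whole document stands in for the query. The names `cosine`, `mmr_select`, and `lam` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, k, lam=0.5):
    """Greedily pick up to k sentence indices by the standard MMR criterion.

    sentences: list of token lists (assumed to be pre-segmented feature words).
    lam: assumed trade-off weight; 1.0 means relevance only, lower values
         penalize similarity to already selected sentences more strongly.
    """
    vectors = [Counter(s) for s in sentences]
    # Assumption: the whole document plays the role of the query.
    doc = Counter(t for s in sentences for t in s)
    relevance = [cosine(v, doc) for v in vectors]

    selected = []
    candidates = set(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any sentence chosen so far.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        # Iterate in index order so ties go to the earlier sentence.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore original sentence order


if __name__ == "__main__":
    sents = [["a", "b"], ["a", "b"], ["c", "d"]]
    print(mmr_select(sents, 2, lam=0.5))
    print(mmr_select(sents, 2, lam=1.0))
```

With `lam=1.0` the selection reduces to ranking by relevance alone, so duplicate sentences can both be chosen. Lowering `lam` makes a sentence that repeats an already selected one lose out to a less relevant but novel sentence, which is the redundancy-reducing effect the abstract attributes to MMR.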

