Study on Methods and Their Applications of Text Automatic Summarization and Information Extraction

Original title: 文本自动摘要和信息抽取方法及其应用研究
Author: Liu Na (刘娜)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Text Mining; Automatic Summarization; Information Extraction; Spectral Clustering; Topic Model
Degree year: 2012
Supervisor: Lu Mingyu (鲁明羽)
Discipline code: 081203
Degree-granting institution: Dalian Maritime University
Thesis submission date: 2012-05-01

Abstract

With the continuing surge of text data, especially web pages, quickly and automatically extracting the main or important information contained in massive text collections has become an active research problem, and it has driven the rapid development of text-oriented information extraction technology. Text summarization extracts the discourse structure and main information of a text and automatically generates a summary of a single document or of multiple documents, so it can be regarded as one kind of information extraction. Information extraction in the usual sense, by contrast, extracts the specific important information that users need from text.
     This thesis targets evidence-based medicine (EBM) web pages, together with other types of training text, and focuses on methods for automatic text summarization and information extraction. To address unsatisfactory information extraction results, unclear topic segmentation, paragraph clustering algorithms that are sensitive to initial values, and the need to set the number of clusters manually, it proposes a series of new methods and models.
     (1) An information extraction method that combines paragraph features with a hidden Markov model (HMM) is proposed. Unlike other information extraction methods, it takes the paragraph rather than the word as the unit of analysis. After preprocessing, the content of a web page is stored as a text sequence in units of paragraphs. Each paragraph is converted into a specific token string, and these tokens serve as the observation symbols of the HMM. Experiments show that, in both precision and recall, extraction with paragraphs as the observation sequence outperforms extraction with words as the observation sequence.
     (2) Documents are segmented by topic in preparation for summary generation. Paragraphs are represented in the vector space model, mutual information is used to measure the association between consecutive paragraphs, and positions where the association is weak are taken as segment boundaries. Because manually defined parameters can adversely affect the segmentation, a genetic algorithm is used to optimize the threshold parameters that arise in the segmentation process. Experiments show that combining mutual information with a genetic algorithm achieves good segmentation precision.
     (3) The basic steps of spectral co-clustering of words and documents are analyzed to identify the root cause of its sensitivity to initial values, and a spectral co-clustering method based on fuzzy K-harmonic means is proposed. The method has two parts. First, the Laplacian matrix used in spectral clustering is processed from the viewpoint of matrix similarity so that it satisfies the condition for insensitivity to initial values. Second, fuzziness is introduced: the fuzzy K-harmonic means algorithm, which uses fuzzy weighted distances between each data point and the cluster centers, replaces K-means so that the clustering result does not depend on the initial values. Experiments show that the proposed method is insensitive to initialization and also improves the clustering results to some extent.
     (4) A morphological method is used to determine the number of clusters, and the spectral co-clustering method for words and documents is modified accordingly. Determining the number of clusters takes three steps. First, the matrix produced by spectral co-clustering is converted into a VAT grayscale image. Second, the VAT image is filtered with image processing operations: grayscale morphology, image binarization, and distance transform. Third, a signal is built from the filtered VAT image and smoothed, and the number of clusters in the document set is determined from the number of major peaks and troughs of the smoothed signal. Experiments show that this method improves the clustering quality of spectral co-clustering of words and documents.
     (5) Building on the LDA topic model, a multi-document summarization algorithm based on topic fusion, Titled-LDA, is proposed. Because a document's title is a strong indicator for summary generation, separate topic models are built for the title and the body of each document, and the two models are then fused. During fusion, adaptive asymmetric learning based on the information entropy of the two forms (title and body) weights their topic distributions. The fused model relates title and body information appropriately, which helps improve summary quality. Experiments show that Titled-LDA achieves better performance than the other state-of-the-art algorithms it is compared against on the standard DUC datasets.
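The boundary-detection idea in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the mutual-information formula, so cosine similarity over term-frequency vectors is substituted here as the association score, and a fixed `threshold` stands in for the value the thesis tunes with a genetic algorithm. The function names and sample paragraphs are hypothetical.

```python
import math
from collections import Counter


def tf_vector(paragraph):
    """Bag-of-words term-frequency vector for one paragraph."""
    return Counter(paragraph.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def segment(paragraphs, threshold=0.2):
    """Return each index i where a boundary falls between paragraphs i and i+1.

    Assumption: cosine similarity replaces the thesis's mutual-information
    score, and `threshold` is fixed by hand rather than tuned by a GA.
    """
    vectors = [tf_vector(p) for p in paragraphs]
    scores = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    return [i for i, s in enumerate(scores) if s < threshold]


# Hypothetical four-paragraph document covering two topics.
paragraphs = [
    "aspirin reduces fever and pain",
    "aspirin dosage for fever and pain relief",
    "spectral clustering partitions a similarity graph",
    "the similarity graph drives spectral clustering",
]
boundaries = segment(paragraphs)
```

Here the only weakly associated pair is the second and third paragraphs, so the single boundary falls after index 1. Any other pairwise association score, including a mutual-information estimate, could be swapped in for `cosine` without changing `segment`.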

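The entropy-based weighting in item (5) can likewise be sketched under stated assumptions. The abstract does not give the Titled-LDA fusion formula, so the rule below is only a plausible stand-in: each of the two topic distributions (title and body) is weighted in proportion to the other's Shannon entropy, so the more peaked, lower-entropy distribution contributes more. The function names and example distributions are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def fuse(title_topics, body_topics):
    """Mix two topic distributions, weighting the lower-entropy one more.

    Assumption: this inverse-entropy weighting is illustrative only and is
    not taken from the thesis.
    """
    h_title, h_body = entropy(title_topics), entropy(body_topics)
    total = h_title + h_body
    # If both distributions are point masses, fall back to equal weights.
    w_title = 0.5 if total == 0 else h_body / total
    fused = [w_title * t + (1 - w_title) * b
             for t, b in zip(title_topics, body_topics)]
    return w_title, fused


# Hypothetical per-document topic distributions over three topics.
title_topics = [0.8, 0.1, 0.1]   # peaked: the title points at one topic
body_topics = [0.4, 0.3, 0.3]    # flatter: the body spreads over topics
w_title, fused = fuse(title_topics, body_topics)
```

Because the result is a convex combination of two probability distributions, `fused` is itself a valid distribution. In this example the title is the more peaked of the two, so it receives the larger weight.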
