Research on Models and Algorithms for Text Topic Mining Based on Word Co-occurrence
Abstract
With the development of information technology and the accelerating informatization of society, digital information has grown explosively, far beyond what humans can comprehend and summarize. Using computers to automatically discover valuable knowledge and information from large volumes of text is an effective way to address this problem. Building on data mining theory, this thesis focuses on models and algorithms for text topic mining. The main research contents are as follows:
     First, the representation model of text is studied. By analyzing the word co-occurrence phenomenon, the correlation between word co-occurrence and topics is established theoretically, and a document representation model based on co-occurring term combinations, the Co-occurrence Term Vector Space Model (CTVSM), is proposed. Association rule mining is used to extract the set of co-occurring term combinations from the document collection, on which the CTVSM document vector and a document similarity measure are defined.
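
     As an illustration of this step, the Python sketch below mines frequent co-occurring term pairs with a plain support threshold standing in for full association rule mining, represents each document as a binary vector over those pairs, and measures document similarity by cosine. The tokenization, the support threshold, and the binary pair weights are illustrative assumptions, not the thesis's exact formulation.

    # Minimal CTVSM-style sketch (assumed details marked in the lead-in).
    from itertools import combinations
    from collections import Counter
    from math import sqrt

    def mine_cooccurring_pairs(docs, min_support=2):
        """Return term pairs that co-occur in at least min_support documents."""
        pair_counts = Counter()
        for doc in docs:
            terms = sorted(set(doc.split()))
            pair_counts.update(combinations(terms, 2))
        return [p for p, c in pair_counts.items() if c >= min_support]

    def doc_vector(doc, pairs):
        """Represent a document as a binary vector over the mined term pairs."""
        terms = set(doc.split())
        return [1.0 if a in terms and b in terms else 0.0 for a, b in pairs]

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    docs = ["data mining text cluster", "text cluster topic", "stock market price"]
    pairs = mine_cooccurring_pairs(docs)
    v1, v2 = doc_vector(docs[0], pairs), doc_vector(docs[1], pairs)
    print(cosine(v1, v2))
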
     Second, document clustering is studied on the basis of CTVSM, and a hierarchical document clustering method based on CTVSM is proposed. Documents and document clusters are represented as vectors of co-occurring term combinations, and a similarity measure between clusters is designed from the document similarity measure. To quickly determine the optimal partition level during hierarchical clustering, the centroid of a document cluster is defined and a criterion for the optimal partition level based on clustering entropy is proposed. Experiments show that document clustering based on CTVSM achieves good results. A compact sketch of this step is given after this paragraph.
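
     In the sketch below, documents are merged agglomeratively by centroid similarity, every partition level is scored, and the level with the best score is kept. The score used here, within-cluster cohesion minus between-centroid similarity, is only a stand-in for the clustering-entropy criterion defined in the thesis; the toy vectors are illustrative.

    # Hierarchical clustering over document vectors with level selection.
    import numpy as np

    def cosine_sim(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / n if n else 0.0

    def merge_levels(vectors):
        """Agglomerative merging by centroid similarity; yields each partition level."""
        clusters = [[i] for i in range(len(vectors))]
        while len(clusters) > 1:
            best, pair = -1.0, (0, 1)
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    ca = np.mean(vectors[clusters[a]], axis=0)
                    cb = np.mean(vectors[clusters[b]], axis=0)
                    s = cosine_sim(ca, cb)
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
            yield [list(c) for c in clusters]

    def level_score(partition, vectors):
        """Cohesion (member-to-centroid similarity) minus centroid separation."""
        centroids = [np.mean(vectors[c], axis=0) for c in partition]
        cohesion = np.mean([cosine_sim(vectors[i], centroids[k])
                            for k, c in enumerate(partition) for i in c])
        if len(centroids) < 2:
            return cohesion
        sep = np.mean([cosine_sim(centroids[a], centroids[b])
                       for a in range(len(centroids))
                       for b in range(a + 1, len(centroids))])
        return cohesion - sep

    vectors = np.array([[1.0, 1.0, 0.0], [1.0, 0.8, 0.0],
                        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
    best = max(merge_levels(vectors), key=lambda p: level_score(p, vectors))
    print(best)  # prints the two-cluster level: [[0, 1], [2, 3]]
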
     Third, word clustering in the document space is studied. From the extracted set of co-occurring term combinations, a word co-occurrence graph over the document collection is defined: words are mapped to vertices and the co-occurrence degree between two words is mapped to the edge connecting them, so that word clustering becomes the problem of partitioning the graph into vertex clusters. A word clustering method based on graph density is proposed: during clustering, a word joins a cluster only if its addition significantly improves that cluster's graph density, until every word has been assigned to a cluster. Experimental results show that the proposed method clearly outperforms conventional methods in both computational cost (measured by running time) and clustering quality. A greedy sketch of the procedure follows this paragraph.
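
     In the sketch below, words are vertices, edge weights are co-occurrence degrees, cluster density is internal edge weight per vertex pair, and a word joins the cluster whose density it best sustains, otherwise it starts a new cluster. The acceptance rule, its thresholds, and the toy graph are illustrative assumptions rather than the thesis's exact criterion.

    # Greedy graph-density word clustering sketch.
    from itertools import combinations

    def density(cluster, weights):
        """Total internal edge weight divided by the number of vertex pairs."""
        if len(cluster) < 2:
            return 0.0
        internal = sum(weights.get(frozenset(p), 0.0) for p in combinations(cluster, 2))
        return internal / (len(cluster) * (len(cluster) - 1) / 2)

    def cluster_words(words, weights, min_density=0.5, keep=0.7):
        """Assign each word to the cluster whose density it best sustains."""
        clusters = []
        for w in words:
            best, best_d = None, 0.0
            for c in clusters:
                d_old, d_new = density(c, weights), density(c | {w}, weights)
                # Accept only if the cluster stays dense enough after adding w.
                if d_new >= max(min_density, keep * d_old) and d_new > best_d:
                    best, best_d = c, d_new
            if best is not None:
                best.add(w)
            else:
                clusters.append({w})
        return clusters

    # Toy co-occurrence graph: edge weight = how often two words co-occur.
    weights = {frozenset(p): w for p, w in [
        (("data", "mining"), 3), (("data", "cluster"), 2), (("mining", "cluster"), 2),
        (("stock", "price"), 3), (("stock", "market"), 2), (("price", "market"), 2),
        (("mining", "stock"), 1),
    ]}
    words = ["data", "mining", "cluster", "stock", "price", "market"]
    print(cluster_words(words, weights))
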
     Finally, the application of topics mined from the document collection to information recommendation and retrieval is studied. Taking topic term extraction as an example, topic information in the document space is used to improve extraction quality. By predicting a document's topic, the topic domain it belongs to is identified, which in turn determines the domain vocabulary relevant to extracting its topic terms; term weights in the document are then adjusted so that domain terms receive higher weights, ensuring the topical precision of the extracted terms. Experiments show that the algorithm improves the quality of topic term extraction, with especially marked gains on short texts, where term-frequency weights provide little discrimination. The re-weighting idea is sketched after this paragraph.
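
     In the sketch below, a document's topic domain is predicted from simple term overlap with per-topic domain vocabularies, term-frequency weights of domain terms are boosted, and the top-weighted terms are returned as subject words. The boost factor, the overlap-based topic prediction, and the toy vocabularies are assumptions made for illustration, not the thesis's statistical topic model.

    # Topic-aware term re-weighting for subject word extraction (illustrative).
    from collections import Counter

    topic_vocab = {
        "computing": {"data", "mining", "cluster", "algorithm", "text"},
        "finance": {"stock", "market", "price", "trade"},
    }

    def predict_topic(terms):
        """Pick the topic whose domain vocabulary overlaps the document most."""
        return max(topic_vocab, key=lambda t: len(topic_vocab[t] & set(terms)))

    def extract_subject_words(doc, k=3, boost=2.0):
        terms = doc.split()
        tf = Counter(terms)
        domain = topic_vocab[predict_topic(terms)]
        # Boost the weight of terms that belong to the predicted topic domain.
        weights = {t: c * (boost if t in domain else 1.0) for t, c in tf.items()}
        return sorted(weights, key=weights.get, reverse=True)[:k]

    doc = "the text mining algorithm groups text by cluster structure"
    print(extract_subject_words(doc))  # e.g. ['text', 'mining', 'algorithm']
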
Information has grown phenomenally over the past decades, far beyond what human beings can digest unaided. Extracting information automatically from text has therefore become a key problem for the information research community. The research in this thesis applies statistical machine learning methods that exploit word co-occurrence, focusing on text mining models and algorithms. The main contents are as follows:
     First, a novel document model built from co-occurring terms, the Co-occurrence Term Vector Space Model (CTVSM), is presented. Association rule mining is employed to extract the co-occurring terms in the document space; the document model and a similarity measure between two documents are then defined on these terms. Experimental results show that, compared with the Euclidean distance in the standard VSM, CTVSM places dissimilar documents farther apart and similar documents closer together.
     Second, a novel document clustering algorithm is proposed on the basis of CTVSM. Documents and clusters are represented in CTVSM, and a similarity measure between clusters is derived from the document similarity measure. To decide the optimal number of clusters, clustering gain is advanced as a measure of clustering optimality. Experimental evidence shows that it produces intuitively reasonable clustering configurations for document clustering.
     Third, CTVSM is used to cluster large-scale term sets in the document space. A co-occurrence term graph is defined, in which words are mapped to vertices and co-occurrence relations between words are mapped to edges. A word clustering algorithm is proposed on this graph: a word is joined to a cluster according to the change its addition causes in the cluster's density. Experiments show that this algorithm outperforms conventional word clustering methods in both quality and efficiency.
     Finally, an application of the topic structure extracted from the document space is proposed. A subject word extraction algorithm is improved by exploiting this structure: the topics of a document are identified by estimating a statistical topic model, the document's topic term fields are determined accordingly, and term weights are adjusted with respect to these fields. Experimental results indicate that the proposed method significantly outperforms combinations of existing techniques.