Research on Clustering Algorithm for Web Document by Incorporating Distribution Information

Original title: 嵌入分布信息的Web文档聚类算法研究
Author: Sun Chunhong (孙春红)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Distribution information ; Clustering ; Web document mining ; Kernel function
Degree year: 2008
Advisor: Yang Ming (杨明)
Discipline code: 081203
Degree-granting institution: Nanjing Normal University
Thesis submission date: 2008-03-01

Abstract

With the rapid development of the Internet, Web information resources now cover every aspect of social life, and information overload has become an increasingly prominent problem, which has driven the rapid development of Web mining techniques. From the perspective of Web document clustering, this thesis studies three topics: the representation of document distribution information and the corresponding similarity measure, multi-view clustering, and the application of kernel theory to multi-view learning. The main contributions are as follows:
     1. A document similarity measure that incorporates distribution information. Most existing Web mining techniques are based on the traditional vector space model (VSM). Although this achieves reasonable results, it ignores other useful information in Web documents. To address this, the thesis introduces the distribution information of words within a document and proposes a new similarity measure that extends the traditional one. Experimental results show that the new measure improves clustering performance over the traditional similarity measure.
     2. A multi-view learning algorithm. Building on the traditional multi-view K-means algorithm, the method uses both the classical and the new similarity measures and applies different learning algorithms to different views, so that it better reflects the distributional features of the documents in a data set. Experimental results show that the proposed algorithm achieves good results.
     3. A kernel-based multi-view clustering algorithm. Different kernel functions induce different distances in the original space. Extensive experiments with the polynomial kernel and the Gaussian kernel show that kernelization clearly improves the performance of the multi-view clustering algorithm.
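
The statement in item 3 that different kernel functions induce different distances can be illustrated with the standard kernel-induced distance, d(x, y)^2 = K(x, x) - 2K(x, y) + K(y, y). The following is a minimal sketch of that general identity, not the thesis's implementation: the function names, the kernel parameters (`degree`, `coef0`, `sigma`), and the two toy term-frequency vectors are all assumptions chosen for illustration.

```python
import math


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel K(x, y) = (<x, y> + coef0) ** degree (parameters are illustrative)."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) (sigma is illustrative)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_distance(x, y, kernel):
    """Distance in the kernel's feature space, computed without an explicit feature map."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Clamp at zero to guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(sq, 0.0))


# Hypothetical term-frequency vectors for two documents.
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 1.0, 2.0]

d_poly = kernel_distance(doc_a, doc_b, polynomial_kernel)
d_gauss = kernel_distance(doc_a, doc_b, gaussian_kernel)
print(d_poly, d_gauss)
```

The same pair of documents gets a different distance under each kernel, so a distance-based clustering step such as the assignment step of K-means can be kernelized by substituting `kernel_distance` for the Euclidean distance.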

