基于信息论的特征加权和主题驱动协同聚类算法研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

NSTL服务站

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Information Theoretic Feature Weighting and Topic-Driven Co-Clustering for Text Dataset
作者：吴彪
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：协同聚类 ; 信息论 ; 主题驱动 ; 特征加权 ; 维基百科
英文关键词：co-clustering ; information theory ; topic-driven ; feature weighting ; wikipedia
学位年度：2008
导师：叶允明
学科代码：081203
学位授予单位：哈尔滨工业大学
论文提交日期：2008-12-01

摘要

文本数据常用文档-词二维共现矩阵表示,大多数传统聚类算法属于单向聚类,即要么是对样本进行聚类,要么是对特征进行聚类,没有考虑到样本和特征之间自然存在的相互关系。尤其对高维、稀疏、带噪声数据,传统单向聚类方法在精度上很难满足实际需求。
基于信息论的协同聚类算法从信息论的角度捕获了行列之间自然关系,同时从行向和列向进行聚类,相互协助、相互约束,对高维、稀疏数据也能起到高效聚类的效果。但该方法也存在一些不足,如没有考虑特征的重要性,另外该方法是一个无监督的学习过程,聚类后簇的可解释性不强,在聚类精度上也有提高的空间等。本文在基于信息论的协同聚类算法以及参考已有研究方法的基础上,做了两点探索性改进,即在原有无监督聚类的基础上,引入了主题知识,并对特征进行了加权处理。提出了无监督的特征加权的协同聚类算法和半监督的主题驱动的协同聚类算法两个改进算法。特征加权协同聚类算法用互信息计算特征权值,突出有效特征的重要性,在聚类精度和运行时间上得到了提高。在主题驱动的协同聚类算法中,首先建立了一个基于维基百科和开放分类目录的主题语料库,该语料库中定义了每个主题的描述和层次;然后通过协同聚类的方法将主题知识传播到文本聚类过程中,我们的目标是将相同主题下的文档聚在一起。通过实验证明,在聚类精度上我们提出的两个改进算法能得了更好的聚类结果。
Text samples usually are represented by document-word co-occurrence tables, and most traditional clustering algorithms have no consideration of the nature relationship between samples and features, which are single dimension clustering to cluster samples or features independently. Most of them are hardly to reach the demand of real application, especially when meeting high dimension, sparse and noise data.
Information theoretic co-clustering captures the nature relationship between rows and columns from mutual information aspect. It is simultaneously clustering rows and columns, and at all stages, the row cluster prototypes incorporate column clustering information, and vice versa, it has a high performance on clustering precision. But it also has some defects, for example, it has no consideration of the importance of features, and it is an unsupervised method in nature, the explanation of the clustering result is not so good and the precision may be improved when some prior knowledge are available, etc. In this thesis, we present two exploratory improvement methods based on Information Theoretic Co-clustering and other former researches, one is an unsupervised clustering method called Feature Weighting Information Theoretic Co-clustering (FWITCC), and the other is a semi-supervised clustering method called Information Topic-driven Theoretic Co-clustering for Text Dataset (TDITCC). We build a topic thesaurus repository as prior knowledge based on Wikipedia and Open Directory Project, which defines the hierarchy and description of each topic, and then use co-clustering methodology as a bridge to propagate the topic knowledge from topic repository to text dataset. Our general goal is to organize each document to its corresponding topic. The experimental analyses show that our methodology can produce high quality clustering results.

引文

1. N. Grira, M. Crucianu, and N. Boujemaa. Unsupervised and semi-supervised clustering: a brief survey. In A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence. 2005.
    2. B. Scholkopf and A. Smola, Learning with Kernels. MIT Press: page 352. 2002.
    3. K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proceedings of 18th International Conference on Machine Learning (ICML-2001): pages 577-584, 2001.
    4. Zhengdong Lu, Todd K. Leen. Semi-supervised Clustering with Pairwise Constraints: A Discriminative Approach. JMLR W&CP: Volume 2: AISTATS 2007. 2007.
    5. S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised Clustering by Seeding, Proceedings of the 19th International Conference on Machine Learning (ICML-2002): pages 19-26, Sydney. 2002.
    6. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. 1998.
    7. D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University. 2003.
    8. M. Bilenkoin , S. Basu and R. J. Mooney. Integrating Constraints and Metric Learning in Semi-Supervised Clustering. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada. 2004.
    9. X. Ji, W. Xu and S. Zhu. Document Clustering with Prior Knowledge. SIGIR’06. 2006
    10. A Hotho, A Maedche and S Staab. Ontology-Based document clustering. In Proceedings of the IJCAI-2001. 2001.
    11. A Hotho, S Staab and .G. Stumme. Wordnet improves text document clustering. In Proceedings of the Semantic Web Workshop at SIGIR’03. 2003.
    12. Zhang, X., Jing, L., Hu X., Ng M. and Zhou, X. A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering. 2007.
    13. WordNet: http://wordnet.princeton.edu/
    14. HowNet: http://www.keenage.com/
    15. A.dryan B, Schuh R. Gene-Ontology-based clustering of gene expression data. Bioinformatics. 2004 Nov 1;20(16): pages 2851-2852. 2004.
    16. Ying Zhao, George Karypis. Topic-driven Clustering for Document Dataset. In Proceeding SIAM Data Mining Conference. 2005.
    17. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic Co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003.
    18. A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. Journal of Machine Learning Research: pages 1919-1986. 2007.
    19. Hartigan JA. Direct clustering of a data matrix. Journal of the American Statistical Association: pages 67:123. 1972.
    20. ChengY, Church GM. Biclustering of expression data. In Proceedings of the eighth international conference on intelligent systems for molecular biology: pages 93–103. 2000.
    21. Bryan K, Cunningham P, BolshakovaN. Biclustering of expression data using simulated annealing. In: Proceedings of the 18th IEEE symposium on computer-based medical systems: pages 383–388. 2005.
    22. Yang J, Wang W, Wang H, Yu P. Clusters: capturing subspace correlation in a large data set. In: Proceedings of the 18th IEEE international conference on data engineering: pages 517–528. 2002.
    23. Yang J,WangW,Wang H,Yu P. Enhanced biclustering on expression data. In Proceedings of the third IEEE conference on bioinformatics and bioengineering. pages: 321–327. 2003.
    24. Banerjee A, Dhillon IS, Ghosh J, Merugu S, Modha DS, Generalized maximum entropy approach to Bregman co-clustering and matrix approximations. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining: pages 509–14. 2004.
    25. K. Kummamuru, A. Dhawale, and R. Krishnapuram. Fuzzy coclustering of documents and keywords. In Twelfth IEEE International Conference on Fuzzy Systems, vol. 2: pages 772–777. 2003.
    26. W. C. Tjhi and L. Chen. A partitioning-based algorithm to fuzzy cocluster documents and words. Pattern Recognition Letters, vol. 27: pages 151–159. 2006.
    27. M. M. Shafiei , E. E. Milios. Model-based Overlapping Co-Clustering. In Proceedings of the Fourth Workshop on Text Mining. 2005.
    28. W. C. Tjhi, Lihui Chen. Robust Fuzzy Co-clustering Algorithm. In Proceedings of ICICS 2007. 2007.
    29. S. C.Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1: pages 24–25. 2004.
    30. S.Busygina, O. Prokopyevb, P. M. Pardalos. Biclustering in data mining. Computers & Operations Research 35 (2008): pages 2964–2987. 2008.
    31. Zhao Shi-Qi, Liu Ting, Li Sheng. A Topical Document Clustering Method. Journal of Chinese Information Processing. Vol. 21 , No. 2 Mar. 2007.
    32. Jiangtao Qiu Changjie Tang. Topic Oriented Semi-supervised Document Clustering. Proceedings of SIGMOD2007 Ph.D. Workshop on Innovative Database Research 2007(IDAR2007), Beijing, China. 2007.
    33. Wanyuan Dai, GuiRong Xue, Qiang Yang and Yong Yu. Co-clustering based Classification for Out-of-domain Documents. KDD 2007. 2007
    34. Rui Xu and Donald Wunsch. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, VOL.16, NO.3. 2005.
    35. I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: pages 269–274. 2001.
    36. H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum-squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM International Conference on Data Mining: pages 114–125. 2004.
    37. D. Freitag. Trained named entity recognition using distributional clusters. EMNLP 2004: pages 262–269. 2004.
    38. R. Rohwer and D. Freitag. Towards full automation of lexicon construction. HLT-NAACL 2004, Workshop on Computational Lexical Semantics: pages 9–16. 2004.
    39. G. Qiu. Image and feature co-clustering. In Proceedings of the International Conference on Pattern Recognition: pages 991–994. 2004.
    40. R. Cai, L. Lu, and L. Cai. Unsupervised auditory scene categorization via key audio effects and information-theoretic co-clustering. In Proceedings of the IEEEInternational Conference on Acoustics, Speech and Signal Processing: pages 1073–1076. 2005.
    41. T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems: pages 89–115. 2004.
    42. L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics: pages 200–217. 1967.
    43. P Berkhin. Survey of clustering data mining techniques. Grouping Multidimensional Data, Springer Press: page 59. 2002.
    44. L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensinal data: A review. ACM SIGKDD Explorations: pages 90–105. 2004.
    45. M. Deodhar Joydeep Ghosh.A Framework for Simultaneous Co-clustering and Learning from Complex Data. IDEAL2007. 2007.
    46. Saeed Hassanpour. Computational Complexity of Bi-clustering. A Master of Mathematics thesis presented to the University of Waterloo. 2007.
    47. Feng Pan, Xiang Zhang, Wei Wang. A General Framework for Fast Co-clustering on Large Datasets Using Matrix Decomposition. ICDE 2008. 2008.
    48. Hichem Frigui and Raghu Krishnapuram. A robust algorithm for automatic extraction of an unknown number of clusters from noisy data. Pattern Recognition Letters Volume 17, Issue 12: pages 1223-1232. 1996.
    49. JZ Huang, MK Ng, H Rong, Z Li. Automated Variable Weighting in k-Means Type Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol.27 No.5: pages 657-668. 2005.
    50. Yang Yiming , Pedersen J O1 A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning. 1997.
    51. K.Lang. Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning: pages 273-280. 1995.
    52. E. Gabrilovich and S. Markovitch. Feature Generation for Text Categorization Using World Knowledge. In IJCAI’05. 2005.
    53. E. Gabrilovich and S. Markovitch. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. AAAI’06. 2006
    54. Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen. Enhancing Text Clustering by Leveraging Wikipedia Semantics. SIGIR’08 . 2008.
    55. M. de Buenaga Rodr?guez, J. M. G. Hidalgo, and B. D?az-Agudo. Using WordNet to complement training information in text categorization. In Recent Advances in Natural Language Processing, Vol. 189. 2000.
    56. D. M. P. Kushal Dave, Steve Lawrence. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In WWW’03. 2003
    57. Ure?na L′oez, M., & Hidalgo, J. M. G. Integrating linguistic resources in TC through WSD. Computers and the Humanities, Vol.35 No.2: pages 215–230. 2001.
    58. ODP:http://www.domz.org
    59. Wikipedia:http://www.wikipedia.org
    60. Giles,J. Internet Encyclopaedias:Go Head to Head. Nature: pages 900-901. 2005.
    61. Jiawei Han, Micheline. Data Mining Concepts and Techniques. MIT Press. page223. 2005.
    62.张杰. Wiki百科全书中非父子语义关系抽取研究.上海交通大学,硕士学位论文.页码12-13. 2007.
    63. I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. Data Mining for Scientific and Engineering Applications: pages 357–381. Kluwer Academic Publishers, 2001.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700