基于图模型的聚类算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着互联网技术的高速发展和网络资源信息的日益丰富,海量互联网网页同时涵盖文本、图像、视频和音频等媒体数据以及多种语言并存,呈现出跨媒体的特性。如果缺乏一套有效的检索机制,从海量的跨媒体资源中搜寻信息无疑是大海捞针,因此,研究海量信息的跨媒体检索机制至关重要。一般而言,用户需要的检索信息并不直接存在于被检索资源中,而是需要搜索引擎对潜在的检索结果作诸如摘要生成、分类、排重、聚类等智能处理,才能满足用户的检索需求。
     本文在广泛阅读相关文献、深入了解聚类算法的原理与应用的基础上,主要针对基于图模型的聚类算法,在算法的改进、应用上做了如下工作:
     1.结合基于因数图的仿射传播聚类算法和k-中值多数投票算法的优点,提出了使用松弛多根最小生成树分配算法的投票分割式仿射传播聚类算法,实验结果证实了该算法的有效性。
     2.提出了一种基于随机分块和投票聚类融合策略的聚类大规模数据集合的算法框架,并使用该算法框架对仿射传播聚类算法进行了扩展,使其能够处理任意形状的更大规模的数据集合,并验证了该扩展的可行性和有效性。
     3.对聚类分析在图像搜索领域的应用进行了探索,提出了一个基于投票分割式仿射传播聚类算法的图片搜集模型。
     本文的贡献和创新主要体现在算法的改进和应用上:
     1.提出了分割式仿射传播聚类算法,它在实际聚类个数大于本质聚类个数时能在本质聚类上产生随机的划分,从而使其满足投票策略对聚类生成器的随机性的要求。
     2.提出了松弛多根最小生成树分配算法,在基本保持聚类结果的随机性的基础上,减小了误分配的概率。
     3.将使用松弛多根最小生成树算法的分割式仿射传播聚类算法和投票策略结合在一起,并结合划分一致性索引指标讨论了如何选取合适的域值参数的问题。
     4.提出了一种将时间复杂度和空间复杂度较高的聚类算法扩展到大规模数据集合应用上的算法框架——随机分块再融合,并用其对仿射传播聚类算法进行了扩展,使其能处理更大规模的、任意形状分布的数据集合。
     5.提出了一个基于投票分割式仿射传播聚类算法的图片搜集模型,使开发人员能够基于此模型构造一个应用程序来帮助用户从现有图片搜索引擎上方便的获取相关主题的图片资料集合。
With the fast development of internet technique, more and more kinds of meta-data are displayed in mass web pages such as text, pictures, video and audio, which indicates a trend of cross-media. If there is not an effective search mechanism, it is very hard for us to gain useful information from such a large scaled web based database. Therefore, it is essential for us to study methods for searching in large scale of cross-media information. Generally, the information which user searches for is not stored in the database in a straightforward form. To meet the search demand of customers, some intelligent process is needed for the search engine, such as abstract generating, classifying, duplication removing and clustering.
     Based on survey of related papers and further study of clustering algorithms, the following work has been done in improvement and application of clustering algorithm based on graph model.
     1. We present an algorithm called voting partition affinity propagation which combines voting scheme with affinity propagation clustering algorithm. It is composed of three parts: relaxed multi-root minimum spanning tree assign dicipline, partition affinity propagation clustering algorithm and voting clustering ensemble scheme. It has advantages of both affinity propagation and voting k-means. Experiments on both simulated and real data set approve its validity.
     2. We display a scheme that can deal with problem of clustering large scaled data set. It was based on voting clustering ensemble algorithm and random partition. And this framework is used to extend affinity propagation clustering algorithm to deal with larger scaled and arbitrary shaped data set. Experiment results validate the feasibility and effectiveness of this extension.
     3. We propose a model for image collection which is based on voting partition affinity propagation.
     The main contribution and innovation of this paper are displayed on the improvement and application of the existing algorithms:
     1. Proposing partition affinity propagation clustering algorithm. The algorithm can generate random partition within the substantial cluster when output number of clusters is larger than the intrinsic one, thus to meet the demand for randomicity of voting clustering ensemble scheme.
     2. Presenting relaxed multi-root minimum spanning tree data point assign algorithm. The discipline can reduce the possibility of mis-assign while mostly keep the randomicity of clustering result.
     3. Proposing to combine affinity propagation clustering algorithm with voting clustering ensemble scheme through partition affinity propagation using relaxed multi-root minimum spanning tree assign discipline. The problem of how to choose an appropriate threshold for the voting scheme to get a good consistent clustering result is also discussed under the measurement of partition consistent index.
     4. Displaying a framework that can extend clustering algorithms of relative high time and space complexity so that they can deal with larger scaled data set. Firstly, it divides the data set into several sub-blocks randomly and clusters these sub-blocks. Such operation is done for certain times. Then, the above clustering results are ensembled using voting scheme. Affinity propagation clustering algorithm is extended by this framework to deal with larger scaled and arbitrary shaped data set.
     5. Introducing a model for image collection which is based on voting partition affinity propagation. Developers can build an application based on this proposed framework to facilitate people to get image resources of certain subject from existing image search engines more conveniently.
引文
[1] Thomas G Dietterich. Machine Learning Research[J]. AI Magazine, 1997, 18(4): 97-136
    [2] S. Chakrabarti, B. Dom, S.R. Kumar, P. Raghavan, et al. Experiments in Topic Distillation[C]// SIGIR workshop on Hypertext IR, 1998
    [3] Oren Etzioni, Michele Banko, Michael J. Cafarella. Machine Reading[C]// Proceedings of the 21st National Conference on Artificial Intelligence, 2006
    [4] Dawn J.L., W. Bruce Croft. Generating Hierarchical Summaries for Web Searches[C]// SIGIR, 2003
    [5] H-J. Zeng, Q-C. He, Z. Chen, et al Learning to Cluster Web Search Results[C]// SIGIR, 2004
    [6] H. Zhao, W. Meng, Z. Wu, et al. Fully Automatic Wrapper Generation for Search Engines[C]//WWW, 2005
    [7] Hawking D, Craswell N. Very Large Scale Retrieval and Web Search[M]. MIT Press, 2005
    
    [8] Kaufman, L., Rousseeuw, P.. Finding Groups in Data[M]. New York: Wiley, 1990
    [9] Rui Xu, Donald Wunsch II. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks[J], 2005,16(3): 645-678
    [10] A. Jain, R. Dubes. Algorithms for Clustering Data[M]. Englewood Cliffs, NJ: Prentice-Hall, 1988
    
    [11] Kleinberg. An Impossibility Theorem for Clustering[C]// Proceedings of Advances in Neural Information Processing Systems, 2002,1:463-470
    
    [12] X.Z. Fern, C.E. Brodley. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach[C]// Proceedings of 20th International Conference on Machine Learning, 2003: 186-193
    
    [13] A. Fred, A.K. Jain. Data Clustering Using Evidence Accumulation[C]// Proceedings of 16th International Confonference on Pattern Recognition, 2002: 276-280
    [14] A. Fred. Finding Consistent Clusters in Data Partitions[C]// Proceedings of 2nd International Workshop on Multiple Classifier Systems, 2001
    [15] Evgenia Dimitriadou, Andreas Weingessel, Kurt Hornik. Voting in Clustering and Finding the Number of Clusters[R]. SFB Adaptive Information Systems and Modelling in Economics and Management Science Report
    [16] Alexander P. Topchy, Martin H.C. Law, A.K. Jain, et al. Analysis of Consensus Partition in Cluster Ensemble[C]// Proceedings of the 4th IEEE International Conference on Data Mining, 2004
    [17] A.K. Jain, M.N. Murty, P.J. Flynn. Data Clustering: A Review[J]. ACM Computing Surveys, 1999,31(3)
    [18] Charles T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters [J]. IEEE Transactions on Computers, January 1971, C-20(1)
    [19] G. Celeux, G. Govaert. A Classification EM Algorithm for Clustering and Two Stochastic Versions[J]. Computational Statististics & Data Analysis, 1992, 14: 315-332
    [20] Miranda Maria Irene. Clustering by Density Estimation and Mode Seeking[OL]. http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/1999/clustering/node19.html
    [21] T. Kohonen. Self-Organization and Associative Memory. 3rd ed. Springer information sciences series [M]. Springer-Verlag, New York, NY
    [22] G Carpenter, S. Grossberg. ART3: Hierarchical Search Using Chemical Transmitters in Self-organizing Pattern Recognition Architectures[J]. Neural Networks, 1990, 3,129-152
    
    [23] Mercer, D.P. Clustering large datasets[R]. Linacre College, 2003
    [24] Kaufman, L., Rousseeuw, P.. Finding Groups in Data[M]. New York: Wiley, 1990
    [25] Frey B J, Dueck D. Clustering by Passing Messages between Data Points[J]. Science, 2007, 315(5814): 72-976
    [26] J.F. Frey, D. Dueck. Mixture Modeling by Affinity Propagation[OL]. http://books.nips.cc/papers/files/nips18/NIPS2005 0799.pdf,2007
    [27]Pelleg D,Moore A.X-means:Extending K-means with Efficient Estimation of the Number of Clusters[C]//Langley P,ed.Proceedings of the 17th International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers,2000:727-734
    [28]Michele Leone,Sumedha,Martin Weigt.Clustering by Soft-constraint Affinity Propagation:Clustering by Soft-constraint Affinity Propagation[J].Bioinformatics Advance Access,September 25,2007
    [29]王开军,张军英,李丹,et al.自适应仿射传播聚类[J].自动化学报,2007,33(12):1242-1246
    [30]R.C.Prim.Shortest Connection Networks and some Generalizations[J].Bell System Technical Journal,1957,36:1389-1401
    [31]C.L.Blake,C.J.Merz.UCI Repository of Machine Learning Databases[DB/OL].University of California,Irvine,Department of Information and Computer Sciences,1998
    [32]马修军.多媒体数据库与内容检索[M].北京:北京大学出版社,2007.7
    [33]K.Jain.Fundamentals of Digital Image Processing[M].Englewood Cliffs:Prentice Hall,1989
    [34]M.Stricker,M.Orengo.Similarity of Color Images[J].Storage and Retrieval for Image and Video Databases Ⅲ,1995,2185:381-392
    [35]G.Pass,R.Zabih.Histogram Refinement for Content-based Image Retrieval[C]//IEEE Workshop on Applications of Computer Vision,1996:96-102
    [36]J.R.Smith,S.F.Chang.Automated Binary Texture Feature Sets for Image Retrieval[C]//Proceedings of IEEE International Conference on Acoust.,Speech,and Signal Processing,1996
    [37]R.M.Haralick,K.Shanmugam,I.Dinstein.Texture Features for Image Classification[J].IEEE Transaction on System Management and Cyber,1973,SMC-3(6):610-621
    [38]H.Tamura,N.Yokoya.Image Database Systems:A Survey[J].Pattern Recognition,1984,17(1):29-43
    [39] A. Laine, J. Fan. Texture Classification by Wavelet Packet Signatures[J]. IEEE Transaction on Pattern Analysis and Machine Intelligence, 1993, 15(11): 1186-1191
    [40] T. Chang, C.C. Jay Kuo. Texture Analysis and Classification with Tree-structured Wavelet Transform[J]. IEEE Transaction on Image Processing, 1993, 2(4): 429-441
    [41] Jing Liu, Bin Wang, et al. Dual Cross-Media Relevance Model for Image Annotation[C]//ACM Multimedia, 2007

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700