Research and Improvement for Semi-supervised K-means Clustering Algorithm in Data Mining

Original title: 数据挖掘中半监督K-均值聚类算法的研究与改进
Author: Liu Fang (刘方)
Degree level: Master's
Discipline: Bioinformatics
Keywords: Data Mining; Semi-supervised Learning; K-means Clustering
Degree year: 2010
Advisor: Liang Yanchun (梁艳春)
Discipline code: 071010
Degree-granting institution: Jilin University (吉林大学)
Submission date: 2010-04-01

Abstract

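Data mining is an important topic in current research on machine learning, pattern recognition, computer science, intelligent computing, applied mathematics, statistical learning, and intelligent robotics. It analyzes, refines, and extracts from existing data knowledge that is implicit, previously unknown, and potentially valuable for decision-making.
     This thesis studies algorithms and applications for cluster analysis in data mining. Building on the traditional K-means clustering algorithm, and in order to improve its efficiency, it proposes two improved K-means clustering algorithms that select the initial cluster centers by data segmentation; applied to statistical data on the basic household income and expenditure of urban residents in the regions of China, these algorithms achieve good results. Combining semi-supervised learning, it proposes a semi-supervised K-means clustering algorithm together with two improved variants aimed at the selection of the initial cluster centers; the algorithms before and after improvement are applied to statistical data on the height and weight of Chinese men and women, again with good results.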
Data mining is an important topic in current research on machine learning, pattern recognition, computer science, intelligent computing, applied mathematics, statistical learning, and intelligent robotics. Data mining techniques are applied in databases, statistics, optimization, artificial intelligence, pattern recognition, parallel computing, machine learning, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
     With the rapid development of modern computer, information, and communication technology, how to analyze, refine, and extract implicit, previously unknown, novel knowledge with potential value for decision-making from the available data has become a problem that urgently needs to be addressed.
     This thesis focuses on cluster analysis in data mining and studies both algorithms and applications. Building on the traditional K-means clustering algorithm, and in order to improve its efficiency, it presents an improved K-means clustering algorithm that selects the initial cluster centers by data segmentation; applied to statistical data on the basic household income and expenditure of urban residents in the regions of China, the algorithm achieves good results. Combining semi-supervised learning, it proposes a semi-supervised K-means clustering algorithm and, for the choice of the initial cluster centers, an improved semi-supervised K-means clustering algorithm; applied to statistical data on the height and weight of Chinese men and women, the algorithm obtains better results.
     The main contributions and research findings of this thesis are as follows:
     1. An overview of data mining research.
     This part introduces and summarizes the significance, main content, and applications of data mining, discusses current problems in the field, and points out directions for future research and development. Data mining is an interdisciplinary field that emerged in the late 1980s. The current state of data mining capabilities and products results from the combined influence of several disciplines: databases, information retrieval, statistics, algorithms, and machine learning. With the rapid development of modern information, communication, and computer technology, the scope, depth, and scale of database applications are expanding. Most traditional information systems are query-driven: the database serves as a store of historical knowledge, which is effective for ordinary queries. But when the volume of data and the size of the database increase sharply, the query and retrieval mechanisms and statistical analysis methods of traditional database management systems can no longer meet real needs, and extracting useful information and knowledge from the database automatically, intelligently, and quickly becomes an urgent requirement. In general, data mining tasks fall into two categories: descriptive data mining and predictive data mining. Data mining has a wide range of applications in financial data analysis, research on the composition of gene sequences, retail data analysis, telecommunications, and other areas. Wherever there is data, there is data mining.
     2. Introduction and analysis of the theory and methods related to clustering problems in data mining.
     The clustering problem is to identify the classes implicit in the data, where a class is a set of data with similar properties. Different notions of similarity lead to different clustering methods; for example, similarity can be described by distance. In general, the way similarity is described is given by the user or an expert. A good clustering method produces a good clustering, with low similarity between classes and high similarity within each class. Clustering algorithms can be divided into two major categories: hierarchical methods and partitioning methods. This thesis introduces hierarchical algorithms and describes partitioning algorithms, among which it singles out the K-means clustering algorithm and gives a brief example of its solution process. Finally, a comparison of the relevant algorithms shows that K-means has the lowest space and time complexity among them.
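     As an illustration of distance-based partitioning, the following is a minimal Python sketch of the standard K-means procedure; it is not the thesis's own implementation. The function names (`euclidean`, `mean_point`, `kmeans`), the optional `init_centers` parameter, and the sample points are assumptions introduced here, and similarity is taken to be Euclidean distance.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Standard K-means; initial centers are random unless init_centers is given."""
    rng = random.Random(seed)
    centers = list(init_centers) if init_centers is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the class of its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its class members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels


# Made-up two-dimensional sample data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(points, k=2)
```

     Because the initial centers are drawn at random by default, the number of iterations, and in harder cases the final partition, can depend on the draw; passing `init_centers` makes the starting point explicit.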
     3. Introduction and analysis of semi-supervised learning methods.
     In traditional supervised learning, the learner is trained on a large amount of labeled data in order to build a model that predicts unlabeled data. But labeled data are often difficult, expensive, and time-consuming to obtain, and labeling often requires experienced researchers. With the rapid development of data collection and storage technology, unlabeled data are very easy to collect, but clustering with unlabeled data alone can produce very large errors.
     Clearly, using only the small amount of "expensive" labeled data and ignoring the large amount of "cheap" unlabeled data wastes data resources. Semi-supervised learning is a way of learning from a large amount of unlabeled data together with a small amount of labeled data. By combining the two, it avoids this waste, and it is of great significance in both theoretical research and practical applications. For semi-supervised classification, this thesis presents five kinds of learning methods; for semi-supervised clustering, it gives a diagrammatic description.
     4. A study of the K-means clustering algorithm, with two proposed improved algorithms, a semi-supervised algorithm, and two improved semi-supervised algorithms.
     K-means is an algorithm that determines k centers by means. Its idea is that, once a class is determined, the class center is the mean (the geometric center) of the data points in that class. The standard K-means clustering algorithm selects its initial cluster centers at random, which makes the algorithm less efficient: it needs more iterations and a longer CPU running time. To address this, we propose an improved method of selecting the initial points, called the improved K-means clustering algorithm. The algorithm uses data segmentation: the sample points of the data set are divided into k segments, and a center within each segment is taken as an initial center. This avoids choosing initial centers that are too close to one another. The experiments in this thesis show that the algorithm is effective. Combining this with the idea of semi-supervised learning, the thesis also presents a semi-supervised K-means clustering algorithm, in which the method of choosing the initial cluster centers is extended for use in semi-supervised learning. In the semi-supervised K-means clustering algorithm, the choice of labeled points is very important and has a significant influence on the clustering results. The algorithm is applied to the clustering of two-dimensional data, which confirms its effectiveness.
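     The two ways of choosing initial centers described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the thesis's algorithm: the abstract does not say how the samples are ordered before segmentation, so ordering by distance from the origin is assumed here, and the semi-supervised variant is assumed to take the mean of the labeled points of each class as that class's initial center. All function names and the height and weight values are hypothetical.

```python
import math


def mean_point(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def segment_init_centers(points, k):
    """Order the points, cut them into k contiguous segments, and return
    each segment's mean. Ordering by distance from the origin is an
    assumption made for this sketch."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    ordered = sorted(points, key=lambda p: math.sqrt(sum(x * x for x in p)))
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    return [mean_point(ordered[bounds[i]:bounds[i + 1]]) for i in range(k)]


def seed_init_centers(labeled_points, labels):
    """Mean of the labeled points of each class, in sorted label order
    (assumed form of the semi-supervised initialization)."""
    groups = {}
    for p, lab in zip(labeled_points, labels):
        groups.setdefault(lab, []).append(p)
    return [mean_point(groups[lab]) for lab in sorted(groups)]


def lloyd(points, centers, max_iter=100):
    """Standard K-means iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_point(members) if members else old)
        if updated == centers:  # centers stopped moving: converged
            break
        centers = updated
    return centers, labels


# Hypothetical (height in cm, weight in kg) samples, made up for illustration.
data = [(158, 50), (160, 52), (162, 55), (165, 57),
        (170, 65), (173, 68), (176, 72), (180, 78)]

seg_centers, seg_labels = lloyd(data, segment_init_centers(data, 2))
seed_centers, seed_labels = lloyd(
    data, seed_init_centers([(160, 52), (176, 72)], ["F", "M"]))
```

     Both initializations place one starting center in each region of the data, which is the property the text credits with avoiding initial centers that lie too close together.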
     The results of this research enrich the theoretical and applied study of clustering problems in data mining. The work on cluster analysis, K-means clustering, and semi-supervised K-means clustering has both theoretical and practical value.
