Research on Clustering Aggregation and Its Application in Corn Breed Selection
Abstract
This work was supported by the National 863 High-Tech Program project "Digital Agricultural Knowledge Grid Technology Research and Application". We first use clustering aggregation techniques to propose two algorithms that analyze the relationships among corn breed attributes, and then, combined with web services, design and implement a corn breed selection system based on the B/S (browser/server) architecture.
     Clustering plays an important role in data analysis. It partitions a data set into several disjoint clusters according to the similarity (or distance) between data patterns. Through cluster analysis, useful information patterns can be mined from large amounts of data and applied to research and practice in many fields. Although many clustering algorithms already exist, each has shortcomings that can keep it from partitioning a data set well. Clustering aggregation strengthens clustering: it first obtains multiple partitions of the data set with clustering algorithms, treats these partitions as knowledge about the data set, then recomputes similarities and reassigns data patterns, and finally produces a clustering result that is consistent with most of the partitions.
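     To make the aggregation idea above concrete, the following minimal C# sketch shows one common consensus scheme from the clustering aggregation literature (co-association, or evidence-accumulation, voting): each base partition casts a vote on whether two patterns belong together, and the accumulated votes act as a new similarity. All names are illustrative; this is a sketch of the general idea, not the algorithms proposed in this thesis.

    // partitions[p][i] = cluster label of pattern i in base partition p
    static class CoAssociationSketch
    {
        public static double[,] BuildSimilarity(int[][] partitions, int patternCount)
        {
            var votes = new double[patternCount, patternCount];
            foreach (var labels in partitions)
                for (int i = 0; i < patternCount; i++)
                    for (int j = 0; j < patternCount; j++)
                        if (labels[i] == labels[j])
                            votes[i, j] += 1.0;              // this partition votes "same cluster"

            for (int i = 0; i < patternCount; i++)
                for (int j = 0; j < patternCount; j++)
                    votes[i, j] /= partitions.Length;        // fraction of partitions that agree

            // A final clustering step over this similarity would yield the consensus
            // result that is consistent with most of the base partitions.
            return votes;
        }
    }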
     Building on this study of clustering and clustering aggregation, this thesis proposes two improved algorithms: the c-means clustering algorithm and the CAVM clustering aggregation method. c-means is an improvement of k-means that can cluster mixed data sets containing both categorical and numeric attributes. It differs from k-means in three respects: (1) the number of clusters and the initial cluster centers are chosen according to the number of categorical attributes; (2) the distance between data patterns is not the Euclidean distance; instead, the distance between a pattern and a cluster center is expressed in vector form so that both categorical and numeric attributes are taken into account; (3) the convergence function is also expressed in vector form, as the sum of the distances between all patterns and their cluster centers; once this sum no longer changes, the clustering process can stop. CAVM is proposed on top of c-means; it is a voting-based aggregation method that can cluster mixed-attribute data sets. CAVM uses two rounds of voting to correct the clustering process: the first round uses majority voting to correct the cluster members, and the second round uses joint voting during aggregation to correct the final clustering result. Unlike general clustering aggregation methods, CAVM does not need to generate multiple partitions in advance: it first produces an initial partition from the distinct values of the categorical attributes, then clusters the data set with c-means while voting on the data patterns, and finally repartitions the data set according to the voting result.
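     The abstract states only that the pattern-to-center distance and the convergence function are kept in vector form covering both attribute types, and that CAVM's first voting pass is a majority vote; the exact formulas are not given here. The C# sketch below is therefore one assumed reading (squared differences for numeric attributes, a 0/1 mismatch count for categorical attributes, and a simple majority vote), not the author's implementation.

    using System.Collections.Generic;
    using System.Linq;

    struct MixedDistance
    {
        public double Numeric;       // contribution of the numeric attributes
        public double Categorical;   // contribution of the categorical attributes
        public double Total => Numeric + Categorical;
    }

    static class CMeansSketch
    {
        // Distance between one data pattern and one cluster center, kept as a
        // two-component vector so that both attribute types are taken into account.
        public static MixedDistance Distance(double[] xNum, string[] xCat,
                                             double[] centerNum, string[] centerCat)
        {
            var d = new MixedDistance();
            for (int i = 0; i < xNum.Length; i++)
                d.Numeric += (xNum[i] - centerNum[i]) * (xNum[i] - centerNum[i]);
            for (int i = 0; i < xCat.Length; i++)
                d.Categorical += xCat[i] == centerCat[i] ? 0.0 : 1.0;   // mismatch count
            return d;
        }

        // Convergence function: the sum of all pattern-to-center distances;
        // clustering stops once this sum no longer changes between iterations.
        public static double Objective(IEnumerable<MixedDistance> assigned)
            => assigned.Sum(d => d.Total);

        // Majority vote over the candidate labels collected for one pattern,
        // in the spirit of CAVM's first voting pass.
        public static int MajorityVote(IEnumerable<int> labelsForPattern)
            => labelsForPattern.GroupBy(l => l)
                               .OrderByDescending(g => g.Count())
                               .First().Key;
    }

     CAVM's second, joint-voting pass during aggregation is not shown; it would combine such votes when the data set is repartitioned.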
     We collected 122 corn breeds suitable for planting in Jilin Province. After observing and analyzing the attributes of these breed data, we found that some attributes may be correlated. To find and verify these correlations, we reduced the dimensionality of the 122 breeds' attribute information, standardized it, and then applied the c-means and CAVM algorithms to the data set, mining two association rules: a "growth period - breed type" rule and a "planting region - breed type" rule. Both rules roughly agree with agricultural knowledge, which shows that c-means and CAVM can identify the internal structure of the data set well. In addition, to verify the accuracy of the algorithms, we compared c-means and CAVM with k-means experimentally; the results show that c-means and CAVM have clear advantages when handling mixed data sets.
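     The abstract notes that the breed data were dimension-reduced and standardized before clustering but does not say which standardization was used; a common choice for numeric attributes is the z-score, sketched here as an assumption.

    using System;
    using System.Linq;

    static class Preprocessing
    {
        // Z-score standardization of one numeric attribute column
        // (for example, plant height over the 122 breeds): zero mean, unit variance.
        public static double[] ZScore(double[] column)
        {
            double mean = column.Average();
            double std  = Math.Sqrt(column.Sum(v => (v - mean) * (v - mean)) / column.Length);
            return column.Select(v => std > 0 ? (v - mean) / std : 0.0).ToArray();
        }
    }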
     Most current agricultural websites offer detailed crop knowledge and online expert Q&A sections, which both provide relevant knowledge to users and answer difficult questions. These sites nevertheless have two shortcomings: (1) when selecting a breed, users must study and analyze the attributes of each breed themselves, combine them with their local conditions, and make a choice only after a comprehensive comparison, which increases their burden; (2) online experts can give users fairly authoritative answers, but when the expert is offline or users must wait for a reply, the answer may lose its timeliness.
     To improve this situation, and because no reasonably complete corn breed selection system exists yet, this thesis designs and implements a corn breed selection system based on the 122 collected corn breeds. The design and implementation use Microsoft Visual Studio 2008, IIS, and SQL Server 2005, and involve the C# and ASP.NET programming languages, web-based services, and the design and management of database tables. Notably, the system applies the two mined association rules, translating user input into query constraints to better guide breed selection. The final system has two parts: a primary query and an advanced query. The primary query requires little user input and is intended for users with simple needs; the advanced query has more constraints, so users can enter various requirements according to their actual situation, and the system returns the corn breeds that satisfy them. The system also shows detailed information about each breed: besides breed name, plant height, growth period, breed type, planting conditions, and planting region, it lists the yield, disease resistance, lodging resistance, and planting guidelines that users care about. This information not only deepens users' understanding of a breed but also, to some extent, helps them grow corn better.
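     As an illustration of how the advanced query might turn user input into query constraints, here is a hedged C# sketch. The class, property, and method names are hypothetical; the real system runs on ASP.NET with SQL Server 2005, so the actual filtering would be done in SQL rather than with the in-memory LINQ query shown here.

    using System.Collections.Generic;
    using System.Linq;

    class CornBreed
    {
        public string Name;
        public int GrowthPeriodDays;     // growth period in days
        public string BreedType;         // breed type
        public string PlantingRegion;    // suitable planting region
    }

    static class BreedQuery
    {
        // Advanced query: every non-empty user input becomes one constraint.
        // The two mined rules (growth period - breed type, planting region - breed type)
        // could additionally suggest a breed type when the user leaves that field blank.
        public static IEnumerable<CornBreed> AdvancedQuery(IEnumerable<CornBreed> breeds,
                                                           string region,
                                                           int maxGrowthPeriodDays,
                                                           string breedType)
        {
            return breeds.Where(b =>
                   (string.IsNullOrEmpty(region) || b.PlantingRegion.Contains(region))
                && (maxGrowthPeriodDays <= 0 || b.GrowthPeriodDays <= maxGrowthPeriodDays)
                && (string.IsNullOrEmpty(breedType) || b.BreedType == breedType));
        }
    }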
     Compared with general agricultural websites, the corn breed selection system designed in this thesis is more convenient and accurate. Users do not need to analyze the attributes themselves or weigh factors such as planting region, local wind conditions, and yield potential; they only need to enter their requirements, and the system quickly returns breeds that satisfy them. The system still has shortcomings: it is not intelligent enough to handle some special cases, and its reasoning mechanism contains too few constraint rules (only the two above), so it cannot handle some special breeds well.
