An Incremental Cluster Analysis System for Grain Company Information and Its Implementation

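English title: The Design and Implement an Increment Grain Company Information Cluster System
Author: Wang Xiaotao (王晓涛)
Degree level: Master's
Discipline: Computer Software and Theory
Degree year: 2004
Advisor: Zuo Wanli (左万利)
Discipline code: 081202
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2004-04-01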

Abstract

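The work described in this thesis is part of the "Intelligent Decision Support System for Grain Management Information", a major science and technology project funded by the Jilin Provincial Department of Science and Technology. Set against the background of grain work in Jilin Province, the system draws on data warehousing, data mining, statistical analysis, and knowledge-based reasoning. A grain data warehouse was built to cover the individual stages of day-to-day grain work, such as purchasing, transportation, rotation, and allocation, as well as business report generation and data query and analysis. On top of this data warehouse, the system provides modules for multidimensional data analysis, grain storage decision support, grain rotation decision support, grain allocation decision support, grain depot information cluster analysis, grain early warning and forecasting, and GIS query and analysis. This thesis implements the grain depot information cluster analysis module.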
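    This thesis applies cluster analysis from data mining to the information indicators of grain enterprises at all levels, including natural indicators (depot area, number of employees, transport route conditions, and so on) and business indicators (annual purchase volume, drying volume, rotation volume, and so on). Based on natural and economic conditions, the enterprises are divided into classes that differ markedly from one another while each has distinctive characteristics of its own. Statistical analysis of the clustering result then yields an overview of how all grain enterprises are developing: the tiers of enterprise development, detailed descriptive data for the enterprises in each tier, and the geographic distribution of enterprises in different tiers. Methods such as analysis of variance can also be used to identify the indicators that most strongly influence the tier classification of depots. These analyses give grain industry leaders decision support for reviewing the effect of past policies, adjusting the industry's future development strategy, and formulating regional grain policies.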
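To illustrate how analysis of variance can rank indicators by their influence on the classification, the following minimal sketch computes a one-way ANOVA F statistic for a single indicator across clusters. The function name `f_statistic` and the sample values are assumptions made for this example and are not taken from the thesis, which does not describe its exact procedure. A larger F suggests that the indicator separates the classes more strongly.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for one indicator.

    groups: list of lists, one list of indicator values per cluster.
    Assumes at least two clusters and some within-cluster variation
    (otherwise the denominator would be zero).
    """
    k = len(groups)                       # number of clusters
    n = sum(len(g) for g in groups)       # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own cluster mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical annual purchase volumes for the depots in two clusters.
purchase_by_cluster = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(f_statistic(purchase_by_cluster))
```

Computing this value for every indicator and sorting the indicators by F gives one simple way to see which ones contribute most to the separation of the classes.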
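    Two commonly used methods are employed: hierarchical clustering and split clustering. The hierarchical method is agglomerative: at each step it merges the two closest samples or sub-clusters into one, until all samples form a single cluster or the number of clusters specified by the user is reached. This method is thorough in its computation and accurate in its results, and the user can study the whole merging process to understand the structure of the sample set. Split clustering, also called k-means clustering, is faster, but its result depends on other factors such as the choice of initial center points, so it is sometimes unsatisfactory. In practice the two methods should be combined according to the data volume. With a small amount of data, hierarchical clustering can be applied directly. The grain data warehouse, however, is usually very large, so the user can first run hierarchical clustering once, analyze the hierarchical clustering process to find a suitable number of clusters, and then for some time afterwards run split clustering with that number. This thesis also discusses in detail, and implements, several key issues in the clustering algorithms, such as the distance computation used during clustering, the handling of abnormal data, the handling of clustering constraints, and the selection of initial center points for split clustering.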
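The combined workflow described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes Euclidean distance and centroid linkage, neither of which the abstract specifies, and the names `agglomerative` and `kmeans` are hypothetical. The merge history returned by `agglomerative` is what a user would inspect to choose a cluster count, and the centroids of the hierarchical result are reused as initial centers for k-means.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def agglomerative(samples, target_clusters=1):
    """Merge the two closest clusters until target_clusters remain.

    Returns (clusters, history); history holds the distance of each merge.
    Assumption: clusters are compared by the distance between centroids.
    """
    clusters = [[s] for s in samples]
    history = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, history


def kmeans(samples, centers, max_iter=100):
    """Plain k-means starting from the given initial centers (max_iter >= 1)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for s in samples:
            nearest = min(range(len(centers)),
                          key=lambda c: euclidean(s, centers[c]))
            groups[nearest].append(s)
        # Keep the old center if a group ends up empty.
        new_centers = [centroid(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return groups, centers


# Hypothetical two-indicator samples.
samples = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters, history = agglomerative(samples, target_clusters=3)
groups, centers = kmeans(samples, [centroid(c) for c in clusters])
```

A sharp jump in `history` when running with `target_clusters=1` is one common signal for where to stop merging. Seeding k-means from the hierarchical centroids also reduces its sensitivity to the initial center points mentioned above.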
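    The grain industry data warehouse is large and updated frequently. Using hierarchical and split clustering alone gives unsatisfactory efficiency and accuracy, and a new clustering often has to be run without being able to reuse earlier results. To address this, the thesis proposes an incremental hierarchical clustering method that uses a distance-based approach to cluster on top of an existing hierarchical or split clustering result. Incremental clustering consists of three parts. The first obtains the incremental data: the system creates an incremental data table for each clustering project and uses triggers to update that table whenever the business data is updated. The second applies the incremental data to the existing result, covering additions, deletions, and modifications, and adjusts the cluster structure accordingly. The third runs hierarchical clustering on the modified result, using the same method as the hierarchical clustering algorithm. Practical tests show that this method makes effective use of existing clustering results, speeds up clustering, and produces fairly good results; it performs well when the data warehouse is large and frequently updated.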
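The three parts of the incremental method can be sketched as below. This is one possible reading under stated assumptions, not the thesis algorithm: the `delta` list stands in for rows of the trigger-maintained incremental data table, each added or modified record is assumed to re-enter as a singleton cluster, and clusters are merged by centroid distance. The names `apply_delta` and `recluster` are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def apply_delta(clusters, delta):
    """Apply add/delete/update operations to an existing clustering.

    clusters: list of dicts mapping record id -> indicator vector.
    delta: list of (op, record_id, vector_or_None) tuples, standing in
           for rows of the incremental data table.
    Assumption: added and updated records become singleton clusters.
    """
    clusters = [dict(c) for c in clusters]      # work on a copy
    for op, rec_id, vector in delta:
        if op in ("delete", "update"):
            for c in clusters:
                c.pop(rec_id, None)
        if op in ("add", "update"):
            clusters.append({rec_id: vector})
    return [c for c in clusters if c]           # drop emptied clusters


def recluster(clusters, target_clusters):
    """Resume agglomerative merging from the adjusted clusters."""
    clusters = [dict(c) for c in clusters]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            ci = centroid(list(clusters[i].values()))
            for j in range(i + 1, len(clusters)):
                cj = centroid(list(clusters[j].values()))
                d = euclidean(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].update(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical existing result and incremental data.
existing = [{"A": (0, 0), "B": (0, 1)}, {"C": (10, 10), "D": (10, 11)}]
delta = [("add", "E", (0, 2)), ("delete", "D", None), ("update", "B", (10, 9))]
result = recluster(apply_delta(existing, delta), target_clusters=2)
```

Because merging restarts from the existing clusters plus a few singletons instead of from every record, the number of pairwise comparisons is much smaller than in a full hierarchical run when the delta is small relative to the data set.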
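    For analyzing and presenting the clustering results, the thesis provides a "clustering process chart" that shows the hierarchical clustering process, a "bar chart of per-class indicator means" that summarizes the mean of each indicator in each class, an "inter-class indicator comparison chart" that compares indicator differences between classes, and a "GIS geographic distribution map" that works with the GIS query module to show the geographic distribution of the clustering result. By combining these views, the thesis aims to give users a multi-angle, more intuitive, and more understandable picture of the cluster analysis results as a better reference for decision making.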
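    The grain intelligent decision support system currently offers cluster analysis only for grain depot information. During design and development, however, the aim was a widely applicable, configurable cluster analysis tool for the grain industry, with which grain experts can set up clustering projects on different topics, choose the data source, specify the indicators that are relevant to the topic and of interest to the user, and select a suitable clustering method. Users would then only need to pick the clustering project they are interested in, run it, and examine the result with the different analysis views. Reaching this goal requires further refinement of the system.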
What is realized in this thesis is part of the "Intelligent Decision Support System of Grain" (IDSSG), one of the major technology projects supported by the Jilin Department of Science and Technology. The system is set against the actual work of grain enterprises in Jilin Province. It uses data warehousing, data mining, statistical analysis, knowledge-based reasoning, and other techniques to establish a Grain Data Warehouse covering every step of the grain business: purchasing, transportation, rotation, scheduling, report generation, and data query and analysis. On the basis of the Grain Data Warehouse, the system establishes six modules: multidimensional data analysis, grain-keeping decision support, grain-rotation decision support, grain-scheduling decision support, grain company information cluster analysis, and grain early warning. This thesis discusses the grain company information cluster analysis module.
    In this thesis, I used cluster analysis methods from data mining to cluster all kinds of grain company information, including natural information (grain depot area, number of employees, and so on) and business information (annual purchase volume, annual drying volume, annual rotation volume, and so on). The data items are divided into several groups such that the groups differ greatly from one another while each group has clear characteristics of its own. Users can analyze the clustering result to learn about the development status of all grain companies. The module provides the classification of grain companies, detailed information on each class, the geographic distribution of each class, and the information items that most influence the classification. This information gives grain company leaders decision support for summarizing the effect of grain policy, making or adjusting grain business strategy, and setting different policies for different areas.
    The system uses a hierarchical clustering algorithm and a split clustering algorithm, both distance-based. The hierarchical algorithm is agglomerative: in every clustering cycle the program selects the two data items that are closest to each other and merges them into one group. This continues until all data items are merged into one group or the number of groups reaches the number the user specified beforehand. The algorithm computes exactly, and the user can analyze its clustering process to learn about the structure of the data. The split clustering algorithm is also called the K-Means clustering algorithm. It runs faster than the hierarchical algorithm, but its result is sometimes not precise enough because it is influenced by parameters such as the selection of the initial center points. In practice the two algorithms are chosen according to the amount of data and the user's requirements for the clustering result. When the amount of data is small, the hierarchical algorithm can be used directly. The Grain Data Warehouse is often very large, however, so the hierarchical algorithm can be used first to obtain the hierarchical clustering process, which is then analyzed to find an appropriate number of clusters. The user can later run the split clustering algorithm with this number. This thesis also discusses several important questions, such as how to compute the distance between data items, how to process exceptional data items, how to realize the constraint conditions of clustering, and how to select the initial center points of the split clustering algorithm.
    The Grain Data Warehouse holds a large amount of data and is updated frequently, so using only the hierarchical and split clustering algorithms cannot give good results. In many cases the results of earlier clustering runs also cannot be reused. An incremental clustering algorithm is therefore proposed and implemented in this thesis. This algorithm performs clustering on the basis of a former

