用户名: 密码: 验证码:
一种基于联合熵的聚类边界检测技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着信息技术和数据库技术的不断发展,数据库中存储的数据种类和数量急剧增加,使得如何从海量数据中快速有效地提取有价值的信息变得至关重要。数据挖掘技术应运而生。适当的数据挖掘方法,使得生物学家可以发现大量的遗传信息,也使得地理学家可以发现对陆地气候有显著影响的极地和海洋大气压力模式。聚类技术是数据挖掘中的重要技术之一,人们对聚类技术已经有深入的研究,出现了许多种聚类算法,但对聚类边界的研究刚刚起步。聚类边界是一种模式,在实际应用中有着广泛的用途。在图像检测中,聚类的边界代表物体的轮廓,而在临床医学中,聚类的边界代表具有某种疾病特征的健康人群。所以,对聚类的边界的研究具有重要的价值。
     本文针对现有算法的不足,提出了基于联合熵的聚类边界检测算法(EDGE)和基于梯度二值化的聚类边界检测算法(BAGB)。
     EDGE算法采用网格技术和联合熵技术相结合的方法来提取聚类边界点。网格技术用于快速查找数据集中聚类边界所在的网格范围,这样就缩小了查找范围,提高了算法效率。联合熵技术用于在边界落入的网格范围内准确地识别聚类的边界点,这样提高了算法的精度。实验结果表明,该算法能够准确识别不同形状、大小和密度的数据集中聚类的边界,可以有效去除噪声,算法的时间复杂度是输入数据集点数的线性函数,在大型数据集上执行时间优势更明显。
     BAGB算法采用将网格技术和梯度算子相结合方法来提取聚类的边界点。网格技术用于用于提高数据处理的速度。prewitt梯度算子用于计算梯度,计算时采用的方法是在某网格周围3×3区域内从八个方向来计算梯度,取最大值为中心网格的梯度。梯度用于判断网格是否是边界网格,边界网格中的点即为边界点。此方法是把图像处理中处理图像边界的方法用于处理聚类的边界,为研究聚类边界提供了新思路。实验结果表明,该算法能够在含有噪声点/孤立点的数据集上,有效的检测出聚类的边界,运行效率高。
     本文的创新之处是:(1)提出了将网格技术和联合熵技术结合来检测聚类边界的思想,给出了EDGE算法;(2)将网格和梯度算子结合实现了聚类边界检测,提出了BAGB算法。
With the increasing development of information technology and database technology, the data category and quantity in the database demonstrates a sharp growth. Therefore, how to obtain effectively the valuable information from the mass data is of great significance. Meanwhile, in order to satisfy the requirement, data mining technology arises at the historic moment. Suitable data mining methods make it possible for biologists to discover the massive genetic information and geographers to discover atmospheric pressure pattern in term of polar region and the sea which has a remarkable influence on land climate. As one of the important data mining, clustering technology has already been done thorough research thus presenting many kinds of cluster algorithms. However, the research on cluster boundary's algorithms is still in its infancy. Cluster boundary, a kind of mode, is widely applied in practice. For instance, it represents the outline of the object in the image detection and healthy people having some potential diseases in the clinical medicine. Therefore, the research on cluster boundary is of great importance.
     To better the existing algorithm, this paper brings forth an Efficient boundary points Detecting alGorithm based on joint Entropy (EDGE) and Boundary points detecting clusterring Algorithm based Gradient Binarization (BAGB).
     The EDGE algorithm withdraws the cluster boundary points in combination of grid technology and joint entropy technology. Grid technique is used to fast search the scope of grids where clusters boundary exists, thus shrinking search scope and improving algorithm efficiency. Joint entropy is applied to detect boundary points of clusters in these grids and therefore increases algorithm precise. The experimental results show that EDGE can precisely detect boundary of clusters in terms of different shapes, sizes and density of the data sets and effectively eliminating noises. Meanwhile, the time complexity of algorithm is linear function of input the data set, so the superiority of execution time is more obvious in the large-scale data set.
     The BAGB algorithm, combining grid technology and gradient operator, withdraws the cluster boundary points. In this algorithm the grid technology is used to enhance the speed of the data processing, and prewitt gradient operator is applied to calculate gradient in 3×3 grid region from eight directions, the maximum being the central grid gradient. The gradient is used to judge whether the grid is the boundary grid or not, and a point in the boundary grid is a boundary point.Putting the method of image boundary processing into the practice of processing cluster boundary is a fresh idea for the research on cluster boundary. The experimental results indicate that this algorithm can effectively detect boundary of clusters in datasets with noises/outliers and improve operating sufficiency.
     The innovations of this paper are to put forward, for one thing, the idea of detecting cluster boundary based on the combination of networking and the union entropy technology and then provide EDGE algorithm, for another, the thought of detecting cluster boundary through combining grid and gradient operator and furnish BAGB algorithm.
引文
[1]Han Jiawei, Kamber Micheline. Data Mining:Concepts and Techniques,2nd Edition[M]. New York:Morgan Kaufmann,2006:384-385
    [2]贾海英等.聚类分析法在鼻腔鼻窦恶性肿瘤微血管检测中的应用[J].南方医科大学学报,2010,30(10):2357-2359
    [3]Xia Chenyi,Hsu Wynne, Lee Mong Li, et al. Border:Efficient Computation of Boundary Point[J].IEEE Transactions on Knowledge and Data Engineering.2006,18(3):289-303
    [4]Qiu Bao-Zhi, Yue Feng, Shen Jun-Yi.BRIM:An Efficient Boundary Points Detecting Algorithm[C]. Proc. of Advances in Knowledge Discovery and Data Mining. Heidelberg: Springer,2007:761-768
    [5]邱保志,刘洋等.基于网格熵的边界点检测算法[J].计算机应用,2008,28(3):732-734
    [6]邱保志,岳峰.基于引力的边界点检测算法[J].小型微型计算机系统,2008,29(2):279-282
    [7]李丰军,吾守尔·斯拉木等.一种基于角度的边界点检测算法[J].计算机工程,2008,34(11):203-205
    [8]薛丽香,邱保志.基于变异系数的边界点检测算法[J].模式识别与人工智能,2009,22(5):799-802
    [9]Liu D, Nosovskiy GV, Sourin O. Effective clustering and boundary detection algorithm based on Delaunay triangulation[J]. Pattern Recognition Letters,2008,29(9):1261-1273
    [10]Baozhi Qiu, Lixiang Xue. A boundary points detection algorithm based on entropy of grid[J]. Journal of information&computational science.2009,6(1):15-22
    [11]邱保志,余田.一种基于网格核密度的自适应边界点检测算法[J].小型微型计算机系,2008,29(5):837-840
    [12]吾守尔·斯拉木,李丰军,陶梅.IBORA:一种改进的有效的边界点检测[J].小型微型计算机系统,2008,29(10):1845-1848
    [13]何思谦,刘绍学,朱元森.数学辞海·第二卷[M].中国科学技术出版社,2002
    [14]同济大学数学系.高等数学(第六版)[M].高等教育出版社,2007年
    [15]Ester M,Kriegel H P,Sander J.A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C].Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland,Oregon.1996:226-231
    [16]张西芝,姬波,邱保志.基于网格的多密度聚类算法[J].微计算机信息(管控一体化),2005,21(12):101-103
    [17]Adimir Estivill-Castro, Ickjai Lee. Automatic Clustering Via Boundary Extraction for Mining Massive Point-data Sets[C].Proceedings of the 5th International Conference Geo Computation University of Greenwich, United Kingdom,2000,23-25
    [18]Harsha Nagesh, Sanjay Goil, Alok Choudhary. Adaptive Grids for Clustering Massive Data Sets[C], SIAM Conference on Data Mining,2001
    [19]C. Xia, H. Lu, B.C. Ooi, and J. Hu. Gorder:An Efficient Method for KNN Join Processing[C],Proc. Int'l Conf. Very Large Data Bases,2004
    [20]邱保志,琚长涛.具有聚类功能的边界检测技术的研究[J].计算机工程与应用,2010,46(20):138-141
    [21]邱保志,沈钧毅.基于扩展和网格的多密度聚类算法[J].控制与决策,2006,21(9):1011-1014
    [22]邱保志,余田.基于网格梯度的边界点检测算法的研究[J].微电子学与计算机,2008,25(3):77-80
    [23]V. Nosovskiy, Dongquan Liu, Olga Sourina.Automatic Clustering & Boundary Detection Algorithm Based on Adaptive Influence Function[J]. Pattern Recognition.2008,41(9): 2757-2776
    [24]陈阵,于炯.FRINGE:边界点的有效检测[J].新疆大学学报(自然科学版),2008,25(3),263-268
    [25]许倡森.基于混合网格划分的子空间高维数据聚类算法[J].计算机技术与发展,2010,20(10):150-153
    [26]刘小强.基于QoS服务的网格计算分析与研究[J].科技信息,2010,21:52-53
    [27]李亦农,李梅.信息论基础教程[M].北京:北京邮电大学出版
    [28]邱保志,张西芝.基于网格的参数自动化聚类算法[J].郑州大学学报(工学版),2006,27(2):91-93
    [29]Chih-MingHsu, Ming-Syan Chen. Subspace clustering of high dimensional spatiall data with noises[C]. PAKDD 2004, LNAI 3056.31-40
    [30]George Karypis,Eui-Hong(Sam)Han, Vipin Kumar. CHAMELEON:A Hierarchical Clustering Algorithm Using Dynamic Modeling[J]. IEEE Computer,1999, 32(8):68-75,August
    [31]冈萨雷斯.数字图像处理(MATLAB版)[M],电子工业出版社,2005,289
    [32]章毓晋.图像工程[M].北京:清华大学出版,2000
    [33]杨道普,马秋禾,石磊.边缘检测Prewitt算子的改进算法[M].测绘科学,2008,33卷增刊
    [34]Adimir Estivill-Castro, Ickjai Lee. Automatic Clustering Via Boundary Extraction for Mining Massive Point-data Sets[C].Proceedings of the 5th International Conference Geo Computation University of Greenwich, United Kingdom,23-25 August 2000

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700