Research and Application of Data Reduction Methods Based on Manifold Learning
Abstract
With the continuous development of information technology, scientific research frequently encounters data sets with high-dimensional characteristics, and this high dimensionality makes it very difficult to uncover the intrinsic laws and structure of the data. Appropriate data reduction methods are therefore needed to process such data sets. Data reduction is also known as dimensionality reduction, and existing dimensionality reduction methods perform differently on different data sets. In terms of the structure the data exhibit, data reduction methods based on manifold learning fall into two categories: linear and nonlinear. Linear methods can effectively process data sets with a linear structure or Gaussian data sets; nonlinear methods can project data embedded in a high-dimensional space onto low-dimensional coordinates, allowing the intrinsic geometric structure of the data to be explored further. Manifold learning uses data analysis techniques to expose the geometric information within a sample set, that is, it represents complex high-dimensional data with a concise low-dimensional structure. Its main purpose is to find the intrinsic distribution of data embedded in a high-dimensional space, and it has become a research hotspot in machine learning and related fields.
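Formally, the setting sketched above is usually stated as follows (a standard formulation, not spelled out in the abstract itself): the observed samples are assumed to lie near a low-dimensional manifold,

$$x_i = f(\tau_i) + \varepsilon_i, \qquad \tau_i \in \Omega \subset \mathbb{R}^d,\; x_i \in \mathbb{R}^D,\; d \ll D,$$

and a nonlinear reduction method seeks low-dimensional coordinates $y_i \in \mathbb{R}^d$ that recover the intrinsic parameters $\tau_i$ up to a smooth reparameterization.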
This thesis studies data reduction methods based on manifold learning from two aspects, the selection of the neighborhood parameter and the handling of newly added data points, and elaborates on each in detail. The improved methods are applied to text clustering, and experiments verify their effectiveness and feasibility. The main work is summarized as follows:
1. A method for judging the suitability of the neighborhood parameter is proposed. The method uses kernel principal component analysis (KPCA) to reconstruct the data error, clusters the reconstructed error, and judges the suitability of the neighborhood choice from the number of clusters. KPCA is used because it is a nonlinear method built on principal component analysis: it replaces the inner products of data vectors with a kernel function while retaining the properties of PCA. Mapping the original data into a high-dimensional feature space with a nonlinear function requires inner-product computations; evaluating a kernel function on the original data in place of those inner products greatly reduces the computational cost. The Akaike information criterion (AIC) is used to determine the number of clusters when evaluating the clustering of the errors. If the data errors form a single cluster, the chosen neighborhood parameter has not changed the error structure and the neighborhood value is suitable; if they form more than one cluster, the parameter has seriously changed the error structure and the neighborhood value is unsuitable.
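A minimal sketch of this suitability test follows. The abstract gives no implementation, so this realization is an assumption: scikit-learn's KernelPCA performs the error reconstruction, and a Gaussian mixture model whose component count is chosen by AIC stands in for the error-clustering step; `errors` is a hypothetical (n_samples, n_features) array of local reconstruction errors obtained under a candidate neighborhood size.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.mixture import GaussianMixture


def neighborhood_is_suitable(errors: np.ndarray, max_clusters: int = 5) -> bool:
    """Return True if the reconstructed errors form a single cluster."""
    # KPCA with an RBF kernel; fit_inverse_transform=True enables the
    # pre-image (reconstruction) step, replacing explicit inner products
    # in feature space with kernel evaluations on the original data.
    kpca = KernelPCA(n_components=2, kernel="rbf", fit_inverse_transform=True)
    reconstructed = kpca.inverse_transform(kpca.fit_transform(errors))

    # Choose the cluster count that minimizes the AIC.
    aic = [GaussianMixture(n_components=k, random_state=0)
           .fit(reconstructed).aic(reconstructed)
           for k in range(1, max_clusters + 1)]
    best_k = int(np.argmin(aic)) + 1

    # One cluster: the neighborhood parameter left the error structure
    # intact, so the chosen value is judged suitable.
    return best_k == 1
```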
2. A new dimensionality reduction method is explored. Current research shows that the local tangent space alignment (LTSA) method is used relatively rarely; analysis reveals that this is because the method has defects in certain situations. For example, when processing large data sets, the intrinsic structure of the data can become distorted or incomplete, which shows that LTSA does not handle newly added sample points well. The optimized linear discriminant method is a linear dimensionality reduction method that optimizes the Fisher criterion of the original linear discriminant analysis, making the method more convenient to apply. This thesis combines the optimized linear discriminant method with LTSA, using the optimized Fisher criterion to solve and transform the within-class and between-class projection matrices, finally obtaining the optimal projection matrix of the data. By combining the two methods, newly added data points can be handled effectively.
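A minimal sketch of this combination, under assumed details: scikit-learn's LTSA variant of locally linear embedding supplies the nonlinear embedding of the training set, while a Fisher/LDA projection fitted alongside it supplies the explicit linear map that plain LTSA lacks for newly added points. The thesis's specific optimization of the Fisher criterion is not reproduced here; standard LDA stands in for it.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def fit_ltsa_with_linear_map(X: np.ndarray, y: np.ndarray,
                             n_neighbors: int = 10, n_components: int = 2):
    """Embed the training data with LTSA and learn a linear projection."""
    ltsa = LocallyLinearEmbedding(method="ltsa", n_neighbors=n_neighbors,
                                  n_components=n_components)
    Z = ltsa.fit_transform(X)  # nonlinear low-dimensional coordinates

    # LDA maximizes the ratio of between-class to within-class scatter
    # (the Fisher criterion); n_components is capped at n_classes - 1.
    n_comp = min(n_components, len(np.unique(y)) - 1)
    lda = LinearDiscriminantAnalysis(n_components=n_comp).fit(X, y)
    return Z, lda


def project_new_points(lda: LinearDiscriminantAnalysis,
                       X_new: np.ndarray) -> np.ndarray:
    """Map newly added samples through the learned linear projection."""
    return lda.transform(X_new)
```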
3. The application of manifold-learning-based dimensionality reduction to text clustering is explored. Text information is generally obtained by building a matrix from the frequencies of the terms appearing in the texts, and such matrices are high-dimensional. Exploring the intrinsic laws of text data further requires an appropriate dimensionality reduction method, and in recent years data reduction techniques have gradually been applied to text clustering. This thesis applies the LTSA method based on the optimized linear discriminant to reduce high-dimensional text data: it obtains the local neighborhoods and local tangent space coordinates of the data, aligns the local coordinates in the low-dimensional space to express the global coordinates, and aligns the local and global tangent space coordinates by minimizing the local error. To obtain a good visualization, the k-means method is used to cluster the processed data, and an entropy measure is used to evaluate the clustering quality. A sketch of this pipeline is given below.
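A minimal end-to-end sketch of the pipeline, with assumed details throughout: a TF-IDF term matrix stands in for the term-frequency matrix, scikit-learn's LTSA performs the reduction, k-means clusters the low-dimensional coordinates, and the usual size-weighted cluster-entropy measure scores the result. The documents and topic labels are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.cluster import KMeans


def cluster_entropy(labels_true: np.ndarray, labels_pred: np.ndarray) -> float:
    """Size-weighted entropy of the true classes inside each cluster
    (lower is better; 0 means every cluster is pure)."""
    total, entropy = len(labels_true), 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        p = np.bincount(members) / len(members)
        p = p[p > 0]
        entropy += (len(members) / total) * -(p * np.log2(p)).sum()
    return entropy


docs = ["manifold learning reduces the dimension of data",
        "tangent space alignment embeds the data manifold",
        "local coordinates are aligned to global coordinates",
        "nonlinear reduction explores geometric structure",
        "k-means groups the documents into clusters",
        "entropy evaluates the quality of text clusters",
        "term frequency matrices represent the documents",
        "clustering text documents after dimension reduction"]
true_topics = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = TfidfVectorizer().fit_transform(docs).toarray()
Z = LocallyLinearEmbedding(method="ltsa", n_neighbors=5,
                           n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(cluster_entropy(true_topics, labels))
```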
