Research on Feature-Based and Instance-Based Data Reduction Methods for Massive Data
Abstract (Chinese)
With the continuous development of information and computer technology, massive data sets have emerged. Complex analysis of such massive data is usually very time-consuming. Although methods have been proposed that reduce the time needed to process a massive data set, and the space needed to store it, by reducing the number of features per instance and the number of instances in the set, these methods all have limitations. Exploring generally applicable data reduction methods that reduce both the number of features and the number of instances is therefore of important theoretical and practical significance for improving the efficiency of massive data processing.
     To address the shortcomings of existing instance-selection-based data reduction methods, and drawing on the advantages the Hausdorff distance has shown in measuring the similarity of two data sets, the Hausdorff distance is used as the criterion for deciding whether an instance in the original data set is selected, so that "representative" instances are chosen from the original data set. Since computing the Hausdorff distance is expensive, a K-NN (K-Nearest Neighbor) search algorithm is first used to split the data set evenly into smaller subsets. On this basis, an instance-oriented data reduction method based on the local Hausdorff distance is proposed.
     For the efficient reduction of massive multi-feature data sets, a center-point-based K-NN search algorithm is proposed, motivated by strengthening the uniformity of the data around a target instance, reducing the neighbor-representation error, and improving the dimensionality-reduction result. The algorithm ensures that the neighbors found for an instance are distributed around it as uniformly as possible. To make the value of K track the characteristics of non-uniform data sets, a method is proposed that dynamically adjusts K according to changes in the local uniformity of the data set, and rules for varying K are formulated that meet the dimensionality-reduction needs of the LLE (Locally Linear Embedding) algorithm on non-uniform data sets. On this basis, a variable-K data reduction method based on the LLE algorithm is proposed.
     Since changes to the instances in a data set generally affect both classification results and the statistical characteristics of the data set, a method is established that evaluates reduction methods comprehensively using classification techniques and spatial statistics. To measure classification performance, a composite measure of classification accuracy is given, based on an analysis of how class radius, inter-class distance, and the number of instances per class affect accuracy. The similarity between the data sets before and after reduction is evaluated by analyzing three statistical characteristics: the frequency distribution of instances, the quantiles of the data sets, and the distance between data sets. In addition, based on an analysis of the role of Moran's I, Moran's I is used to measure the change in spatial autocorrelation before and after reduction, thereby evaluating the reduction effect.
     This research on feature-selection-based and instance-selection-based data reduction methods, and on the evaluation of reduction effects, has produced several results that help improve the efficiency of massive data processing and the rapid acquisition of useful information.
With the development of computer technology, recording and spreading information has become more and more convenient, and information-collecting technology has advanced rapidly; this has generated massive data sets. Processing such massive data is time-consuming. Many methods have been proposed to reduce processing time and storage by reducing the number of instances and the number of features per instance, but these methods have their own limitations. Research on a universal method for reducing features and instances, together with a reasonable evaluation of the reduction, is therefore of great theoretical and practical significance.
     By analyzing the limitations of current instance-selection methods and the advantages of the Hausdorff distance in measuring the similarity of two data sets, this thesis proposes a data reduction method based on the local Hausdorff distance. The method uses the Hausdorff distance as the criterion for selecting representative instances. To reduce the cost of computing the Hausdorff distance, a K-NN search method first splits the original data set into smaller subsets, and the Hausdorff distance is then used to select the representative instances within each subset.
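The selection idea can be illustrated with a minimal pure-Python sketch. The greedy criterion below (grow a subset that stays Hausdorff-close to the full set) and the choice of seed are illustrative assumptions, not the thesis's exact algorithm, which additionally partitions the data with K-NN search before applying the local Hausdorff distance.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(P, Q):
        # worst case over P of the distance to the nearest point in Q
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def select_representatives(X, k):
    """Greedy sketch: repeatedly add the instance that most reduces the
    Hausdorff distance between the selected subset and the full set."""
    chosen = [0]                      # illustrative seed: first instance
    while len(chosen) < k:
        best_i, best_d = None, math.inf
        for i in range(len(X)):
            if i in chosen:
                continue
            d = hausdorff([X[j] for j in chosen] + [X[i]], X)
            if d < best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen
```

On two well-separated pairs of points, for example, the second pick lands in the cluster not yet covered by the seed, which is the "representative" behaviour the selection criterion aims for.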
     To overcome the shortcomings of the traditional LLE algorithm in reducing the dimensionality of non-uniform data sets, the thesis analyzes the impact of the error between an instance and its neighbor-based representation. To improve the dimensionality-reduction result, it introduces a self-adapting algorithm that calculates the parameter K so as to reduce this error, together with an LLE-based data reduction scheme using the variable parameter K. The thesis also proposes a scheme that adjusts K dynamically with the variation of local uniformity, which reduces the representation error and satisfies the needs of dimensionality reduction on non-uniform data sets. Finally, it presents a center-point-based search algorithm that improves the uniformity of an instance's near neighbors and reduces the neighbor-representation error.
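A minimal sketch of the variable-K idea follows, assuming a toy adjustment rule: K is scaled by the ratio of the local to the global mean neighbor distance and clipped to a range. The specific rule, the `k_base`/`k_min`/`k_max` parameters, and the uniformity proxy are hypothetical stand-ins; the thesis derives its rule from local uniformity for LLE and is not reproduced here.

```python
import math

def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (brute force)."""
    order = sorted(range(len(X)), key=lambda j: math.dist(X[i], X[j]))
    return order[1:k + 1]             # order[0] is the point itself

def mean_nn_dist(X, i, k):
    """Mean distance from X[i] to its k nearest neighbours."""
    return sum(math.dist(X[i], X[j]) for j in knn_indices(X, i, k)) / k

def adaptive_k(X, i, k_base=3, k_min=2, k_max=6):
    """Hypothetical rule: grow K where the data are sparse, shrink it
    where they are dense, using the local/global mean k_base-NN
    distance ratio as a crude uniformity measure."""
    local = mean_nn_dist(X, i, k_base)
    global_mean = sum(mean_nn_dist(X, p, k_base)
                      for p in range(len(X))) / len(X)
    k = round(k_base * local / global_mean)
    return max(k_min, min(k_max, k))
```

On a data set mixing a tight cluster with widely spaced points, a point inside the cluster gets a small K while an outlying point gets a large one — the qualitative behaviour a variable-K LLE scheme relies on for non-uniform data.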
     Since changes to the instances in a data set affect both the classification results and the statistical nature of the data set, the thesis proposes a scheme that uses classification and spatial statistics to evaluate the result of data reduction. By analyzing the effects of class radius, inter-class distance, and the number of instances per class on classification accuracy, it gives a method for computing a composite classification accuracy, which evaluates the result of the classification scheme. Based on an analysis of the frequency distribution, the quantiles, and the distance between data sets, it proposes a scheme to evaluate the similarity between data sets. Since spatial autocorrelation reflects the distribution of instances in a data set, Moran's I is used to measure autocorrelation and to evaluate the change in autocorrelation between the original and reduced data sets.
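Moran's I is a standard statistic, so the autocorrelation part of the evaluation can be computed directly. The sketch below implements global Moran's I for a one-dimensional attribute; the chain-adjacency weight matrix in the usage example is illustrative, and how the thesis actually builds the weight matrix is not assumed here.

```python
def morans_i(values, weights):
    """Global Moran's I:
        I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2
    where m is the mean of the values and W = sum_ij w_ij.
    weights[i][j] encodes spatial adjacency, with w_ii = 0."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    W = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / W) * (num / den)
```

Computing Moran's I on the original and the reduced data set with weight matrices built the same way (e.g. from K-NN adjacency) then quantifies how much the reduction has disturbed the spatial autocorrelation, which is the evaluation role described above.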
     The research on data reduction based on feature selection and instance selection, and on the evaluation of reduction results, achieves theoretical and practical value. These achievements also help improve the efficiency of processing massive data and of obtaining valuable information.