Research on the Heterogeneous Value Difference Metric and Its Application in MDS
Abstract
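Multidimensional scaling (MDS) is a traditional multivariate statistical method. In the decades since it was proposed, research has deepened and its range of application has grown steadily wider. Applied research on MDS remains very active. MDS has been widely applied in many fields, including economics, management, psychology, sociology, archaeology, biology, medicine, chemistry, and network analysis, and has produced good economic and social benefits. In this respect, research outside China leads the way.
     MDS generally operates on pairwise similarity measures among a set of objects. These measures are usually based on the distances between objects, and the choice of distance calculation method influences the results of MDS to a considerable degree, especially when objects are described by attributes of mixed types. At present, MDS is usually based on Euclidean distance. However, because Euclidean distance depends on the units of each indicator and does not take correlations among indicators into account, the results of MDS are affected to some extent. In particular, Euclidean distance does not handle nominal attributes directly. To process nominal attributes, the Euclidean approach usually first replaces nominal values with numbers and then treats them as numeric attributes, which fundamentally negates the inherent character of nominal attributes and causes loss of information.
     On the other hand, MDS is divided into metric MDS and non-metric MDS, the former used for quantitative processing and the latter for qualitative processing. Non-metric MDS places only loose requirements on the relationship between the dissimilarities (similarities) among objects and the distances among them: the relationship need only satisfy a monotonic rank order and need not be expressed quantitatively. Non-metric MDS is therefore fairly effective for ordinal data. For nominal data, however, metric MDS is not necessarily effective. Because metric MDS performs quantitative analysis, it can reveal the internal structure of data more precisely than non-metric MDS, which performs qualitative analysis. For data sets that are complete and contain nominal data, we can therefore consider using metric MDS, provided the preprocessing of the nominal data is improved.
     Therefore, in view of the limitations of Euclidean distance and the characteristics of MDS itself, we adopt the Heterogeneous Value Difference Metric (HVDM), modified to suit the practical problem, to preprocess the data and so improve the precision of MDS on nominal data. Experiments on the UCI Abalone data set show that this method performs better than the traditional quantification method in both reconstruction ability and reconstruction accuracy.
     In the real world, objects need to be described from many aspects, so computing distances between objects with mixed-type attributes that include nominal data is fairly common. Our work therefore provides some support for this task.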
Multidimensional Scaling (MDS) is a traditional multivariate statistical method. In the decades since it was proposed, continued study has steadily widened its range of application, and applied research on MDS is still very active. MDS has been widely used in many different areas, such as economics, management science, psychology, sociology, archaeology, biology, medicine, chemistry, and network analysis, and has yielded good economic and social benefits. In this regard, researchers outside China are at the forefront.
     MDS is run on a (dis)similarity matrix, which is obtained by calculating the distances between objects on nondimensionalized data. The method used to calculate distance has a great impact on the output of MDS, especially when the objects are described by mixed attributes. In general, MDS uses Euclidean distance to measure the (dis)similarity of objects. But because Euclidean distance depends on the units of the attributes and ignores the correlation between attributes, the output of MDS is affected to some extent. In particular, if objects have nominal attributes, such as sex or color, common practice is to replace the nominal values with numbers first and then apply Euclidean distance. This approach is not reasonable, because it negates the inherent characteristics of nominal attributes and results in loss of information.
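The information loss described above can be seen in a small sketch. The three records and the integer codes assigned to the sex values are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import numpy as np

# Hypothetical toy records: (sex, length). The coding M=0, F=1, I=2 is an
# arbitrary labeling assumed for this illustration.
codes = {"M": 0, "F": 1, "I": 2}
records = [("M", 0.5), ("F", 0.5), ("I", 0.5)]

# Replace each nominal value with its integer code, then treat it as numeric.
X = np.array([[codes[sex], length] for sex, length in records], dtype=float)


def euclidean(u, v):
    """Plain Euclidean distance between two numeric vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


d_mf = euclidean(X[0], X[1])  # distance between the M record and the F record
d_mi = euclidean(X[0], X[2])  # distance between the M record and the I record
print(d_mf, d_mi)
```

Under this coding the I record ends up twice as far from the M record as the F record does, although the nominal values carry no order. A different but equally arbitrary coding would change the distances, which is the artefact the abstract objects to.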
     On the other hand, MDS has two types: metric MDS for quantitative processing and non-metric MDS for qualitative processing. Metric MDS creates a configuration of points whose inter-point distances approximate the given dissimilarities. Instead of trying to approximate the dissimilarities themselves, non-metric MDS approximates a nonlinear, but monotonic, transformation of them. So non-metric MDS works better on ordinal data, but not necessarily on nominal data. Because metric MDS is quantitative and can reveal the internal structure of data more accurately than non-metric MDS, we prefer to adopt metric MDS on a complete data set that contains nominal data, on the assumption that the nominal data can be preprocessed appropriately.
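As a minimal sketch of how metric MDS turns dissimilarities into a point configuration, the code below implements classical (Torgerson) scaling, one common form of metric MDS. The abstract does not say which metric MDS algorithm the thesis uses, so the choice of classical scaling, the function name `classical_mds`, and the four test points are assumptions made for illustration.

```python
import numpy as np


def classical_mds(D, k=2):
    """Classical (Torgerson) scaling: embed n objects in k dimensions
    so that inter-point distances approximate the dissimilarities D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)


def pairwise_distances(P):
    """Euclidean distance matrix of the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical planar points; their distance matrix is the MDS input.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = pairwise_distances(points)
Y = classical_mds(D, k=2)
D_hat = pairwise_distances(Y)                  # distances in the recovered layout
```

Because the input here is an exact Euclidean distance matrix of planar points, the two-dimensional configuration reproduces the distances up to rotation, reflection, and translation. With dissimilarities from a non-Euclidean measure the reconstruction is only approximate.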
     Therefore, considering the limitations of Euclidean distance and the characteristics of MDS itself, we apply the Heterogeneous Value Difference Metric (HVDM), a distance metric that computes distance for nominal attributes differently from Euclidean distance, to MDS so that nominal attributes are handled more reasonably. Experimental results on the UCI Abalone dataset show that the proposed method gives promising results in both reconstruction ability and accuracy.
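The sketch below follows the HVDM definition published by Wilson and Martinez (1997): numeric attributes contribute |x - y| / (4σ), and nominal attributes contribute the normalized value difference (the `normalized_vdm2` variant) computed from class-conditional frequencies. It does not reproduce the problem-specific modifications made in the thesis, which the abstract does not describe. The function names, the toy records, and the class labels are hypothetical, and missing-value handling is omitted on the assumption that the data set is complete.

```python
import numpy as np
from collections import Counter, defaultdict


def fit_hvdm(X_num, X_nom, y):
    """Precompute the standard deviation of each numeric column and, for each
    nominal column, the class distribution P(c | value) of every value."""
    sigma = np.asarray(X_num, dtype=float).std(axis=0)
    classes = sorted(set(y))
    probs = []
    for a in range(len(X_nom[0])):
        counts = defaultdict(Counter)
        for row, c in zip(X_nom, y):
            counts[row[a]][c] += 1
        probs.append({
            v: np.array([cnt[c] / sum(cnt.values()) for c in classes])
            for v, cnt in counts.items()
        })
    return sigma, probs


def hvdm(x_num, x_nom, z_num, z_nom, sigma, probs):
    """HVDM distance between two records (no missing values assumed)."""
    total = 0.0
    for a, s in enumerate(sigma):
        total += (abs(x_num[a] - z_num[a]) / (4.0 * s)) ** 2   # numeric term
    for a, table in enumerate(probs):
        diff = table[x_nom[a]] - table[z_nom[a]]
        total += float(np.sum(diff ** 2))                      # squared normalized_vdm2
    return float(np.sqrt(total))


# Hypothetical toy data: one numeric and one nominal attribute plus a class.
X_num = [[0.2], [0.4], [0.6], [0.8]]
X_nom = [["M"], ["M"], ["F"], ["I"]]
y = ["young", "young", "old", "old"]

sigma, probs = fit_hvdm(X_num, X_nom, y)
n = len(y)
D = np.array([[hvdm(X_num[i], X_nom[i], X_num[j], X_nom[j], sigma, probs)
               for j in range(n)] for i in range(n)])
```

The resulting matrix `D` is symmetric with a zero diagonal, so it can be passed to a metric MDS routine in place of a Euclidean distance matrix. Note that HVDM needs a class variable to compute the nominal term; which variable the thesis used for the Abalone data is not stated in the abstract.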
     In the real world, the characteristics of an object need to be described from different aspects, so computing distances between mixed-attribute objects that contain nominal data is a common task. Our work provides some support for this.
