Research on an Improved High-Dimensional Nonlinear PLS Regression Method and Its Applications
Abstract
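Partial least squares (PLS) regression is a new nonparametric regression method based on the idea of high-dimensional projection. It integrates the functions of multiple regression, principal component analysis, and canonical correlation analysis, and for this reason has been called a second-generation multivariate statistical analysis method. Identifying specific sample points (outlying observations) and reducing the dimension of the variable set are two important data preprocessing steps before regression modeling. Building on the PLS regression model and combining it with nonlinear kernel principal component analysis, binary trees, and other methods, this dissertation proposes an improved nonlinear PLS regression model, a binary tree dimension reduction method, and an evaluation method based on the dimension-reduction binary tree, and extends the methods for identifying specific sample points. The main research contents are as follows: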
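     An improved nonlinear PLS regression model is proposed. Traditional linear and nonlinear PLS regression models compute a linear regression of the dependent variable set on the extracted principal components, without considering that the relationship between the dependent variables and the components may be nonlinear. This dissertation replaces the linear regression of the dependent variable set on each principal component with a regression that can be chosen as linear or nonlinear according to the situation, while each principal component is still expressed as a linear regression equation in the original independent variables. A nonlinear regression model relating automobile fuel consumption to ten other design and performance indicators is also analyzed and built.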
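The idea of choosing a linear or nonlinear regression on each component can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: it handles a single response and only the first component, and the function names, the candidate transforms, and the selection rule (smallest residual sum of squares) are all assumptions made for illustration.

```python
import math

def first_pls_component(X, y):
    """First PLS component for a single response.

    Assumes the columns of X and the vector y are already centred.
    Returns the unit weight vector w and the scores t = X w, so the
    component stays a linear combination of the original variables.
    """
    p = len(X[0])
    w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    return w, t

def fit_simple(u, y):
    """Least-squares fit y = a + b*u; returns (a, b, residual sum of squares)."""
    n = len(u)
    mu, my = sum(u) / n, sum(y) / n
    suu = sum((ui - mu) ** 2 for ui in u)
    b = sum((ui - mu) * (yi - my) for ui, yi in zip(u, y)) / suu
    a = my - b * mu
    rss = sum((yi - a - b * ui) ** 2 for ui, yi in zip(u, y))
    return a, b, rss

def choose_inner_model(t, y, transforms):
    """Fit y = a + b*g(t) for every candidate g and keep the smallest RSS.

    The RSS rule is an assumption; the abstract only says the form is
    chosen according to the situation.
    """
    fits = {name: fit_simple([g(ti) for ti in t], y)
            for name, g in transforms.items()}
    best = min(fits, key=lambda name: fits[name][2])
    return best, fits[best]

# Toy data (hypothetical): y depends on the first column through a cubic.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-8, -1, 0, 1, 8]
w, t = first_pls_component(X, y)
transforms = {"linear": lambda v: v, "cubic": lambda v: v ** 3}
best, (a, b, rss) = choose_inner_model(t, y, transforms)
print(best)  # -> cubic
```

Both candidates here have two parameters, so comparing raw residual sums is fair; candidates with different numbers of parameters would need a penalized criterion instead.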
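     A binary tree dimension reduction method for high-dimensional spaces and an evaluation method based on the dimension-reduction binary tree are proposed. The traditional approach of reducing dimension globally is replaced by a stepwise approach that starts with local reduction and extends to global reduction. If the number of sample variables n is too large, principal component analysis or kernel principal component analysis is applied to the two most strongly correlated variables: the first component is extracted to replace the original two variables, so the number of variables drops to n − 1. This step is repeated until the required precision is met. The whole reduction process takes the form of a binary tree or an incomplete binary tree. Using this evaluation method and the 2008 economic development indicators of the districts and counties of Tianjin, the economic development level of Tianjin's 18 districts and counties is evaluated.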
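The pairwise merging loop can be sketched in a few lines of Python. This is an illustrative reading of the description, not the dissertation's code: it uses ordinary PCA only (no kernel PCA), and the stopping rule is a hypothetical `target_dim` parameter standing in for the precision criterion, which the abstract does not specify. For two standardized variables with correlation r, the first principal component is (z1 + sign(r)·z2)/√2.

```python
import math

def standardize(col):
    """Scale a column to mean 0 and population variance 1."""
    n = len(col)
    m = sum(col) / n
    s = math.sqrt(sum((v - m) ** 2 for v in col) / n)
    return [(v - m) / s for v in col]

def corr(a, b):
    """Correlation of two columns that are already standardized."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def binary_tree_reduce(columns, target_dim):
    """Merge the most correlated pair of variables until target_dim remain.

    columns: dict mapping a variable name to its list of values.
    Returns a dict whose keys are either original names or nested
    tuples (left, right) recording the merge tree.
    """
    nodes = {name: standardize(col) for name, col in columns.items()}
    while len(nodes) > target_dim:
        names = list(nodes)
        pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        a, b = max(pairs, key=lambda p: abs(corr(nodes[p[0]], nodes[p[1]])))
        sign = 1.0 if corr(nodes[a], nodes[b]) >= 0 else -1.0
        # First principal component of the two standardized variables.
        pc = [(x + sign * y) / math.sqrt(2)
              for x, y in zip(nodes.pop(a), nodes.pop(b))]
        nodes[(a, b)] = standardize(pc)  # n variables become n - 1
    return nodes

# Toy data (hypothetical): "a" and "b" are almost collinear.
cols = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 1, 4, 2, 3]}
reduced = binary_tree_reduce(cols, target_dim=2)
print(list(reduced))  # -> ['c', ('a', 'b')]
```

Reducing all the way to one variable yields a single nested key that encodes the whole binary tree.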
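     Methods for identifying specific sample points in high-dimensional spaces are analyzed and extended. Building on PLS-regression-based techniques for identifying specific sample points, the T² ellipse plot in the two-dimensional plane is extended to the T² ellipsoid in three-dimensional space and the T² hyperellipsoid in high-dimensional space. In addition, based on hierarchical clustering, a method for identifying specific sample points from the dendrogram of high-dimensional principal components is proposed and applied to gasoline and diesel prices in China's major provinces and cities.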
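The step from ellipse to hyperellipsoid follows from the form of the statistic, as this minimal Python sketch shows. It assumes the common form T² = Σ_h t_h²/s_h² over m component scores; the exact scaling and the F-based critical value used in the dissertation are not given in the abstract, so the `cutoff` argument is a hypothetical user-supplied threshold. The region T² ≤ cutoff is an ellipse for m = 2, an ellipsoid for m = 3, and a hyperellipsoid for larger m, and the same code covers all three.

```python
def t2_statistics(scores):
    """T² value for each sample.

    scores: list of rows, one per sample, each with m component scores.
    Each score is centred and divided by the sample variance of its
    component, then the squared terms are summed over the m components.
    """
    n, m = len(scores), len(scores[0])
    means = [sum(row[h] for row in scores) / n for h in range(m)]
    variances = [sum((row[h] - means[h]) ** 2 for row in scores) / (n - 1)
                 for h in range(m)]
    return [sum((row[h] - means[h]) ** 2 / variances[h] for h in range(m))
            for row in scores]

def flag_specific_points(scores, cutoff):
    """Indices of samples that fall outside the region T² <= cutoff."""
    return [i for i, t2 in enumerate(t2_statistics(scores)) if t2 > cutoff]

# Toy scores on two components (hypothetical); the last sample is far away.
scores = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.3),
          (0.2, -0.1), (-0.1, 0.1), (3.0, -2.5)]
print(flag_specific_points(scores, cutoff=5.0))  # -> [5]
```

A quick sanity check on any data set is that the T² values sum to m(n − 1), because each component contributes exactly n − 1 after scaling by its sample variance.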
Partial Least-Squares (PLS) regression is a new non-parametric regression method based on high-dimensional projection. It effectively combines the functions of multiple regression analysis, principal component analysis, and canonical correlation analysis, which is why it has been described as the second generation of multivariate statistical analysis methods. Identifying Specific Sample Points and reducing the dimension of a variable set are two important preprocessing steps in data analysis. Based on the PLS regression model and combined with non-linear Kernel Principal Component Analysis, binary trees, and other methods, the dissertation proposed a modified non-linear PLS regression model. A binary tree dimension reduction method and an evaluation method based on the dimension-reduction binary tree were also presented, and the method for identifying Specific Sample Points was extended. The main research contents are as follows:
     An improved non-linear PLS regression model was proposed. Traditional linear and non-linear PLS regression models compute linear regressions between the dependent variable set and the extracted principal components, without considering that the two may be non-linearly related. In the dissertation, the linear regression of the dependent variable set on each principal component was replaced by a linear or non-linear regression chosen according to the specific situation, while each principal component was still expressed as a linear regression equation in the original independent variable set. The dissertation also analyzed and established a non-linear regression model relating automobile fuel consumption to ten other design and performance indicators.
     A binary tree dimension reduction method for high-dimensional spaces and an evaluation method based on the dimension-reduction binary tree were also proposed. The traditional approach of reducing dimensions globally was replaced by one that proceeds from local to global reduction. If the sample has too many variables, Principal Component Analysis or Kernel Principal Component Analysis can be applied to the two most strongly correlated variables: the first component is extracted to replace the original two variables, reducing the number of variables to n − 1. This process is repeated until the required precision is reached. Using this evaluation method and the 2008 economic development indicators of each district and county in Tianjin, the dissertation evaluated the economic development levels of Tianjin's 18 districts and counties.
     The identification of Specific Sample Points in high-dimensional spaces was also extended. Building on PLS-regression-based techniques for identifying Specific Sample Points, the T² ellipse method in the two-dimensional plane was extended to the T² ellipsoid in three-dimensional space and the T² hyperellipsoid in high-dimensional space. In addition, on the basis of hierarchical clustering, a method for identifying Specific Sample Points from the dendrogram of principal components in high-dimensional space was proposed and applied to gasoline and diesel prices in major provinces and cities in China.
