Research on Several New Statistical Learning Methods Based on Chemical Data
Abstract
How to use statistical learning methods to mine the most useful information from increasingly complex data, particularly in structure-activity relationship (SAR) research and spectral data analysis, is one of the hot topics in current applied statistics. Guided by the data-driven paradigm and taking chemical data as the background, this thesis examines the strengths and weaknesses of several classical statistical methods, such as classification and regression trees, support vector machines, and partial least squares, constructs a novel tree kernel, and proposes a series of new statistical learning methods. The thesis consists of seven chapters.
First, the research background and motivation are briefly introduced. Commonly used theories and methods in chemical data analysis are then reviewed and discussed in some detail, and their respective advantages and shortcomings are pointed out; these form the foundation for the new statistical methods developed here. Finally, the main content and innovations of the thesis are outlined (Chapter 1).
Chapter 2 presents the construction of the tree kernel, which is proposed here for the first time and is one of the important innovations of this work. Building on a careful study of the CART algorithm, we point out that samples falling in the same terminal node share not only class similarity but possibly other specific kinds of similarity as well. To obtain trees with diverse structures, a Monte Carlo procedure is coupled with the classification tree algorithm, and a novel tree kernel is constructed using a fuzzy pruning strategy together with an ensemble strategy. The fuzzy pruning strategy effectively exploits the information carried by the internal nodes without completely destroying the tree structure, and the ensemble strategy better captures the regular information in the data and makes the results more stable; this is the original motivation for constructing the tree kernel. During kernel construction a large number of tree models are built: to find the variable subsets most relevant to classification and the sample subsets that share specific similarities in different variable subspaces, each classification tree performs a greedy, though not necessarily globally optimal, search over both the variable space and the sample space. In this way the tree ensemble can effectively reveal the similarity between samples and, at the same time, assess the importance of each variable. The resulting tree kernel therefore has the following advantages. First, it is supervised, because the class information shapes the tree structures during kernel construction. Second, irrelevant variables contribute little to the tree ensemble and hence have little influence on the kernel values, so important variables can be identified effectively. Third, because it is built on the classification tree algorithm, it can handle nonlinear problems.
Then, within the framework of kernel methods, the constructed tree kernel is combined with support vector machines, partial least squares, and the k-nearest neighbor algorithm, yielding three new statistical learning methods: tree kernel support vector machines (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbor classification. Empirical results on three SAR data sets show that the advantages of the tree kernel effectively improve these traditional algorithms.
For high-dimensional spectral data, a new modeling method, PLSSIS, is proposed. The difficulty in analyzing high-dimensional spectral data (such as near-infrared spectra) is that the measured variables are highly collinear and contain a large amount of redundant information. PLS is usually applied in this situation; however, a PLS model built on all original variables also includes the redundant information, which degrades its predictive performance. By using the PLS regression coefficients together with the sure independence screening (SIS) principle to select important variables stepwise, a new variable selection strategy, partial least squares regression based on sure independence screening (PLSSIS), is proposed. PLSSIS is a forward iterative algorithm combining PLSR with SIS that can handle high-dimensional collinear data quickly and effectively. Experimental results on three spectral data sets show that, compared with standard PLS and moving window partial least squares regression (MWPLSR), PLSSIS selects fewer variables and achieves better interpretability and predictive performance.
Finally, Chapter 7 summarizes the whole thesis and offers prospects for future research.
For increasingly complex data, especially in the fields of structure-activity relationships and spectral data, how to mine the most useful information with statistical learning methods is one of the hot issues in current applied statistics research. Under the guidance of the "data-driven" principle and against the background of chemical data, we studied in depth the advantages and disadvantages of some classical statistical methods, such as classification and regression trees, support vector machines, and partial least squares, and creatively proposed several new statistical learning methods. The thesis consists of seven chapters.
First, we briefly introduce the research background and motivation, and then review some theories and methods of statistical learning for chemical data analysis; these form the foundation of the new statistical learning methods. Finally, we introduce the main content and innovations of this thesis (Chapter 1).
In Chapter 2, the tree kernel is constructed and proposed for the first time; it is one of the most important innovations of this work. We discuss the classification and regression tree (CART) algorithm in detail and point out that samples under the same terminal node may possess some other specific similarity, rather than being limited to class similarity. To obtain diverse tree structures, we couple a Monte Carlo procedure with the classification tree algorithm and construct a novel tree kernel by means of a fuzzy pruning strategy and an ensemble strategy. The fuzzy pruning strategy helps to exploit the information of the inner nodes of a tree without totally destroying its structure, and the ensemble strategy guarantees that the results obtained with the tree kernel are more stable and reliable than those of a single CART model and do not arise from chance; this is our original motivation for building the tree kernel. In fact, CART carries out a greedy, but not necessarily globally optimal, search over samples and variables to seek the variable subsets most relevant to classification and the sample subsets with specific similarity in different variable subspaces. The constructed tree kernel has several outstanding advantages: it is "supervised", because the class information dictates the structure of the trees during kernel construction; irrelevant variables contribute little to the tree ensemble and therefore have little influence on the proximity measure, so the tree kernel can easily discover the important variables; and, by means of the classification tree, the constructed tree kernel can effectively deal with nonlinear problems.
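As a rough illustration of this idea, the sketch below (Python, using scikit-learn's DecisionTreeClassifier) grows randomized trees on Monte Carlo samples of rows and variables and takes the kernel value of two samples to be the fraction of trees in which they fall into the same terminal node. This is only a simplified reading of the method: the fuzzy pruning step, which also credits shared internal nodes, is omitted, and all names and parameter choices are illustrative rather than the thesis's exact algorithm.

```python
# Simplified sketch of a tree kernel: K[i, j] is the fraction of randomized
# trees in which samples i and j land in the same terminal node. The thesis's
# fuzzy pruning of internal nodes is not reproduced here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_tree_ensemble(X, y, n_trees=500, random_state=0):
    """Grow classification trees on Monte Carlo samples of rows and variables."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    ensemble = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=n, replace=True)                # bootstrap sample of objects
        cols = rng.choice(p, size=max(1, p // 2), replace=False)  # random variable subset
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        ensemble.append((tree, cols))
    return ensemble

def tree_kernel(ensemble, Xa, Xb):
    """Kernel between the rows of Xa and Xb under a fitted tree ensemble."""
    K = np.zeros((Xa.shape[0], Xb.shape[0]))
    for tree, cols in ensemble:
        la = tree.apply(Xa[:, cols])          # terminal-node index of each Xa sample
        lb = tree.apply(Xb[:, cols])
        K += (la[:, None] == lb[None, :])     # 1 if the pair shares a terminal node
    return K / len(ensemble)
```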
Then, within the framework of kernel methods, we couple the novel tree kernel with support vector machines, partial least squares, and k-nearest neighbors, and present three new statistical learning methods: tree kernel support vector machines (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbors (TKk-NN). Three data sets related to different categorical bioactivities of compounds are used to test the performance of these methods. The results show that the advantages of the constructed tree kernel effectively improve the traditional methods.
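A hedged usage sketch follows, assuming the fit_tree_ensemble and tree_kernel helpers sketched above and a synthetic data set in place of the real SAR data: the precomputed Gram matrix is passed directly to a standard SVM, and the same kernel values serve as similarities for k-NN voting; TKPLS would analogously substitute this matrix for the linear inner product in kernel PLS.

```python
# Usage sketch with synthetic data (the thesis's SAR data sets are not reproduced here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

ensemble = fit_tree_ensemble(X_tr, y_tr)     # trees see only the training data
K_tr = tree_kernel(ensemble, X_tr, X_tr)     # training Gram matrix
K_te = tree_kernel(ensemble, X_te, X_tr)     # cross-kernel: test rows vs. training rows

# TKSVM: feed the precomputed tree kernel to a standard SVM
svm = SVC(kernel='precomputed', C=1.0).fit(K_tr, y_tr)
print("TKSVM accuracy:", (svm.predict(K_te) == y_te).mean())

# TKk-NN: the k training samples with the largest kernel value vote on the class
k = 5
neighbors = np.argsort(-K_te, axis=1)[:, :k]
y_knn = np.array([np.bincount(y_tr[idx]).argmax() for idx in neighbors])
print("TKk-NN accuracy:", (y_knn == y_te).mean())
```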
For high-dimensional spectral data, we propose a novel modeling method, PLSSIS. A difficulty of high-dimensional spectral data analysis lies in the multicollinearity and the large amount of redundant information. PLS is usually employed in this case; however, a calibration model that includes all variables also contains much redundant information, which adversely affects its prediction ability. By employing the PLS regression coefficients and the sure independence screening (SIS) principle, a novel strategy for selecting variables stepwise, named PLS regression combined with sure independence screening (PLSSIS), is developed. PLSSIS is a forward iterative algorithm that combines PLSR with SIS and can deal with high-dimensional collinear data quickly and efficiently. On three spectral data sets, our study shows that PLSSIS selects fewer variables and obtains better prediction than standard PLS modeling and moving window partial least squares regression (MWPLSR).
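The abstract does not spell out the exact screening and stopping rules, so the following is only a plausible sketch of the PLSSIS idea: variables are ranked by the magnitude of the PLS regression coefficients (an SIS-style screening) and entered in forward batches, a batch being retained only if it lowers the cross-validated error. The function name, batch size, and stopping rule are illustrative assumptions, not the thesis's exact algorithm.

```python
# Hedged sketch of a PLSSIS-like forward selection; the actual PLSSIS screening
# and stopping rules in the thesis may differ from this illustration.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def plssis_sketch(X, y, n_components=5, step=10, cv=5):
    p = X.shape[1]
    # SIS-style ranking: absolute PLS regression coefficients of the full model
    full = PLSRegression(n_components=min(n_components, p)).fit(X, y)
    order = np.argsort(-np.abs(full.coef_).ravel())
    selected, best_err = [], np.inf
    for start in range(0, p, step):
        candidate = selected + list(order[start:start + step])
        ncomp = min(n_components, len(candidate))
        err = -cross_val_score(PLSRegression(n_components=ncomp),
                               X[:, candidate], y, cv=cv,
                               scoring='neg_mean_squared_error').mean()
        if err < best_err:                 # keep the batch only if CV error improves
            best_err, selected = err, candidate
    return np.array(selected), best_err
```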
Finally, Chapter 7 summarizes the whole thesis and offers an outlook on future work.