Random forest is a model with distinctive characteristics, not an omnipotent model
  • English title: Random forest is a specific algorithm, not omnipotent for all datasets
  • Author: LI Xin-Hai
  • Affiliations: Institute of Zoology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Keywords: random forest; partial effect; interaction; multicollinearity; R
  • Journal (Chinese): 应用昆虫学报
  • Journal (English): Chinese Journal of Applied Entomology
  • Publication date: 2019-01-26
  • Publisher: Chinese Journal of Applied Entomology
  • Year: 2019
  • Issue: 01
  • Funding: National Natural Science Foundation of China, General Program (31772479; 31572287)
  • Language: Chinese
  • Pages: 172-181
  • Page count: 10
  • CN: 11-6020/Q
  • ISSN: 2095-1353
  • Classification code: TP181
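Abstract
The random forest model has attracted wide attention since its publication in 2001. Because random forest can perform several kinds of statistical analysis, including regression and classification, and is not bound by the assumptions of parametric tests such as normality, homogeneity of variance, and independence of explanatory variables, its use has become increasingly common, and there is a tendency to regard it as an omnipotent model. In fact, random forest is a model with distinctive characteristics: it fits observations through local optimization, and its results are often inaccurate when the data involve partial effects. Using distribution data of Cicadidae species as an example, this paper compares random forest with multiple linear regression, generalized additive models, and artificial neural networks in regression analysis, and with linear discriminant analysis in classification, and it highlights the fragmented nature of random forest predictions. The results show that random forest had the highest accuracy when handling data with multicollinearity and interactions, and in discriminant analysis. Given the limitations of random forest, analysts are advised to compare multiple models. The R code in the paper can serve as a reference for researchers.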
        Random forest has gained extensive attention since its publication in 2001. It can handle both regression and classification with minimal assumptions (no need for normality, homogeneity of variance, or independence among explanatory variables), so its applications have increased dramatically. Some even use it as an omnipotent tool for all analyses. In fact, random forest is a specific algorithm with clear characteristics. It is an ensemble method that constructs a number of decision trees and fits data by local optimization. When the data have strong partial effects, random forest usually does not fit well. I compared the performance of random forest with that of multiple linear regression, generalized additive models, and artificial neural networks, using occurrence data of Cicadidae species. The results showed that, although the predictions of random forest looked fragmented, it outperformed the other three models. Random forest also performed better than linear discriminant analysis for classification. Random forest has its strengths and weaknesses. I suggest using multiple models for data analysis rather than relying on one "powerful" model.
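The abstract's point that random forest fits data locally, giving fragmented predictions, can be illustrated with a small sketch. This is not the author's R code, and it does not use the Cicadidae data. It is a pure-Python illustration on simulated data. All function names are hypothetical, and the linear relationship y = 2x + noise is an assumption made for the example. With a single predictor, a random forest reduces to bagged regression trees, which is what the sketch builds. It then compares the ensemble with an ordinary least-squares line.

```python
import random

def fit_tree(xs, ys, depth=0, max_depth=4, min_leaf=3):
    """Fit a 1-D regression tree: a leaf is a mean, a node is (threshold, left, right)."""
    mean = sum(ys) / len(ys)
    if depth >= max_depth or len(xs) < 2 * min_leaf:
        return mean
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sx = [xs[i] for i in order]
    sy = [ys[i] for i in order]
    best = None  # (sum of squared errors, split position)
    for k in range(min_leaf, len(sx) - min_leaf + 1):
        if sx[k] == sx[k - 1]:
            continue  # cannot split between identical x values
        left, right = sy[:k], sy[k:]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        return mean
    k = best[1]
    threshold = (sx[k - 1] + sx[k]) / 2
    return (threshold,
            fit_tree(sx[:k], sy[:k], depth + 1, max_depth, min_leaf),
            fit_tree(sx[k:], sy[k:], depth + 1, max_depth, min_leaf))

def predict_tree(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def fit_forest(xs, ys, n_trees, rng):
    """Bagging: each tree is fitted to a bootstrap resample of the data."""
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the piecewise-constant predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

# Simulated data (assumption): a linear trend observed only on 0 <= x <= 10.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]

forest = fit_forest(xs, ys, n_trees=30, rng=rng)
slope, intercept = fit_line(xs, ys)

# Columns: x, bagged-tree prediction, linear-regression prediction.
for x in (5.0, 15.0, 20.0):
    print(x, round(predict_forest(forest, x), 2), round(slope * x + intercept, 2))
```

Inside the sampled range the two models give similar predictions. Beyond it, the tree ensemble returns the same value at x = 15 and x = 20, because every tree routes both points to its rightmost leaf, while the regression line continues the trend. This shows the local, piecewise-constant behaviour only on this toy case and says nothing about which model is more accurate on real data.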
