Predicting rare earth elements from major elements in ocean island basalts using machine learning
  • English title: Prediction of REEs in OIB by major elements based on machine learning
  • Authors: HONG Jin; GAN Chengshi; LIU Jie
  • Affiliations: School of Earth Sciences and Engineering, Sun Yat-sen University; Guangdong Provincial Key Laboratory of Mineral Resources & Geological Processes
  • Keywords: machine learning; random forest; ocean island basalt; major elements; rare earth elements
  • Journal code: DXQY
  • Journal: Earth Science Frontiers (地学前缘)
  • Publication date: 2019-07-11 17:10
  • Publisher: Earth Science Frontiers
  • Year: 2019
  • Issue: v.26; No.138
  • Funding: National Key Research and Development Program of China (2016YFC0600506); National Natural Science Foundation of China (41574087)
  • Language: Chinese
  • Record ID: DXQY201904008
  • Page count: 10
  • Issue number (within year): 04
  • CN: 11-3370/P
  • Pages: 49-58
Abstract
Shared geoscience databases such as GEOROC and PetDB provide essential baseline data for Earth science research. These databases share a clear shortcoming, however: samples generally have accurate data for the nine major elements (SiO2, TiO2, Al2O3, CaO, MgO, MnO, K2O, Na2O and P2O5), whereas rare earth element (REE) data are largely missing. Given the importance of REEs in geochemistry, we propose a way to fill in the missing REE values: using the random forest method from machine learning to predict REE values from the nine major elements. Taking ocean island basalt (OIB) as an example, we split 1,283 OIB samples collected from GEOROC in an 8:2 ratio, using 80% of the data as the training set for model building and 20% as the test set for model validation. We compared random forest with multiple linear regression on the same data and found that random forest outperformed multiple linear regression in both regression modeling and prediction, and that this advantage became more pronounced as the relationship between input and output parameters grew more complex. Random forest predicted the test set well overall, but its predictive performance declined gradually with increasing REE atomic number. This may be because REEs with higher atomic numbers are more weakly related to the major elements, or because that relationship is more complex. In addition, the REE patterns predicted by random forest agreed closely with the measured patterns and discriminated well among samples, reflecting the relative differences between measured patterns, which is particularly important for inferring geochemical processes. As the amount of training data increases, random forest models become more stable and their predictions more accurate. As the databases continue to improve, predicting their missing REE values should therefore become more reliable and more feasible.
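The workflow described in the abstract (an 8:2 train/test split, then random forest compared with multiple linear regression on the same data) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses synthetic data in place of GEOROC samples, and the hyperparameters, the random seeds, and the choice of R^2 as the comparison metric are all assumptions that the abstract does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, NOT GEOROC values. The 1,283 rows mirror the
# sample count in the abstract; the nine columns stand in for the oxides.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1283, 9))

# A deliberately nonlinear target playing the role of one REE value
# (hypothetical relationship, chosen only to illustrate the comparison).
y = X[:, 0] ** 2 + 0.5 * np.abs(X[:, 1]) + rng.normal(0.0, 0.05, size=1283)

# 8:2 split into training and test sets, as described in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit both models on the same training data.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# Compare predictive skill on the held-out 20% (R^2 is an assumed metric).
r2_rf = r2_score(y_test, rf.predict(X_test))
r2_lin = r2_score(y_test, lin.predict(X_test))
print(f"random forest R^2 = {r2_rf:.3f}, linear regression R^2 = {r2_lin:.3f}")
```

With real data, `X` would hold the nine major-element concentrations and `y` one REE at a time (or a two-dimensional array covering all REEs, which `RandomForestRegressor` also accepts). Because the synthetic target here is strongly nonlinear, the gap between the two models is larger than one should expect on measured compositions.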
