Automatic extraction and structuration of soil–environment relationship information from soil survey reports

English title: Automatic extraction and structuration of soil–environment relationship information from soil survey reports
Authors: WANG De-sheng; LIU Jun-zhi; ZHU A-xing; WANG Shu; ZENG Can-ying; MA Tian-wu
Keywords: soil–environment relationship; text; natural language processing; extraction; structuration
Chinese journal title: ZGNX
English journal title: Journal of Integrative Agriculture (农业科学学报(英文版))
Affiliations: Key Laboratory of Virtual Geographic Environment, Nanjing Normal University; State Key Laboratory Cultivation Base of Geographical Environment Evolution (Jiangsu Province); Jiangsu Center for Collaborative Innovation in Geographic Information Resource Development and Application; State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences; Department of Geography, University of Wisconsin-Madison
Publication date: 2019-02-20
Publisher: Journal of Integrative Agriculture
Year: 2019
Issue: v.18
Funding: Supported by the National Natural Science Foundation of China (41431177 and 41601413); the National Basic Research Program of China (2015CB954102); the Natural Science Research Program of Jiangsu Province, China (BK20150975 and 14KJA170001); and the Outstanding Innovation Team in Colleges and Universities in Jiangsu Province, China
Language: English
Page: ZGNX201902008
Number of pages: 12
CN: 02
ISSN: 10-1039/S
Classification number: 84-95

Abstract

In addition to soil samples, conventional soil maps, and experienced soil surveyors, text about soils (e.g., soil survey reports) is an important potential data source for extracting soil–environment relationships. Because the words describing soil–environment relationships are often mixed with unrelated words, the first step is to extract the needed words and organize them in a structured way. This paper applies natural language processing (NLP) techniques to automatically extract and structure soil–environment relationship information from soil survey reports. The method includes two steps: (1) construction of a knowledge frame and (2) information extraction using either a rule-based method or a statistic-based method, depending on the type of information. For variables described in uniformly written text (slope, elevation, accumulated temperature, annual mean temperature, annual precipitation, and frost-free period), the rule-based method was used. For variables described in diverse writing styles (landform and parent material), the statistic-based method was adopted. The Soil Species of China soil survey reports were selected as the experimental dataset. Precision (P), recall (R), and F1-measure (F1) were used to evaluate the performance of the method. For the rule-based method, the P values were 100%, the R values were above 92%, and the F1 values were above 96% for all variables involved. For the statistic-based method, which used conditional random fields (CRFs), the P, R, and F1 values were 84.15, 83.13, and 83.64%, respectively, for parent material and 88.33, 76.81, and 82.17%, respectively, for landform. To explore the impact of text type on the performance of the CRF-based method, CRF models were trained and validated separately on the descriptive texts of soil types and of typical profiles. For parent material, the maximum F1 value for the descriptive text of soil types was 90.7%, while that for the descriptive text of soil profiles was only 75%. For landform, the maximum F1 value for the descriptive text of soil types was 85.33%, similar to that for the descriptive text of soil profiles (85.71%). These results suggest that NLP techniques are effective for the extraction and structuration of soil–environment relationship information from text data sources.
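
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of the F1-measure as the harmonic mean of precision and recall (the abstract itself does not spell out the formula), and the function name `f1_score` is hypothetical, chosen only for this illustration. The P, R, and F1 figures are the ones reported above for the CRF-based method.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumes the standard F1 definition; both inputs must use the
    same unit (here: percent), and the result is in that unit too.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, taken from the abstract (CRF-based method).
reported = {
    "parent material": (84.15, 83.13, 83.64),
    "landform": (88.33, 76.81, 82.17),
}

for variable, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{variable}: computed F1 = {f1:.2f}%, reported F1 = {f1_reported:.2f}%")
```

Under this assumption, the F1 values recomputed from the reported P and R agree with the reported F1 values to two decimal places.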

