A Web Crawler-Based Method for Acquiring Spatial Information of Thematic Agency Data

English title: Research on spatial information acquisition method of agency data based on Web crawler
Authors: YANG Yu (杨宇); SUN Yaqin (孙亚琴); YAN Zhigang (闫志刚)
Author affiliation: School of Environment Science and Spatial Informatics, China University of Mining and Technology
Keywords: ubiquitous network; spatial information acquisition; Web crawler; matrix algorithm; decision tree
Journal name (Chinese): CHKD
Journal name (English): Science of Surveying and Mapping
Institution: 中国矿业大学环境与测绘学院 (China University of Mining and Technology)
Publication date: 2019-02-27 09:11
Publisher: 测绘科学 (Science of Surveying and Mapping)
Year: 2019
Issue: v.44; No.253
Funding: National Natural Science Foundation of China, Young Scientists Fund (41301433); Fundamental Research Funds for the Central Universities (2017XKQY019)
Language: Chinese
Page: CHKD201907019
Page count: 7
CN: 07
ISSN: 11-4415/P
Classification number: 126-131+144

Abstract

To address the lack of spatial and attribute information in massive thematic agency data, this paper proposes a method for acquiring the spatial information of thematic agency data based on a Web crawler framework. The method uses thematic agency information websites as the information source and a depth-first Web crawler as the acquisition strategy. Within the crawler's key functional modules, a token-based string similarity matrix algorithm is designed to improve the accuracy of matching against agency search result lists, and a decision-tree-based algorithm for identifying and extracting administrative division information is proposed to accurately recognize and extract the administrative divisions in address strings. Implementation and experimental testing show that the method effectively collects the spatial and attribute information of thematic agency data with high time efficiency and accuracy, and that it can serve as an effective method for acquiring the spatial information of agency data.
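
The abstract names the two core algorithms without giving their details, so the following is only a minimal Python sketch of how such components might look, not the authors' implementation. It assumes that the similarity score is the longest common subsequence read off a token-by-token match matrix and normalized by the longer token list, and that the "decision tree" is a fixed cascade of suffix tests (province, then city, then district). The function names, the suffix lists, and the pre-tokenized inputs are all hypothetical.

```python
def match_matrix(a, b):
    """Token-by-token match matrix: cell (i, j) is 1 when a[i] == b[j]."""
    return [[1 if x == y else 0 for y in b] for x in a]


def token_similarity(a, b):
    """Similarity in [0, 1] between two token lists (assumed scoring rule).

    The score is the longest common subsequence of tokens, computed over
    the match matrix, divided by the length of the longer list.
    """
    if not a or not b:
        return 0.0
    m = match_matrix(a, b)
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if m[i][j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[-1][-1] / max(len(a), len(b))


def best_match(query, candidates):
    """Pick the candidate token list most similar to the query."""
    return max(candidates, key=lambda c: token_similarity(query, c))


# Hypothetical decision cascade: one node per administrative level,
# each testing for that level's suffixes in the remaining address text.
LEVELS = [
    ("province", ("省", "自治区")),
    ("city", ("市", "自治州")),
    ("district", ("区", "县")),
]


def extract_divisions(address):
    """Split an address into administrative divisions and the remainder."""
    result = {}
    rest = address
    for level, suffixes in LEVELS:
        # Earliest suffix occurrence that leaves a non-empty name before it.
        ends = [rest.find(s) + len(s) for s in suffixes if rest.find(s) > 0]
        if ends:
            cut = min(ends)
            result[level] = rest[:cut]
            rest = rest[cut:]
    return result, rest


if __name__ == "__main__":
    query = ["徐州", "市", "第一", "人民", "医院"]
    listing = [["徐州", "市", "人民", "医院"], ["南京", "市", "儿童", "医院"]]
    print(best_match(query, listing))
    print(extract_divisions("江苏省徐州市铜山区大学路1号"))
```

The suffix cascade is deliberately simple: it misreads addresses in which a suffix character appears inside a street or place name, so a practical version would need a gazetteer of known division names at each node.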

