Research on Web-based Spatial Data Crawling and Measurement

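English title: Research on Web-based Spatial Data Grab and Evaluation
Author: Wang Mingjun (王明军)
Degree level: Doctoral
Discipline: Cartography and Geographic Information Engineering
Chinese keywords (translated): spatially sensitive crawler; spatial data crawling; confidence measurement; spatial data classification tags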
English keywords: spatial data sensitive crawler; spatial data grab; measure of confidence level; the categories and tags of spatial data
Degree year: 2013
Advisor: Du Qingyun (杜清运)
Discipline code: 081603
Degree-granting institution: Wuhan University
Submission date: 2013-05-01

Abstract

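The rapid development of Web technology has given people access to a wealth of information, but it has also produced a great deal of redundant information. How to quickly pinpoint what a user needs is one of the common problems in Web retrieval today. This is especially true in the spatial information domain: spatial data carries both geometric and attribute information, and in a network environment this distinctive combination can only be conveyed separately, through textual descriptions and through geometric graphics. At present, spatial information retrieval focuses mainly on matching textual descriptions, and retrieval of spatial geometric information has received comparatively little study.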
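     Building on an analysis of the problems in spatial information retrieval in the current network environment, this thesis examines the main research areas involved in addressing them and reviews domestic and international progress in those areas. Starting from Web crawling, it discusses the main characteristics and classification system of spatial information in a networked environment, explores methods for parsing and recognizing different types of spatial data, and sets out basic methods for measuring data confidence for different data types and their corresponding pages. It also extends the spatial data classification system by proposing a classification-tag system for crawled spatial data, on which the storage, management, and later use of spatial data are based. Finally, an example model is used to verify parts of the spatial data crawling process, with a corresponding quality evaluation and analysis.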
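     For different types of spatial data, the thesis examines in depth the basic principles and methods of crawling data with a spatially sensitive crawler. It first introduces the concept of the spatially sensitive crawler, describing its similarities to and differences from traditional crawlers and its workflow, along with spatially sensitive pages and the measurement of similarity between the spatial information in page links and spatial search terms. It then focuses on discovery mechanisms for different types of spatial data, namely spatial data services, raster data, vector data, and other data. For each type it discusses how the data appears in Web pages and the basic parsing process, and gives the necessary explanation of the main algorithms and models involved.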
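As a rough illustration of scoring a page against spatial search terms, the sketch below uses cosine similarity over term-frequency vectors. The abstract does not say which similarity measure the thesis uses, so the measure, the function names, and the `threshold` value are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count occurrences of each alphanumeric token."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def spatial_relevance(page_text, spatial_terms):
    """Similarity between a page's text and a list of spatial search terms."""
    return cosine_similarity(term_vector(page_text), term_vector(" ".join(spatial_terms)))


def is_spatially_sensitive(page_text, spatial_terms, threshold=0.2):
    """Hypothetical filter: keep pages whose relevance reaches the threshold."""
    return spatial_relevance(page_text, spatial_terms) >= threshold
```

A crawler built along these lines would follow links only from pages that pass `is_spatially_sensitive`; a real system would also need tokenization suited to the page language, which this sketch leaves out.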
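     The thesis proposes a confidence measurement method for Web spatial data. Because Web spatial data lacks descriptive information, its quality is hard to assess accurately, which makes later retrieval and use of the data relatively difficult. Drawing on basic methods from spatial data quality and considering both the textual description of the data and the information in the data itself, the thesis proposes a qualitative method for measuring vector and raster data. It then analyzes and compares the confidence of different spatial data types; for different resources linked to the same spatially sensitive page, the larger confidence value is taken as the best match for the whole page.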
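The page-level rule described above (take the largest confidence among the resources linked from one page) can be sketched as follows. The `Resource` fields and the assumption that confidence values lie in [0, 1] are hypothetical; the abstract does not describe how individual confidence values are computed.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    url: str           # hypothetical: where the resource was found
    kind: str          # hypothetical: e.g. "vector", "raster", "service"
    confidence: float  # assumed to lie in [0, 1]


def page_confidence(resources):
    """Return (confidence, best resource) for a page, or (0.0, None) if it links to none."""
    if not resources:
        return 0.0, None
    best = max(resources, key=lambda r: r.confidence)
    return best.confidence, best
```

Returning the best resource alongside the score keeps track of which dataset justified the page's rating, which may be useful when the page is indexed later.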
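     Combining the metadata model with current spatial data classification systems, the thesis proposes the idea of classification tags for Web spatial data. Because spatial data on the Web differs in scale of representation, extent, features, and so on, it is hard to categorize with traditional classification systems, and a new way of recording its descriptive information is needed. Drawing on the metadata model and on classification systems related to data applications, the thesis proposes a classification-tag system model. On this basis, it briefly describes how Web data is stored and managed after acquisition and how it is later retrieved and used.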
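     Using an example model, the thesis carries out the necessary verification of the whole spatially sensitive crawler, from page filtering through information extraction to basic quality evaluation. It analyzes and summarizes the inconsistencies between theory and practice, which show how complex the problem of crawling spatial data from the Web is, and lays a theoretical and practical foundation for follow-up research.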
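     Finally, the thesis summarizes each stage of the overall spatial information crawling workflow and proposes several directions for further research.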
Just as every coin has two sides, so does the rapid development of Web technology. Through the Web, people can read abundant useful information from around the world. At the same time, readers also have to deal with a huge amount of redundant online information, especially in the field of geospatial information. Geospatial data includes both attribute information and geometric information, which sets it apart from other kinds of data, and it can only be represented through textual descriptions and geometric graphics. So far, research on geospatial information retrieval has focused mainly on describing and matching text, with less attention paid to geometric information.
     This thesis first analyzes the problems in geospatial information retrieval and reviews related research progress worldwide. Building on that analysis and review, it then studies how to parse different kinds of geospatial data from Web pages and discusses basic methods for measuring the degree of confidence in a page and in different spatial data. The thesis also extends the classification of spatial data and proposes a system of categories and tags for Web spatial data; this system helps to store and manage large volumes of Web spatial information and to support data applications. Finally, the thesis presents several cases to verify the process of grabbing spatial data, along with an evaluation and analysis of the quality of the relevant spatial data.
     Furthermore, based on the geospatial sensitive crawler, this thesis discusses the algorithmic strategy and the solution scheme for each step of grabbing geospatial data. An important aspect is the method by which the geospatial sensitive crawler parses Web pages. This method is statistical, and different algorithms can be applied to compute spatial correlation in order to identify Web pages that are highly sensitive to geospatial information. In addition, the thesis studies both Web service discovery and the parsing of geospatial information. Web service discovery for geospatial information involves three kinds of service description: OWL-S, WSDL, and OGC Capabilities. OGC Capabilities is a mature and well-known specification in the field of geospatial information services. The parsing of different types of geospatial data is then discussed, including the basic parsing of raster data, vector data, and data interchange formats.
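To make the OGC Capabilities route concrete, here is a minimal sketch of how a crawler might list the named layers in a WMS Capabilities document using only the Python standard library. It is not the thesis's implementation: the helper names and the `SAMPLE` document are invented for illustration, and tags are matched by local name so that namespaced and non-namespaced responses are handled alike.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def list_layers(capabilities_xml):
    """Return (name, title) pairs for every named Layer element, in document order."""
    root = ET.fromstring(capabilities_xml)
    layers = []
    for elem in root.iter():
        if local_name(elem.tag) != "Layer":
            continue
        # Collect the text of the direct children (Name, Title, ...) of this Layer.
        fields = {local_name(child.tag): (child.text or "").strip() for child in elem}
        if fields.get("Name"):
            layers.append((fields["Name"], fields.get("Title", "")))
    return layers


# Invented sample document, shaped like a WMS 1.3.0 Capabilities response.
SAMPLE = """<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0">
  <Capability>
    <Layer>
      <Title>Root</Title>
      <Layer><Name>roads</Name><Title>Road network</Title></Layer>
      <Layer><Name>dem</Name><Title>Elevation</Title></Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>"""

print(list_layers(SAMPLE))
```

The unnamed root layer is skipped because only layers with a `Name` can be requested from the service; the remaining pairs are what a crawler would record as discovered data resources.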
     Moreover, based on the above study, this thesis analyzes methods for fusing the geospatial data grabbed from Web pages that are highly sensitive to geospatial information, and illustrates and examines several of these fusion methods. It also introduces the taxonomies and standards for geospatial information services proposed by different organizations. Geospatial information services cover most current Web services, so this study can serve as a reference for the discovery and composition of Web services. The thesis additionally introduces methods for quality evaluation and fusion of raster and vector data. Compared with vector data, raster data has a relatively simple data structure, so there are more types of fusion method for raster data than for vector data, and the development of vector fusion methods has been relatively slow. The visualization of the non-spatial information that belongs to geospatial information is also covered.
     Moreover, combining the metadata model with the traditional classification of spatial data, this thesis proposes a system of classifications and tags for Web spatial data. Because Web spatial data differs in scale of expression, data extent, elements, and so on, it is difficult to classify with traditional categorization, and a new way of recording the data's descriptive information is needed.
     In addition, case studies are used to validate the whole process of the geospatial sensitive crawler, in sequence: page filtering, information retrieval, and data quality evaluation. The inconsistencies between the theory and the results of the case studies are then analyzed and summarized. These results indicate the complexity of grabbing geospatial data and lay the foundation for further work.
     Finally, this thesis offers some suggestions and related research directions based on the conclusions drawn for each step of the overall workflow for crawling geospatial information.
