Research on Intelligent Search and Services for Unstructured Web Spatial Information
Abstract
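Unstructured data makes up most of the information resources on the Web, and it is the main data source and object of study for Web search engines. Unstructured spatial data is an important part of these resources, and intelligent search and services for unstructured Web spatial information are the main research topic for general-purpose search engines that aim to provide specialized services in the spatial information domain. This line of work combines search engine technology with technologies such as WebGIS. It can give ordinary users local information services (Local Service) and spatial information retrieval tools, in line with the current trend of information retrieval toward intelligent and personalized services.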
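    As a continuation of the "863" project "Intelligent Web Search Engine for Spatial Information", this dissertation builds on Web search engine technology and combines natural language processing, GIS, and information extraction to carry out an in-depth, systematic study, in both research and practice, of methods for intelligently acquiring, processing, and serving unstructured Web spatial information. Organized by text granularity, it studies key technologies at the word, sentence, document, and document-hierarchy levels: spatial named entity recognition, spatial semantic analysis, spatial concept extraction, and semantic indexing of the hierarchical structure of anchor text. Using these technologies, the dissertation designs and implements prototypes of a map web page search system, the "CiHu" (词虎) searcher, and "WenTuZhiTong" (文图智通), and integrates the techniques and methods into the design and implementation of the Intelligent Search and Service System for Unstructured Web Spatial Information (SIISE), yielding a preliminary but complete prototype of a spatial information search system. Specifically, the main research work is as follows: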
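    [1] Studied the online recognition of massive numbers of spatial named entities (SNEs). Based on an analysis of general named entity recognition methods, the dissertation proposes using the spatial characteristics of SNEs and geocoding to recognize SNEs online in single sentences and in full texts. For a single sentence, the text is segmented with a basic gazetteer, and SNEs are recognized through geocode analysis and a strategy of merging SNE units. For a full text, a coarse scan of the whole text collects the relevant geocodes, code analysis narrows down the spatial extent the text refers to, and matching dictionaries are then loaded automatically according to a set strategy to recognize the other SNEs in the text. Experiments show that this method recognizes a large number of compound SNEs that are absent from the dictionary, that the system is adaptive to some degree, and that it largely resolves the inefficiency caused by very large named entity dictionaries.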
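    [2] Studied methods for spatial semantic analysis and spatial concept extraction in natural language. Spatial semantic roles are defined according to how Chinese expresses spatial concepts and how GIS represents spatial information. These roles are then used to define a formal description of spatial concepts, and a basic approach is proposed for using them to analyze spatial semantics and spatial concepts in natural language. The method first builds a spatial semantic dictionary and then, following the principles of shallow parsing, extracts spatial concepts from text through spatial semantic role labeling, phrase recognition, and concept pattern matching. Preliminary experiments show good precision, while recall still needs improvement.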
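    [3] Explored a retrieval mechanism based on semantic indexing of the hierarchical structure of anchor text. Building on an in-depth analysis of the characteristics of anchor text and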
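
The single-sentence SNE recognition step summarized in item [1] (gazetteer-based segmentation, geocode analysis, and merging of SNE units) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the gazetteer entries, the prefix-style geocodes, the forward-maximum-matching segmenter, and the function names `segment`, `merge_units`, and `recognize` are all assumptions made for illustration.

```python
# Hypothetical toy gazetteer: place name -> hierarchical geocode.
# The codes are made up; a child region's code extends its parent's code.
GAZETTEER = {
    "北京市": "11",
    "海淀区": "1108",
    "学院路": "110801",
    "上海市": "31",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def segment(sentence):
    """Forward maximum matching: return a list of (text, geocode or None)."""
    units, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in GAZETTEER:
                units.append((piece, GAZETTEER[piece]))
                i += size
                break
        else:
            # No gazetteer entry starts here: emit a single character.
            units.append((sentence[i], None))
            i += 1
    return units


def merge_units(units):
    """Merge adjacent gazetteer units when the next code extends the previous one."""
    merged = []
    for text, code in units:
        prev_code = merged[-1][1] if merged else None
        if code and prev_code and code.startswith(prev_code):
            prev_text, _ = merged.pop()
            merged.append((prev_text + text, code))
        else:
            merged.append((text, code))
    # Keep only units that carry a geocode, i.e. the recognized SNEs.
    return [(text, code) for text, code in merged if code is not None]


def recognize(sentence):
    """Return the SNEs found in one sentence, with compound names merged."""
    return merge_units(segment(sentence))
```

With this toy data, `recognize("我在北京市海淀区学院路工作")` yields the single compound entity 北京市海淀区学院路, which is not itself a gazetteer entry. That is the kind of compound SNE the abstract says the method can recover.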
Unstructured data makes up a large part of Web information resources and is the main data source for Web search engines. As an important component of Web resources, unstructured spatial data is the major research subject of the Geo Search Engine (GSE), which is regarded as a branch of the general search engine. GSE combines WebGIS with search engine technology. It can provide Local Service to ordinary users and supply geo-related information, in line with the current trend of information retrieval toward intelligent and personalized services.
    As a continuation of the "863" program "Intelligent Web Search Engine for Spatial Information", the dissertation draws on Web search engine technology, Natural Language Processing (NLP), GIS, and Information Extraction (IE) to make an in-depth and systematic study of the acquisition, processing, and serving of unstructured spatial information. Working at different levels of text granularity (word, phrase, and sentence), it focuses on the key technologies and approaches of spatial named entity (SNE) recognition, spatial semantic analysis, spatial concept extraction, and semantic indexing and retrieval based on the hierarchical structure of anchor text. Using these basic research results, the dissertation implements prototype systems such as Map Page Search Engine, SNE Searching, and WenTuZhiTong. Finally, an integrated prototype of the Intelligent Web Search Engine of Unstructured Spatial Information (SIISE) is constructed. The main contributions and innovations of this dissertation are summarized as follows:
    [1] Summarizes the current state of research on Geo Search Engines, spatial concept extraction, and semantic indexing.
    [2] Gives solutions for recognizing Chinese SNEs online. Using geocoding, the dissertation presents an approach for recognizing new Chinese SNEs, which do not exist in gazetteers, from online web pages. Experiments show that the approach is efficient. The algorithm is now applied in the SNE Searching system, which is a client of the CiHu software system.
    [3] Definitions of spatial semantic roles are put forward according to Chinese
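
The idea of labeling spatial semantic roles and then matching concept patterns can be illustrated with a minimal sketch. It is not the dissertation's method: the role labels, the lexicon entries, the three patterns, the assumption that the input is already segmented into tokens, and the function names `label_roles` and `extract_concepts` are all hypothetical, and the shallow-parsing and phrase-recognition steps are omitted.

```python
import re

# Hypothetical spatial semantic lexicon: token -> role label (illustrative only).
ROLE_LEXICON = {
    "天安门": "LANDMARK",
    "北京站": "LANDMARK",
    "以东": "DIRECTION",
    "以北": "DIRECTION",
    "附近": "PROXIMITY",
}
# A distance token is a number followed by a unit (metres or kilometres).
DISTANCE = re.compile(r"^\d+(\.\d+)?(米|公里)$")

# Hypothetical concept patterns: a role sequence and a slot name per role.
# Longer patterns come first so they win over their shorter prefixes.
PATTERNS = [
    (("LANDMARK", "DIRECTION", "DISTANCE"), ("reference", "direction", "distance")),
    (("LANDMARK", "DIRECTION"), ("reference", "direction")),
    (("LANDMARK", "PROXIMITY"), ("reference", "relation")),
]


def label_roles(tokens):
    """Attach a spatial role label to every token; 'O' means no spatial role."""
    labeled = []
    for tok in tokens:
        if tok in ROLE_LEXICON:
            labeled.append((tok, ROLE_LEXICON[tok]))
        elif DISTANCE.match(tok):
            labeled.append((tok, "DISTANCE"))
        else:
            labeled.append((tok, "O"))
    return labeled


def extract_concepts(tokens):
    """Scan the labeled tokens and fill slots wherever a pattern matches."""
    labeled = label_roles(tokens)
    concepts, i = [], 0
    while i < len(labeled):
        for roles, slots in PATTERNS:
            window = labeled[i:i + len(roles)]
            if tuple(role for _, role in window) == roles:
                concepts.append({slot: tok for slot, (tok, _) in zip(slots, window)})
                i += len(roles)
                break
        else:
            i += 1
    return concepts
```

For the pre-segmented input `["酒店", "位于", "天安门", "以东", "500米"]`, the sketch returns one concept with 天安门 as the reference, 以东 as the direction, and 500米 as the distance.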
