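Autonomic Topic-Oriented Web Page Collection Based on RSS and Ontology Semantic Adaptation

English title: RSS and Ontology Semantic Based Autonomic Web Page Collection in Vertical Search Engine
Author: Zhang Haobin (张浩斌)
Degree level: Master's
Discipline: Computer Applications
Chinese keywords: 垂直搜索; 主题页面; 异构集成; RSS; XPath; Web2.0
English keywords: vertical search; topic-oriented web page; heterogeneous integration; RSS; XPath; Web2.0
Degree year: 2008
Advisor: Hu Hua (胡华)
Discipline code: 081203
Degree-granting institution: Zhejiang Gongshang University (浙江工商大学)
Thesis submission date: 2008-03-01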

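Abstract

Search engines arose in response to the expansion of information on the Internet. Their task is to help users sift through massive amounts of information and quickly find what they need. Surveys show that by 2006 search had become the second most used Internet service, after e-mail. General-purpose search engines cover a huge volume of information, but they struggle to maintain accuracy and relevance at the same time, and they rarely satisfy personalized or specialized queries that demand precision.
     Vertical search refers to specialized search engines targeted at a particular industry. It is a subdivision and extension of general search: it consolidates a particular class of specialized information in the page repository, extracts the required data field by field, processes it, and returns it to the user in some structured form. A vertical search engine is a tool for retrieving information on a specific domain and topic, and topic-oriented page collection is its foundation. This thesis analyzes and studies that core foundational task, topic page collection, focusing on the following aspects:
     1. Building on DOM parsing, it proposes an improved HPath page extraction technique. To address the heterogeneity of DOM parsers, it uses HPath to integrate different parsers, laying a theoretical and technical foundation for commercial topic page collection and vertical search engine research.
     2. For the emerging Web2.0 network, it proposes a high-precision topic page collection scheme based on Web2.0, and uses XPath to resolve the inconsistency among RSS standards.
     3. For post-processing of collected topic pages, it proposes ontology semantic adaptation to resolve topic-semantic heterogeneity among pages from different systems, and uses a semantic distance algorithm to summarize and classify page topics.
     4. To improve the practicality and maintainability of the collection system, it adopts IBM's autonomic computing framework, combined with improved active data warehouse ECA rules, and proposes a design for a topic page collection system with a degree of autonomic capability.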
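The abstract does not state which semantic distance algorithm the thesis uses, so the following is only a minimal sketch of one common choice: the number of edges between two concepts in an ontology tree, measured through their lowest common ancestor. The toy ontology, the concept names, and the functions `semantic_distance` and `classify` are hypothetical and are not taken from the thesis.

```python
# Hypothetical toy ontology: each concept maps to its parent (None marks the root).
# These concepts are illustrative assumptions, not the thesis's ontology.
PARENT = {
    "thing": None,
    "product": "thing",
    "electronics": "product",
    "phone": "electronics",
    "laptop": "electronics",
    "book": "product",
    "novel": "book",
}


def ancestors(concept):
    """Return the path from a concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def semantic_distance(a, b):
    """Count the edges on the path between a and b through their lowest common ancestor."""
    depth_from_a = {c: i for i, c in enumerate(ancestors(a))}
    for steps_from_b, c in enumerate(ancestors(b)):
        if c in depth_from_a:
            return depth_from_a[c] + steps_from_b
    raise ValueError("concepts do not share a root")


def classify(page_concept, topics):
    """Assign a page concept to the candidate topic at the smallest semantic distance."""
    return min(topics, key=lambda t: semantic_distance(page_concept, t))
```

Under this measure, a page whose dominant concept is `phone` would be filed under `electronics` rather than `book`, because the path to `electronics` is shorter. Weighted edges or information-content measures are possible alternatives; the abstract does not say which variant the author adopted.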
Search engines are important tools for locating online information quickly. Users obtain relevant information through keyword or full-text search. While general-purpose engines return massive amounts of information for a user query, they have trouble maintaining comprehensive and up-to-date search indexes. They fail to deliver highly accurate and relevant results and cannot satisfy personalized and professional queries.
     Vertical search can be regarded as an extension and customization of general search. Such engines focus on a certain domain, identify and integrate domain-specific information, extract the needed data, and wrap it into formatted information. Within this process, topic-oriented web page collection is the key and basic part. On the basis of an analysis of vertical search, the author has carried out research on and an implementation of web page collection. The main research work presented in this thesis is as follows:
     1. It proposes the HPath web extraction method on the basis of DOM parsing, to handle heterogeneous DOM parsers. In doing so, it provides a basis, in both theory and practice, for commercial topic-oriented web page collection and vertical search engines.
     2. It puts forward a scheme for high-precision topic web page collection based on Web2.0 techniques, and uses XPath to resolve the multi-standard problem in RSS.
     3. An ontology semantic adaptation solution is presented to cope with the heterogeneous semantics of web pages from various systems, and a semantic distance function is defined for summarizing and classifying web page topics.
     4. The ECA rule system is modified to fit IBM's autonomic computing framework, and an autonomic web page collection system is designed that targets applicability and maintainability.
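To illustrate the second point, the sketch below shows one way path expressions can hide the differences between feed formats. It assumes only two formats, RSS 2.0 and Atom 1.0, and uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. The `FORMATS` table and the `extract_items` function are hypothetical names introduced for illustration; the thesis's actual expressions and formats are not given in the abstract.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Per-format path expressions, keyed by the root element's local name.
# link_attr names the attribute holding the URL, or is None when the URL is element text.
# This table is an illustrative assumption, not the thesis's configuration.
FORMATS = {
    "rss": {"item": "./channel/item", "title": "title",
            "link": "link", "link_attr": None},
    "feed": {"item": "./atom:entry", "title": "atom:title",
             "link": "atom:link", "link_attr": "href"},
}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_items(xml_text):
    """Return a list of (title, link) pairs from an RSS 2.0 or Atom 1.0 document."""
    root = ET.fromstring(xml_text)
    spec = FORMATS[local_name(root.tag)]
    items = []
    for node in root.findall(spec["item"], ATOM_NS):
        title = node.findtext(spec["title"], default="", namespaces=ATOM_NS)
        link_node = node.find(spec["link"], ATOM_NS)
        if link_node is None:
            link = ""
        elif spec["link_attr"]:
            link = link_node.get(spec["link_attr"], "")
        else:
            link = (link_node.text or "").strip()
        items.append((title.strip(), link))
    return items
```

With this design, supporting another feed dialect would mean adding one row of path expressions to the table rather than writing a new parser, which is presumably the kind of benefit the XPath-based approach is after.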
