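Research on Web Information Acquisition Technology

Author: Wu Donghua (吴东华)
Degree level: Master's
Discipline: Computer Application Technology
Keywords (Chinese, translated): Web Crawler ; Web ; Citeseer ; Literature Quality Evaluation ; Context Graph ; PageRank ; Bayes ; Content ; Topological Structure
Keywords (English): Web Crawler ; Citeseer ; Quality Evaluation ; Context Focused Graph ; PageRank ; Content ; Link Structure ; Bayes
Degree year: 2004
Advisor: Sun Huaijiang (孙怀江)
Discipline code: 081203
Degree-granting institution: Nanjing University of Science and Technology (南京理工大学)
Thesis submission date: 2004-06-01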

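Abstract

With the rise of the Internet and the arrival of the information age, Web information acquisition has become an active research area worldwide. Retrieving the information that people are interested in as accurately as possible is the central problem of this field. However, because of the diversity of the Internet and the complexity of document structure, research on Web information acquisition is difficult and can hardly cover every domain, so specialized search engines have become the main approach to the problem. This thesis studies Citeseer, widely regarded as the best search engine for computer science literature, and attempts to propose a scheme that lets researchers obtain computer science papers from the Citeseer website more conveniently and accurately according to their own interests.
     The work of this thesis includes:
     1. Collection and analysis of papers from the Citeseer website
     Processing information on the Internet often requires downloading Web pages scattered across the Internet to a local machine for further processing. This thesis therefore designs a web crawler that, relying on the specific form of the links to paper pages on the Citeseer website, downloads the HTML source of the paper pages into a local database. The pages are then analyzed according to the regular patterns of their display layout, and the required information is extracted and stored by category in separate database tables for later use.
     2. Paper quality evaluation based on content and topological structure
     Starting from the set of papers returned by a Citeseer search, this thesis re-evaluates the papers by content and by topological structure separately, and re-sorts the set according to the evaluation results in order to find the papers of interest. The content-based evaluation builds a "context graph" from good papers supplied in advance to obtain samples of each class, and uses naive Bayes as the classification algorithm. The evaluation based on topological structure uses the PageRank algorithm. Experimental results show that the two methods reflect paper quality from a subjective and an objective perspective, respectively.
     3. A knowledge decision system framework combining content and topological structure
     Because the content-based method and the method based on topological structure evaluate paper quality from a subjective and an objective perspective respectively, this thesis combines them into a knowledge decision system framework for the Citeseer literature search engine. Concretely, relevant papers are first extracted from the Citeseer result set with the content-based method, and these papers are then ranked objectively with the PageRank algorithm. Experiments in two fields familiar to the author show that the method is effective to a certain extent.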
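
The link-selection step of the first contribution, in which the crawler follows only links that have the specific form of paper pages, might look roughly like the sketch below. It is a minimal illustration, not the thesis's crawler: the pattern `PAPER_LINK`, the host `example.org`, and the helper names are hypothetical, since the abstract does not state the actual form of Citeseer paper links, and downloading and database storage are left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pattern: the abstract does not give the real form of paper links.
PAPER_LINK = re.compile(r"^https?://example\.org/papers/[\w-]+\.html$")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def paper_links(html, base_url):
    """Return absolute links matching PAPER_LINK, in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if PAPER_LINK.match(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


page = (
    '<a href="/papers/smith99.html">A</a> '
    '<a href="/about.html">B</a> '
    '<a href="/papers/smith99.html">A again</a>'
)
print(paper_links(page, "http://example.org/index.html"))
# ['http://example.org/papers/smith99.html']
```

A real crawler would fetch each returned URL, store the HTML, and repeat the extraction on the fetched pages.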

English Abstract

With the rapid growth of the World Wide Web and the arrival of the information age, Web information acquisition has become a very active research topic. How to retrieve exactly the information of interest from the Web is its most important problem. However, because of the complexity of the Web, this research is difficult and can hardly cover all areas, so topic-specific search engines have become one of the best solutions. In this thesis, we choose Citeseer, which is regarded as the best topic-specific search engine for computer science literature, as the object of our research, and try to put forward a scheme that helps scientists obtain the computer science papers they are interested in from Citeseer more conveniently and more accurately.
    Contributions of this thesis include:
    1. Collection and analysis of papers on Citeseer
    When processing information on the Web, we need to download HTML pages to a local computer. In this thesis, we design a web crawler for Citeseer that collects the HTML source code of every paper page and stores it in a local database. We then parse this information according to the display rules of Citeseer and store the results in the corresponding tables. This work prepares the data for the subsequent research.
    2. Quality evaluation of papers based on content and link structure
    In this thesis, we use content information and link structure to re-evaluate the result papers obtained from Citeseer, and re-sort them so that interesting papers can be found more accurately. In the content-based method, we use a "context focused graph" to find sample texts and the naive Bayes algorithm as the classifier. In the method based on link structure, we use the PageRank algorithm. Experimental results show that these two kinds of methods evaluate papers properly from two different sides.
    3. A knowledge decision framework based on content and link structure
    Since the content-based method evaluates papers from a subjective point of view, while the method based on link structure evaluates them from an objective point of view, we put forward a scheme that combines the two methods to evaluate papers. Specifically, we first find relevant papers based on content, which shrinks the result set returned by Citeseer, and then evaluate these papers based on link structure and present the results in order of evaluation value. Experimental results show that this method is effective to a certain extent.
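
The two-stage scheme of the third contribution (content-based filtering, then link-based ranking) could be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the toy data, the function names, the label `"relevant"`, the damping factor 0.85, and the choice to run PageRank only on the citation subgraph of relevant papers are all assumptions, and the context-graph step that produces the training samples is replaced by a hand-labelled list.

```python
import math
from collections import Counter


def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts a multinomial naive Bayes needs."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def classify_nb(model, tokens):
    """Return the most probable label, using log-probabilities and add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pagerank(links, damping=0.85, iterations=50):
    """links: {paper: [cited papers]}. Power iteration; dangling papers spread rank uniformly."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def rank_relevant(papers, citations, model):
    """Stage 1: keep papers classified as 'relevant'. Stage 2: sort them by PageRank."""
    relevant = [pid for pid, tokens in papers.items() if classify_nb(model, tokens) == "relevant"]
    keep = set(relevant)
    # Assumption: PageRank is computed on the subgraph induced by the relevant papers.
    subgraph = {p: [c for c in citations.get(p, []) if c in keep] for p in relevant}
    rank = pagerank(subgraph)
    return sorted(relevant, key=lambda p: rank[p], reverse=True)


# Toy, hand-labelled training samples standing in for the context-graph samples.
training = [
    (["crawler", "web", "link"], "relevant"),
    (["crawler", "search", "engine"], "relevant"),
    (["protein", "gene", "cell"], "other"),
    (["gene", "sequence", "cell"], "other"),
]
papers = {
    "A": ["web", "crawler"],
    "B": ["search", "engine", "link"],
    "C": ["gene", "cell"],
    "D": ["crawler", "engine"],
}
citations = {"A": ["B"], "D": ["B", "A"], "C": ["A"], "B": []}

model = train_nb(training)
print(rank_relevant(papers, citations, model))
# ['B', 'A', 'D']
```

Paper C is dropped by the content filter. Among the remaining papers, B is cited by both A and D and therefore ranks first, while D is cited by none and ranks last.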
