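Research on Intelligent Web Information Retrieval (Web信息智能检索研究)

English title: The Research on Intelligent Web Information Retrieval
Author: Han Wei (韩巍)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Internet; World Wide Web; Information Retrieval System; Search Engine; Classification Algorithm
Degree year: 2004
Advisor: Wu Guofeng (吴国凤)
Discipline code: 081202
Degree-granting institution: Hefei University of Technology (合肥工业大学)
Thesis submission date: 2004-05-01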

Abstract

As the Web keeps growing, users demand more of Web information retrieval systems, and Web information retrieval has become a focus of Internet research. In recent years, researchers have proposed topic-specific Web information retrieval (focused crawling) to serve users who need specialized information and to overcome some shortcomings of general-purpose search engines.
     This thesis discusses the key technologies involved in topic-specific Web information retrieval and studies in depth the page topic identification (Web page classification) problem in such systems. Most current Web page classification methods rely only on page content and do not make full use of the Web's link information, so their classification results are not very good. In fact, links between Web pages sometimes reflect the topics of the linked pages. This thesis therefore presents a classification method that uses both the link structure and the content of a page to decide its class; experiments show an improvement over methods that consider page content only. Building on this work, the thesis constructs a topic-specific Web information search system (a focused crawler) whose classifier is based on both the content and the links of a page, and experiments show an improvement over the common method.
     Finally, an experimental system platform was implemented in the VC++ 6.0 development environment, and the related experiments were carried out on this platform.
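
The abstract does not give the details of the classifier or the crawler, so the following is only a minimal sketch of the general idea: blending a content-only score with information from linked pages, then using that score to steer a best-first crawl. It is written in Python for brevity, not in the VC++ 6.0 environment the thesis used. Everything in it is an assumption made for illustration: the keyword-fraction content score, the weight alpha, the threshold, the in-memory PAGES and LINKS data, and all function names. It is not the method of the thesis.

```python
import heapq

# Hypothetical in-memory "Web": page id -> text, and page id -> outgoing links.
PAGES = {
    "a": "web crawler search engine index",
    "b": "search engine ranking crawler",
    "c": "cooking recipes pasta sauce",
    "d": "links page home",
    "e": "baking bread flour",
    "f": "grill steak marinade",
}
LINKS = {"a": ["b", "c"], "b": ["a", "d"], "c": ["e", "f"], "d": ["a", "b"]}
TOPIC_TERMS = {"web", "crawler", "search", "engine", "index", "ranking"}


def content_score(text, topic_terms):
    """Toy content-only classifier: fraction of words that are topic terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def combined_score(page, pages, links, topic_terms, alpha=0.7):
    """Blend the page's own content score with the mean content score
    of the known pages it links to or is linked from (alpha is assumed)."""
    own = content_score(pages[page], topic_terms)
    neighbors = {n for n in links.get(page, ()) if n in pages}
    neighbors |= {src for src, dsts in links.items() if page in dsts and src in pages}
    neighbors.discard(page)
    if not neighbors:
        return own
    mean = sum(content_score(pages[n], topic_terms) for n in neighbors) / len(neighbors)
    return alpha * own + (1 - alpha) * mean


def focused_crawl(seed, pages, links, topic_terms, threshold=0.2, limit=10):
    """Best-first crawl: links found on higher-scoring pages are visited first,
    and pages scoring below the threshold are neither kept nor expanded."""
    frontier = [(-1.0, seed)]  # min-heap on negated priority
    seen = {seed}
    relevant = []
    while frontier and len(relevant) < limit:
        _, url = heapq.heappop(frontier)
        score = combined_score(url, pages, links, topic_terms)
        if score < threshold:
            continue
        relevant.append(url)
        for nxt in links.get(url, ()):
            if nxt in pages and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, nxt))
    return relevant


if __name__ == "__main__":
    for p in sorted(PAGES):
        print(p, content_score(PAGES[p], TOPIC_TERMS),
              combined_score(p, PAGES, LINKS, TOPIC_TERMS))
    print(focused_crawl("a", PAGES, LINKS, TOPIC_TERMS))
```

In this toy data, page "d" contains no topic terms, so a content-only score rejects it, while its links to and from on-topic pages lift its combined score above the assumed threshold. Page "c" is also linked from an on-topic page, but its other neighbors are off-topic, so it stays below the threshold and the crawl does not follow its links.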

