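Research on Topical Web Crawling Technique for Topic-Specific Search Engine

Original title (Chinese): 面向专业搜索引擎的主题爬行技术研究
Author: 彭涛 (Peng Tao)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: topical crawling; topic-specific search engine; link context; BWPSO; algorithm increment; data increment; tunneling; incremental index structure
Degree year: 2007
Advisor: 左万利 (Zuo Wanli)
Discipline code: 081203
Degree-granting institution: 吉林大学 (Jilin University)
Thesis submission date: 2007-04-01
Chair of the defense committee: 钟绍春 (Zhong Shaochun)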

Abstract

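To address the problem of acquiring topic-specific web page information for topic-specific search engines, this dissertation studies topical crawling techniques in depth and proposes an adaptive topical crawling method based on link contexts. The method uses the (?)ζ-IDOM link-context approach and continually refines the topic feature set during crawling. Experimental results show that the method becomes progressively more adaptive without suffering topic drift, and is therefore reasonably robust.
     The original particle swarm optimization (PSO) algorithm is improved for the purposes of this research, yielding BWPSO. Tests show that BWPSO needs fewer iterations than standard PSO to reach the same result, so using BWPSO to solve optimization problems is feasible and more efficient. BWPSO is then used to optimally combine the web page classifiers produced iteratively during training into a final classifier. Experimental results show that this optimized combination greatly improves web page classification performance.
     Because web pages on the Internet are frequently added, modified, and deleted, a topical crawling method with incremental properties is proposed, comprising algorithm increment and data increment. Algorithm increment lets the crawler improve itself through training when the initial training set is incomplete; data increment seeks and identifies the patterns by which web pages change, so that topical crawling keeps the collected pages fresh. Experiments verify the effectiveness of the method.
     Tunneling is divided into Grey Tunneling and Black Tunneling, and a traversal method is proposed for each. Experimental results show that traversal of both kinds of tunnel achieves the expected effect.
     A topic-specific search engine, LookClearTSSE, is built. The topical crawler LciSpider, built in this dissertation on multiple crawling strategies, acquires web page information for a specific domain; the incremental index structure proposed in this dissertation is then used to build the retrieval and query interface and to rank the query results. Experiments verify the advantages of this approach.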
With the rapid expansion of web page information on the WWW, it is becoming harder to retrieve the information and knowledge relevant to a specific domain. Topical web crawling techniques for retrieving domain-specific information have therefore received increasing attention and development in recent years. Topical web crawling has been applied not only to topic-specific search engines but also to other fields such as digital libraries. Research on topical web crawling therefore has both academic significance and broad application prospects. The major contents can be summarized as follows:
     (1) This dissertation gives a general summary of research on topical web crawling and related techniques, and analyzes its background and course of development. After introducing and analyzing the development of search engines and text classification, the advantages and necessity of a topic-specific search engine are presented. The future of search engines is also discussed. The basic theory and strategies of topical web crawling and text classification are introduced and analyzed; they form the groundwork for the further research.
     (2) Evaluating the relevance of a target web page from web page information is an effective topical web crawling approach. A knowledgeable human can usually identify a target web page before it is displayed; making a computer imitate this ability is a challenge. Anchor text and link information are often too short and not informative enough, so predicting target web pages from anchor text alone is unreliable. We therefore extend anchor text to link contexts, which we consider the most effective way to enrich anchor text. Several link-context extraction methods already exist. After analyzing and comparing them, this dissertation presents a new link-context extraction method, (?)ζ-IDOM. Experimental results show that, with appropriate parameters, the method improves topical web crawling performance. To obtain the initial feature set of the topical web crawler, we extract the contexts of the backward links of the seed URLs, which ordinarily summarize the content of the seed URLs. This yields a small but accurate topic feature set, which is used to guide the crawler. Building on link-context-based topical web crawling, this dissertation presents a self-adaptive method that evolves the topic feature set during crawling. Experimental results show that the method strengthens the self-adaptability of the crawler while avoiding topic drift.
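As an illustration of what a link context is, the following minimal sketch extracts the anchor text of each link together with a fixed window of surrounding words. It is not the (?)ζ-IDOM method of the dissertation, whose details are not given in this abstract; the class `LinkContextParser`, the function `link_contexts`, and the `window` parameter are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects the words of a page and records where each anchor's text sits."""

    def __init__(self):
        super().__init__()
        self.words = []    # all words of the page, in document order
        self.links = []    # (href, start_index, end_index) per anchor
        self._open = None  # (href, start_index) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def link_contexts(html, window=3):
    """Map each href to its anchor text plus `window` words on either side."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.words), end + window)
        contexts[href] = " ".join(parser.words[lo:hi])
    return contexts
```

A crawler could compare each extracted context with the topic feature set to rank the corresponding URL before downloading it; a larger `window` gives richer but noisier contexts.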
     (3) Particle Swarm Optimization (PSO) is applied in many fields because it is effective, easy to understand, and easy to implement. PSO is a computational intelligence method motivated by the social behavior of organisms; it is a population-based, iterative optimization strategy. Based on the original PSO algorithm, this dissertation presents an improved algorithm named BWPSO. Experimental results show that BWPSO requires fewer iterations than standard PSO to reach the same results, so we conclude that BWPSO is applicable to optimization problems and more efficient.
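For readers unfamiliar with the baseline, the sketch below shows standard PSO with an inertia weight, the algorithm that BWPSO is compared against. It does not implement BWPSO, whose modifications are not described in this abstract. The function names and the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) are assumptions made for this example.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over a box using standard inertia-weight PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # move, keeping the particle inside the search box
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)
```

Calling `pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))` returns a point close to the origin. The number of iterations needed to reach a given objective value is the quantity on which the abstract compares BWPSO with standard PSO.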
     (4) Text classification, which is used to guide the topical web crawler, is a key technique in web information retrieval research. In this dissertation we assign different weights to different content within the same web page according to the page structure, and identify and exploit rules that web pages have in common to compute the feature weights of pages. Because web pages are diverse and the training set is rarely representative enough, we build a set of classifiers by iteratively applying the SVM algorithm to the training set. Because PSO is an iterative optimization strategy, we use BWPSO to combine the classifiers produced by the training iterations into the final classifier. Experimental results show that combining the classifiers, partitioning the web page structure, and using common rules to compute feature weights greatly improve classifier performance.
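One plausible way to frame the combination step is as a weighted vote whose weights are chosen by an optimizer. The sketch below assumes that formulation; the abstract does not state the exact combination rule. The three threshold classifiers stand in for the SVM classifiers produced by training iterations, and all names here are hypothetical.

```python
def combined_predict(classifiers, weights, x):
    """Weighted vote of classifiers that each return +1 or -1."""
    score = sum(w * clf(x) for clf, w in zip(classifiers, weights))
    return 1 if score > 0 else -1


def ensemble_error(classifiers, weights, validation):
    """Fraction of validation samples the weighted ensemble gets wrong.

    An optimizer such as PSO would search for the weight vector
    that minimizes this value.
    """
    wrong = sum(1 for x, y in validation
                if combined_predict(classifiers, weights, x) != y)
    return wrong / len(validation)


# Toy stand-ins for classifiers obtained from successive training iterations.
toy_classifiers = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > -2 else -1,
]
toy_validation = [(-3, -1), (-1, -1), (1, 1), (3, 1)]
```

With the toy data, the weight vector `[0, 1, 0]` misclassifies one of the four samples, while `[1, 1, 1]` classifies all of them correctly, so an optimizer minimizing `ensemble_error` would prefer the latter.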
     (5) Since web pages are frequently added, deleted, and modified, research on incremental topical crawling is of great importance. In this dissertation, incremental topical web crawling has two parts: incremental learning in the algorithm and incremental web page updating. Incremental learning in the algorithm provides self-adaptability and self-improvement. In topical web crawling it is impossible to predefine the entire sample set relevant to a specific topic, so incremental learning must use strategies to select favorable training data and improve itself during training. For incremental learning in the algorithm, this dissertation applies an improved 1-DNF algorithm to extract reliable negative data from the training set of the PU (positive and unlabeled) classification problem, and then constructs a classifier by applying SVM iteratively. In the SVM classification algorithm, the decisive data are the support vectors (SVs): the partition induced by the support vector set is equivalent to the partition induced by the whole data set. Accordingly, during training this dissertation uses the SV set, the misclassified samples, and the correctly classified but unclear samples generated in one iteration as the training set for the next iteration. This saves training time and space without harming classification accuracy. The research on incremental web page updating concentrates on inferring and identifying changed (added or modified) web pages in later crawls. The key problem is to find and identify the rules by which web pages change. Following these rules, the crawler can crawl only the pages that have changed, which saves Internet bandwidth while keeping the web pages up to date.
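The reliable-negative extraction step can be sketched with the basic 1-DNF idea: a feature counts as positive if it occurs in a larger share of the positive documents than of the unlabeled ones, and an unlabeled document containing no positive feature is treated as a reliable negative. This is the basic technique only; the dissertation's improvements to 1-DNF are not described in this abstract. Documents are assumed to be lists of tokens, and the function names are hypothetical.

```python
def positive_features(positive_docs, unlabeled_docs):
    """Terms whose document frequency is higher in P than in U."""
    if not positive_docs or not unlabeled_docs:
        return set()

    def doc_freq(docs):
        counts = {}
        for doc in docs:
            for term in set(doc):
                counts[term] = counts.get(term, 0) + 1
        return counts

    p_freq = doc_freq(positive_docs)
    u_freq = doc_freq(unlabeled_docs)
    n_p, n_u = len(positive_docs), len(unlabeled_docs)
    return {term for term, count in p_freq.items()
            if count / n_p > u_freq.get(term, 0) / n_u}


def reliable_negatives(positive_docs, unlabeled_docs):
    """Unlabeled documents that contain none of the positive features."""
    features = positive_features(positive_docs, unlabeled_docs)
    return [doc for doc in unlabeled_docs if not features.intersection(doc)]
```

The positive set and the extracted reliable negatives would then seed the first SVM training round, after which the iterative retraining described above takes over.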
After defining incremental web page updating, this dissertation presents the rules used to determine whether a page has changed. We use a DOM-tree-based method to assess the content of a web page, and then analyze the randomness of web page changes. Through sampling estimation based on experiments, we estimate the change frequency of web pages according to relevant mathematical models. Finally, we present the crawling architecture and algorithm that use incremental web page updating. Experimental results show that, through self-adjusting parameters, this method is able to capture the rules by which web pages change.
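The two ingredients mentioned here, deciding whether a page changed and estimating how often it changes, can be sketched as follows. The sketch assumes a content fingerprint computed with MD5 over whitespace-normalized text and a Poisson change model sampled at regular intervals; the abstract does not name the models the dissertation uses, so both are assumptions, as are the function names.

```python
import hashlib
import math


def content_fingerprint(text):
    """MD5 digest of the page text with whitespace runs collapsed."""
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def estimate_change_rate(changed_flags, interval_days):
    """Estimate changes per day from regularly spaced revisits.

    changed_flags[i] is True if visit i found a different fingerprint
    than the previous visit. Under a Poisson model with rate lam, the
    probability of seeing a change after interval t is 1 - exp(-lam * t),
    so lam = -ln(1 - p) / t for observed change fraction p.
    """
    n = len(changed_flags)
    if n == 0:
        raise ValueError("at least one revisit is required")
    p = sum(1 for flag in changed_flags if flag) / n
    if p == 1.0:
        return math.inf  # changed on every visit; the rate is unbounded
    return -math.log(1.0 - p) / interval_days
```

A crawler could schedule revisits in proportion to the estimated rate, so that pages that rarely change consume little bandwidth.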
     (6) Owing to the complexity of the web environment and the multi-topic nature of web page content, it is quite difficult to obtain all the web pages relevant to a specific topic. An irrelevant web page may link to a relevant one, so we need to traverse irrelevant pages to reach more relevant pages. This procedure is called Tunneling. This dissertation partitions Tunneling into Grey Tunneling and Black Tunneling. Grey Tunneling addresses the problem that the presence of multiple topics in a page weakens the measured relevance of a highly relevant page. To avoid the effect of pages that are irrelevant to the topic as a whole but partially relevant, we divide a multi-topic page into several blocks and process the blocks individually; we can then traverse a page that is irrelevant as a whole, expanding the scope the crawler reaches and obtaining more relevant pages. For Black Tunneling, we present a probing method in which we first store all irrelevant pages temporarily while extracting all their out-links to crawl. During crawling, we assign each irrelevant page a depth value, based on the relevance of its parent page, that determines whether the page should be kept; this broadens the scope of the topical crawler. Experimental results show that the two tunneling methods achieve the expected effect.
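The depth-value idea behind Black Tunneling can be illustrated with a crawl over an in-memory link graph in which each irrelevant page consumes one unit of a tunneling budget and each relevant page resets it. This is a simplified sketch, not the dissertation's algorithm: the budget rule, the `threshold` and `max_tunnel` parameters, and the dictionary-based graph are assumptions made for the example.

```python
from collections import deque


def crawl_with_tunneling(graph, relevance, seeds, threshold=0.5, max_tunnel=2):
    """Breadth-first crawl that may pass through irrelevant pages.

    graph maps a URL to its out-links; relevance maps a URL to a score.
    A page's budget is the number of consecutive irrelevant pages that
    may still be expanded on the path that reached it.
    """
    collected = []
    seen = set(seeds)
    queue = deque((url, max_tunnel) for url in seeds)
    while queue:
        url, budget = queue.popleft()
        if relevance.get(url, 0.0) >= threshold:
            collected.append(url)
            child_budget = max_tunnel   # a relevant parent resets the budget
        elif budget > 0:
            child_budget = budget - 1   # tunnel one step further
        else:
            continue                    # budget exhausted: drop this branch
        for child in graph.get(url, []):
            # Simplification: a page is queued only the first time it is seen.
            if child not in seen:
                seen.add(child)
                queue.append((child, child_budget))
    return collected
```

In a graph where a relevant page D lies behind two irrelevant pages and another relevant page H lies behind three, a budget of 2 reaches D but not H, while a budget of 3 reaches both. A larger budget broadens coverage at the cost of downloading more irrelevant pages.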
     (7) Like a general search engine, a topical search engine consists of three parts: collection of information relevant to a specific topic on the Internet, index construction, and the information retrieval service. Based on the topical crawling techniques described above, this dissertation constructs a topical search engine, LookClearTSSE (LookClear Topic-Specific Search Engine). We construct a topical web crawler, LciSpider (LookClear Intelligent Spider), using several of the crawling strategies introduced in the dissertation to collect topic-relevant information. During crawling, we compute the weights of uncrawled URLs extracted from downloaded pages to determine their priorities. LciSpider implements several strategies, including Breadth-First, Best-First, Link-context, and Content Block Partition. We then build an incremental index structure for the crawled web pages after pre-processing, and provide an information retrieval interface. Before the incremental index structure is constructed, the web pages are first pre-processed to obtain the set of original pages, and then the inverted index is built from the forward index. This dissertation presents a block-based linked structure for storing the index. With this structure, the time cost of updating the index depends only on the number and sizes of the added pages. This space-for-time approach outperforms contiguous index storage in update rate, provides much higher query efficiency than a plain linked-list structure, and also supports real-time updates.
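The update-cost property of a block-based posting structure can be illustrated with a small inverted index whose posting lists are chains of fixed-capacity blocks. Adding a document touches only the tail block of each affected term, so existing postings are never moved. This is a sketch of the general idea under assumed names (`BlockPostingList`, `IncrementalIndex`, `block_size`), not the dissertation's actual data layout.

```python
class BlockPostingList:
    """Posting list stored as a chain of fixed-capacity blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = [[]]

    def append(self, doc_id):
        # Only the tail block is touched; full blocks are never rewritten.
        if len(self.blocks[-1]) == self.block_size:
            self.blocks.append([])
        self.blocks[-1].append(doc_id)

    def __iter__(self):
        for block in self.blocks:
            yield from block


class IncrementalIndex:
    """Inverted index that accepts new documents without rebuilding."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.postings = {}

    def add_document(self, doc_id, text):
        # Work is proportional to the number of distinct terms in the new page.
        for term in set(text.lower().split()):
            if term not in self.postings:
                self.postings[term] = BlockPostingList(self.block_size)
            self.postings[term].append(doc_id)

    def search(self, term):
        return list(self.postings.get(term.lower(), ()))
```

Reading a posting list walks a few blocks rather than one node per posting, which is the sense in which a block chain sits between contiguous storage (fast reads, costly updates) and a per-posting linked list (cheap updates, slow reads).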

