Efficient Algorithm for Crawling Ajax Web Pages

Original title (Chinese): 有效的爬行Ajax页面的网络爬行算法
Authors: LI Hua-bo, WU Li-fa, LAI Hai-guang, ZHENG Cheng-hui, and HUANG Kang-yu (College of Command Information System, PLAUST, Nanjing 210007)
Keywords: Ajax; crawling algorithm; replicas-detecting policy; search engine
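Chinese journal title: DKDX
English journal title: Journal of University of Electronic Science and Technology of China
Affiliation: College of Command Information System, PLA University of Science and Technology
Publication date: 2013-01-30
Publisher: Journal of University of Electronic Science and Technology of China
Year: 2013
Issue: v.42
Funding: Natural Science Foundation of Jiangsu Province (BK2010132)
Language: Chinese
Page: DKDX201301026
Page count: 6
CN: 01
ISSN: 51-1207/T
Classification number: 117-122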

Abstract

Generating and navigating Ajax pages requires executing client-side JavaScript, so traditional crawling algorithms cannot retrieve the full content of an Ajax page. This paper analyzes how Ajax works, describes the main problems in crawling Ajax web pages, and proposes and implements an effective algorithm for crawling them. The algorithm drives a client browser to generate page content dynamically and to perform page navigation, assigns an identification number to each crawled page, and generates a corresponding static page. Experimental results show that the proposed algorithm crawls markedly more Ajax pages than traditional methods, and that its double replica-detecting (deduplication) policy effectively reduces the algorithm's time cost.
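The abstract names three mechanisms: browser-driven navigation, identification numbers with static snapshots, and a double deduplication policy. It does not say what the two checks are. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the first check is a cheap lookup on a request key before the page is rendered, and the second is a hash of the rendered content. `TOY_SITE`, `render`, and `crawl` are hypothetical names, and the toy dictionary stands in for a real scripted browser.

```python
import hashlib
from collections import deque

# Hypothetical stand-in for a scripted browser: each state has the DOM that
# its JavaScript would produce and the events that lead to other states.
TOY_SITE = {
    "home": {"dom": "Home", "events": {"click:#news": "news", "click:#about": "about"}},
    "news": {"dom": "News page 1", "events": {"click:#next": "news?p=2", "click:#home": "home"}},
    "news?p=2": {"dom": "News page 2", "events": {"click:#prev": "news?p=1", "click:#home": "home"}},
    # Different key, same content as "news": only the content check catches it.
    "news?p=1": {"dom": "News page 1", "events": {"click:#next": "news?p=2", "click:#home": "home"}},
    "about": {"dom": "About", "events": {"click:#home": "home"}},
}


def render(site, key):
    """Pretend to execute client-side code and return the resulting DOM."""
    return "<html><body>" + site[key]["dom"] + "</body></html>"


def crawl(site, start):
    requested = {start}   # assumed first check: request keys already queued
    page_ids = {}         # assumed second check: content hash -> page ID
    snapshots = {}        # page ID -> static snapshot of the page
    stats = {"rendered": 0, "skipped_by_key": 0, "skipped_by_content": 0}
    queue = deque([start])
    while queue:
        key = queue.popleft()
        dom = render(site, key)
        stats["rendered"] += 1
        digest = hashlib.sha256(dom.encode("utf-8")).hexdigest()
        if digest in page_ids:
            stats["skipped_by_content"] += 1
            continue      # same content already stored; do not expand it
        page_id = len(page_ids) + 1
        page_ids[digest] = page_id
        snapshots[page_id] = dom
        for target in site[key]["events"].values():
            if target in requested:
                stats["skipped_by_key"] += 1
                continue  # cheap check avoids rendering this target again
            requested.add(target)
            queue.append(target)
    return snapshots, stats


if __name__ == "__main__":
    pages, counters = crawl(TOY_SITE, "home")
    for page_id, html in pages.items():
        print(page_id, html)
    print(counters)
```

In this sketch the key check saves rendering work, which is the expensive step in a real browser, while the content check stops aliased pages from being stored and expanded twice.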

References

[1] SHAH S. Crawling Ajax-driven Web 2.0 applications[EB/OL]. [2011-01-18]. http://www.infosecwriters.com/texts.php?op=display&id=539.
[2] FREY G. Indexing AJAX Web applications[D]. Zurich, Switzerland: Swiss Federal Institute of Technology, 2007.
[3] MESBAH A, BOZDAG E, VAN DEURSEN A. Crawling AJAX by inferring user interface state changes[C]//Proceedings of the 8th International Conference on Web Engineering. New York, USA: [s.n.], 2008.
[4] LUO Bing. The design and implement of AJAX-enabled internet search engine crawler[D]. Hangzhou: Zhejiang University, 2007. (in Chinese)
[5] WANG Ying, YU Man-quan, LI Sheng-tao. Extracting dynamic URLs using JavaScript engine[J]. Journal of Computer Applications, 2004, 24(2): 33-36. (in Chinese)
[6] JIN Xiao-ou, ZHONG Bao-yan, LI Xiang. Research and implementation of interpreting JavaScript dynamic Web page based on Rhino engine[J]. Computer Technology and Development, 2008, 18(2): 1-4. (in Chinese)
[7] ZHANG Shi-yong. Network security principle and application[M]. Beijing: Science Press, 2006. (in Chinese)
[8] LI Xiao-ming, FENG Wang-sen. Two effective functions on hashing URL[J]. Journal of Software, 2004, 15(2): 179-184. (in Chinese)
