Research and Implementation of a Nutch-Based Mobile Web Search System
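Abstract
With the arrival of the 3G era and the spread of mobile phones, portable computers, and other mobile devices, more and more users can conveniently access the network from mobile terminals. As a result, the demand for personalized and intelligent search engines has become more pronounced. Most existing search engines for mobile terminals are local search engines moved directly onto mobile devices. They rank results by pure text relevance alone and even treat location information entered by the user as ordinary text keywords. They make little use of mobile spatial information such as the user's geographic location, even though most search needs on mobile devices are closely tied to spatial position. When mobile users issue a query, they generally want the search engine to return web pages that are not only closely related to the query content but also spatially close to where they are. Existing mobile search engines therefore struggle to give users ideal results.
     Starting from the problems facing mobile search engines, this paper studies solutions that search by text relevance and geographic proximity at the same time. It proposes an implementation scheme for a Nutch-based mobile Web search system, builds a mobile Web search system that searches on both location and keywords, and implements location-aware spatial search. Web pages are geotagged according to the geographic location of the content they describe, so the scheme can find pages related to the user's location and can help mobile users find relevant results nearby. Using a hybrid index that combines Lucene and an R-tree, the system effectively optimizes the ranking of search results, and the work verifies that the hybrid index structure can return results combining text relevance and spatial proximity more quickly.
     The paper describes the overall architecture of the system and the implementation details of each main module. It presents the key techniques of the web page preprocessing, index building, and search modules in detail, including geotagging of web pages, a hybrid-index insertion algorithm based on text clustering, and a search algorithm based on a node priority queue. Finally, the system is tested for both functionality and performance. The results show that the mobile Web search system supports combined search over geographic location and text, and performs well.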
With the spread of 3G technology, mobile phones, portable computers, and other mobile devices are becoming more common, and more and more users can conveniently access the internet from mobile terminals. Users therefore have a clearer demand for intelligent, personalized search engines. Most existing mobile search engines are local search engines transferred directly to mobile terminals. They can only return text-relevant results, because they treat the position information entered by the user as an ordinary text keyword and do not take the user's location or other mobile context into account. However, mobile users often need location-related results: when they issue a query, they want web pages that are both relevant to the text and close to their location. Existing mobile search engines can therefore hardly provide ideal search results for mobile users.
     This paper aims to solve this problem and studies how to retrieve web pages that are both text-relevant and nearby. It proposes a spatial search method that finds nearby web pages by geotagging all web pages in advance according to the locations they describe, and it implements a mobile Web search system on top of Nutch, an existing open-source search engine. It also proposes a hybrid index structure based on Lucene and the R-tree, together with a corresponding "Node Priority Traversal Algorithm". The system uses the hybrid index to index both the location and the text content of web pages, and then uses the "Node Priority Traversal Algorithm" to return location-aware, text-relevant results to mobile users.
     This paper first describes the overall framework and structural design of the mobile Web search system. It then gives the implementation details of each module, including geotagging in the web page preprocessing module, the cluster-enhanced hybrid index in the indexing module, and the "Node Priority Traversal Algorithm" in the search module. After that, it evaluates the functionality and performance of the system. The evaluation shows that the system can provide both text-relevant and nearby web pages for mobile users and performs well.
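The abstract does not give the ranking formula or the details of the "Node Priority Traversal Algorithm", so the following Python sketch is only an illustration of the general idea of a priority-queue search over a spatial tree that also carries text information. Everything in it is an assumption made for the example: the names `Doc`, `Node`, and `top_k`, the linear score with weight `ALPHA`, the normalisation constant `MAX_DIST`, and the hand-built two-level tree that stands in for an R-tree. The `text_score` field stands in for a relevance score that a text index such as Lucene might supply.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import List, Tuple, Union

MAX_DIST = 100.0  # assumed normalisation constant for distances
ALPHA = 0.5       # assumed weight of text relevance versus proximity


@dataclass
class Doc:
    doc_id: str
    x: float
    y: float
    text_score: float  # stand-in for a text relevance score in [0, 1]


@dataclass
class Node:
    children: List[Union["Node", Doc]]
    # Bounding box (x1, y1, x2, y2) and best text score in the subtree.
    box: Tuple[float, float, float, float] = field(init=False)
    max_text: float = field(init=False)

    def __post_init__(self):
        boxes = [c.box if isinstance(c, Node) else (c.x, c.y, c.x, c.y)
                 for c in self.children]
        self.box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))
        self.max_text = max(c.max_text if isinstance(c, Node) else c.text_score
                            for c in self.children)


def min_dist(qx, qy, box):
    """Smallest possible distance from the query point to the box."""
    x1, y1, x2, y2 = box
    dx = max(x1 - qx, 0.0, qx - x2)
    dy = max(y1 - qy, 0.0, qy - y2)
    return math.hypot(dx, dy)


def combined(text_score, dist):
    """Assumed linear mix of text relevance and spatial proximity."""
    proximity = max(0.0, 1.0 - dist / MAX_DIST)
    return ALPHA * text_score + (1 - ALPHA) * proximity


def upper_bound(item, qx, qy):
    """Exact score for a document, optimistic bound for a subtree."""
    if isinstance(item, Doc):
        return combined(item.text_score, math.hypot(item.x - qx, item.y - qy))
    return combined(item.max_text, min_dist(qx, qy, item.box))


def top_k(root, qx, qy, k):
    """Best-first traversal: always expand the most promising entry."""
    counter = itertools.count()  # tie-breaker so the heap never compares items
    heap = [(-upper_bound(root, qx, qy), next(counter), root)]
    results = []
    while heap and len(results) < k:
        neg_score, _, item = heapq.heappop(heap)
        if isinstance(item, Doc):
            results.append((item.doc_id, -neg_score))
        else:
            for child in item.children:
                heapq.heappush(
                    heap, (-upper_bound(child, qx, qy), next(counter), child))
    return results


# Hypothetical data: three pages that already matched the query text.
root = Node([
    Node([Doc("cafe-near", 1, 1, 0.6), Doc("cafe-mid", 10, 10, 0.9)]),
    Node([Doc("cafe-far", 90, 90, 1.0)]),
])
print(top_k(root, 0.0, 0.0, 2))
```

For a user at the origin, `cafe-mid` ranks ahead of `cafe-near` because its higher text score outweighs the small extra distance, while `cafe-far` comes last despite having the best text score. Because each subtree's bound is never lower than the score of any document inside it, a document popped from the queue is guaranteed to be the best remaining one, so the search can stop after `k` results without visiting every node.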
