P2P网络中Top-k查询算法的设计与实现

英文题名：Design and Implement of Top-k Query in P2P Network
作者：于文婷
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：P2P网络 ; 直方图 ; 聚合Top-k查询 ; 非聚合Top-k查询
英文关键词：P2P Network ; Histogram ; Aggregate Top-k ; Non-Aggregate Top-k
学位年度：2009
导师：李克秋
学科代码：081201
学位授予单位：大连理工大学
论文提交日期：2009-11-15

摘要

近年来,随着信息技术的迅猛发展,信息资源极大丰富,如何在动态的P2P网络环境中对海量数据进行查找引起了很大的关注。Top-k查询就是从数量巨大的信息中选择最符合查询条件的k个结果呈现给用户,Top-k查询作为一种新的查询技术引起了学术界广泛的关注,主要包括聚合式top-k查询和非聚合式top-k查询两种。然而现有的聚合式top-k查询算法只按照分值标准选择合适的对象返回给查询节点,相同的阈值标准没有考虑到节点数据分布情况,非聚合式top-k查询算法只能排除非法节点,不能排除有效节点中的非法对象。
针对聚合式top-k查询的缺陷,本论文提出了一种基于P2P网络的混合非一致阈值聚合top-k搜索算法HNUTA(hybrid non-uniform threshold algorithm),HNUTA结合位置选择标准和分值选择标准,通过对每个节点重新定义阈值,并且对每个对象估计极大值和极小值,通过比较当前top-k和候选集中对象的极大值,除去候选集中的非法对象,达到减少非法对象传输的效果。针对非聚合式top-k查询,提出了一种依托于超级立方体骨干P2P网络的控制答复数量top-k算法CRNTop-k(control reply number top-k),该算法利用向量空间模型在本地查询top-k,然后在父节点上合并结果,通过控制查询答复数量的方式来减少带宽消耗。最终通过实验评估和性能分析表明本论文提出的算法在网络带宽消耗和查询响应时间方面要优于其他同类方法。
In recent years, with the rapid development of the information technology, information resources have been greatly enriched. How to retrieve the small amount of the most valuable information from the massive data in P2P network quickly has become a great challenge for the database field. Top-k queries based on ranking elements stop query processing when the top-k ranked results can be safely determined including aggregate top-k query and non-aggregate top-k query. Given an aggregate operation, a top-k query is to find the k objects with the highest (or the lowest) aggregate values. Existing top-k aggregate query algorithm has the uniform threshold and selects the object according to the value standard neglecting the data distribution. The existing non-aggregation top-k algorithm only supports peer pruning search, and users can not search the data based on pruning the illegal object.
This paper proposes hybrid non-uniform threshold algorithm in P2P network, it refines the threshold according to the value and position standard. HNUTA can estimate the best and worst value of the object. Compared between the top-k and the best value in the candidate set, it can remove the illegal object from the candidate set. This paper proposes top-k query in P2P network. It dynamically adjusts of the upper score according to the loss function. This method reduces the reply traffic by controlling the number of query replies. In addition, through experimental simulation and theoretical analysis, the proposed algorithm is superior to the current top-k algorithm. It consumes less bandwidth and has better efficiency.

引文

[1]邓波.分布式序敏感查询处理关键技术研究[D].长沙:国防科学技术大学,2006.
    [2]庄明.基于P2P的路由选择技术研究[D].上海:上海大学,2005.
    [3]徐传运,张杨,毛华扬.P2P全文搜索引擎中的路由算法[J].计算机工程,2008,34(17):85-87.
    [4]Risson J,Moors T.Survey of research towards robust peer-to-peer networks:Search methods[J].Computer Networks,2006,50(17):3485-3521.
    [5]Joung YJ,Yang LW.Wildcard search in structured peer-to-peer networks[J].IEEE Trans.on Knowledge and Data Engineering,2007,19(11):1524-1540.
    [6]Zeinalipour-Yazti D,Kalogeraki V,Gunopulos D.pFusion:A P2P architecture for Internet-scale content-based search and retrieval[J].IEEE Trans.on Parallel and Distributed Systems,2007,18(6):804-817.
    [7]张一鸣,卢锡城,郑倩冰.一种面向大规模P2P系统的快速搜索算法[J].软件学报,2008,19(6):1473-1480.
    [8]方启明,杨广文,武永卫.基于P2P的Web搜索技术[J].软件学报,2008,19(10):2706-2719.
    [9]杨天路,刘字宏,张文.P2P网络技术原理与系统开发案例[M].北京:人民邮电出版社,2007.
    [10]庄雷,潘春简,郭永强.Gnutella网络的连接管理[J].软件学报,2005,16(4):540-551.
    [11]P.Druschel,A.Rowstron.Past:A large-scale,persistent peer-to-peer storage utility[C].The Eighth IEEE Workshop on Hot Topics in Operating Systems(HotOS-Ⅷ),Schoss Elmau,Germany,2001:75-80.
    [12]Kubiatowicz J,Bindel D,Chen Y,et al.OceanStore:an architecture for global-scale persistent storage[C].The Ninth international Conference on Architectural Support for Programming Languages and Operating Systems(ASPLOS 2000),Boston,MA,2000:190-201.
    [13]Ratnasamy S,Francis P,Handley M,et al.A scalable content-addressable network[C].ACM SIGCOMM,San Diego,CA,2001:161-172.
    [14]Stoica I,Morris R,Karger D,et al.Chord:A scalable peer-to-peer lookup service for Internet applications[C].ACM SIGCOMM' 01,San Diego,CA,2001:149-160.
    [15]A.Rowstron,P.Druschel.Pastry:Scalable,distributed object location and routing for large-scale peer-to-peer systems[C].IFIP/ACM Middleware 2001,Heidelberg,Germany,2001:329-350.
    [16]Zhao BY,Kubiatowicz J,Joseph AD.Tapestry:An infrastructure for fault-tolerant wide-area location and routing[R].Technical Report UCB/CSD-01-1141,Computer Science Division,U.C.Berkeley,April 2001.Available at http://www.cs.berkeley.edu/ravenben/publications/CSD-01-1141.pdf.
    [17]Harvey NJA,Jones MB,Saroiu S,et al.SkipNet:A scalable overlay network with practical locality properties[C].The 4th USENIX Symp.on Interact Technologies and Systems(USITS 2003),Berkeley,2003:113-126.
    [18]Saia J,Fiat A,Gribble S,et al.Dynamically fault-tolerant content addressable networks[C].The 1st International Workshop on Peer-to-Peer Systems,Cambridge,MA,2002:270-279.
    [19]S.Nepul,M.V.Ramakfislma.Query processing issues in image(multimedia) databases [C].The 15th International Conference on Data Engineering(ICDE),Sydney,Australia,1999:22-29.
    [20]U.G(u|¨)ntzer,W.-T.Balke,W.Kiessling.Optimizing Multi-Feature Queries for Image Databases[C].The 26th Very Large Databases(VLDB) Conference,Cairo,Egypt,2000:419-428.
    [21]R.Fagin,A.Loten,M.Naor.Optimal aggregation algorithms for middleware[J].J.Comput.Syst.Sci,2003,66(4):614-656.
    [22]P.Cao,Z.Wang.Efficient top-K query calculation in distributed Networks[C].The twenty-third annual ACM symposium on Principles of distributed computing,Newfoundland,Canada,2004:206-215.
    [23]何盈捷,王珊,杜小勇.纯Peer to Peer环境下有效的Top-k查询[J].软件学报,2005,16(4):540-552.
    [24]R.Akbarinia,E.Pacitti and P.Valduriez.Reducing network traffic in unstructured P2P systems using Top-k queries[J].Distributed and Parallel Databases,2006,19(2):67-86.
    [25]Liang NY.The knowledge of Chinese words segmentation[C].Journal of Chinese Information on Processing(in Chinese),1990,4(2):37-43.
    [26]严蔚敏,吴伟民.数据结构(C语言版)[M].北京:清华大学出版社,2001,260-262.
    [27]B.Bloom.Space/time trade-off in hash coding with allowable errors[J].Communication of the ACM,1970,13(7):422-426.
    [28]M.Theobald,G.Weikum,and R.Schenkel.Topk query evaluation with probabilistic guarantees[C].The Thirtieth international conference on Very large data bases,Toronto,Canada,2004:648-659.
    [29]S.Michel,P.Triantafillou,and G.Weikum.KlEE:A Framework for distributed top-k query algorithms[C].The 31st international conference on Very large data bases,Trondheim,Norway,2005:637-648.
    [30]G.Skobeltsyn,T.Luu,I.P.Zarko,et al.Web text retrieval with a P2P query-driven index[C].The 30th annual international ACM SIGIR conference on Research and development in information retrieval,Amsterdam,The Netherlands,2007:679-686.
    [31]W.Balke,W.Nejdl,W.Siberski.Progressive distributed top-k retrieval in peer-to-peer networks[C].The 21st International Conference on Data Engineering(ICDE 2005),Tokyo,Japan,2005:174-185.
    [32]王雷,林亚平,陈治平.超立方体系统中基于安全通路向量的容错路由[J].软件学报,2004,15(5):783-790.
    [33]Kanellopoulos D.,Panagopoulos A..Exploiting tourism destinations' knowledge in an RDF-based P2P network[J].Journal of Network and Computer Applications,2008,31(2):179-200.
    [34]W.Nejdl,M.Wolpers,W.Siberski,et al.Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-To-Peer Networks[J].Web Semantics,2004,1(2):177-186.
    [35]B.Yang,H.Garcia-Molina.Designing a super-peer network[C].The 19th international conference on Data Engineering(ICDE),Bangalore,India,2003:49-60.
    [36]W.Nejdl,B.Wolf,C.Qu,et al.EDUTELLA:a P2P Networking Infrastructure based on RDF[C].The eleventh international conference on World Wide Web,Honolulu,Hawaii,USA,2002:604-615.
    [37]M.Schlosser,M.Sintek,S.Decker,et al.HyperCuP-Hypercubes,Ontologies and Efficient Search on P2P Networks[C].Intern.Workshop on Agents and P2P Computing,Bologna,Italy,2002:112-124.
    [38]陈文博.中文文本自动分类系统的研究与实现[D].长春:吉林大学,2008.
    [39]IRCache.Traces from the ircache system[R].Technical Report http://www.ircache.net,National Laboratory for Applied Network Research,2003.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700