海量多版本文档的加权持久性top-k检索
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:Weighted Durable Top-kSearch on Large Versioned Document Set
  • 作者:兰超 ; 张勇 ; 邢春晓
  • 英文作者:Lan Chao;Zhang Yong;Xing Chunxiao;Department of Computer Science and Technology,Tsinghua University;Research Institute of Information Technology,Tsinghua University;
  • 关键词:多版本文档 ; top-k查询 ; 时态查询 ; 文书类档案 ; 多版本查询
  • 英文关键词:versioned documents;;top-ksearch;;temporal queries;;document archive;;time-travel search
  • 中文刊名:JFYZ
  • 英文刊名:Journal of Computer Research and Development
  • 机构:清华大学计算机科学与技术系;清华大学信息技术研究院;
  • 出版日期:2013-12-15
  • 出版单位:计算机研究与发展
  • 年:2013
  • 期:v.50
  • 基金:国家“九七三”重点基础研究发展计划基金项目(2011CB302302)
  • 语种:中文;
  • 页:JFYZ2013S2018
  • 页数:11
  • CN:S2
  • ISSN:11-1777/TP
  • 分类号:128-138
摘要
提出并研究了针对海量多版本文档的加权持久性top-k检索问题.加权持久性top-k检索能够返回在一个限定时间区间内与查询关键词组持续相关的k个结果,并且考虑不同时间区间有不同的权重.针对这一问题,把现有时空查询和针对多版本文档查询的方法进行扩展,使其支持加权持久性top-k检索问题,并分析总结了该方法的缺点,进而又提出了一种新的基于时间区间窗口的算法.基于时间区间窗口的算法能够支持多种经典top-k算法并有效地解决加权持久性top-k检索问题.最后使用Wikipedia多版本数据进行了一系列性能试验,对比测试了基于区间窗口的算法和扩展算法.结果表明区间窗口算法在各个测试下的效率和可扩展性明显优于扩展算法.
        We study and propose a new problem called weighted durable top-k search on versioned documents.Weighted durable top-k search finds the documents that consistently in top-k of a query during a time period and take the weights of different time intervals in account.We extend existing works about time travel search and versioned document query to support the problem and analyze its performances.We analyze the disadvantages of the extended method.Based on that,we furfure propose a new interval window based method to address the problem.Our method is adapted from classic top-krank algorithm,using interval windows to support time and weight constrains.We use data from Wikipedia to demonstrate the efficiency and scalability of our method.Results show that our interval window based method performances far more efficient than the extended method.
引文
[1]黄荣荣,舒继武,陈康,等.基于连续多版本的可审计文件系统.计算机研究与发展,2009,46(11):1830-1838
    [2]Fagin R,Lotem A,Naor M.Optimal aggregation algorithms for middleware.Journal of Computer and System Sciences,2003,66(4):614-656
    [3]Mamoulis N,Yiu M L,Cheng K H,et al.Efficient top-k aggregation of ranked inputs.ACM Trans on Database Systems,2007,32(3):19
    [4]Mouratidis K,Bakiras S,Papadias D.Continuous monitoring of top-k queries over sliding windows//Proc of2006ACM SIGMOD Int Conf on Management of Data.New York:ACM,2006:635-646
    [5]Theobald M,Weikum G,Schenkel R.Top-k query evaluation with probabilistic guarantees//Proc of the 13th Int Conf on Very Large Data Bases.Trondheim,Norway:VLDB Endowment,2004:648-659
    [6]王意洁,李小勇,祁亚斐,等.不确定数据查询技术研究.计算机研究与发展,2012(07):1460-1466
    [7]Berberich K,Bedathur S,Neumann T,et al.A time machine for text search//Proc of 2007ACM SIGIR Conf on Research and Development in Information Retrieval.New York:ACM,2007:519-526
    [8]Herscovici M,Lempel R,Yogev S.Efficient indexing of versioned document sequences//Amati G,Carpineto C,Romano G,eds.Advances in Information Retrieval.Berlin:Springer,2007,4425:76-87
    [9]He J,Suel T.Faster temporal range queries over versioned text//Proc of 2011 ACM SIGIR Conf on Research and Development in Information Retrieval.New York:ACM,2011:565-574
    [10]Broder A,Eiron N,Fontoura M,et al.Indexing shared content in information retrieval systems//Ioannidis Y,Scholl M,Schmidt J,et al.eds.Advances in Database Technology(EDBT 2006).Berlin:Springer,2006,3896:313-330
    [11]Anand A,Bedathur S,Berberich K,et al.Efficient temporal keyword search over versioned text//Proc of 2010ACM Int Conf on Information and Knowledge Management.New York:ACM,2010:699-708
    [12]He J,Suel T.Optimizing positional index structures for versioned document collections//Proc of ACM SIGIR Conf on Research and Development in Information Retrieval.New York:ACM,2012:245-254
    [13]Anand A,Bedathur S,Berberich K,et al.Temporal index sharding for space-time efficiency in archive search//Proc of2011 ACM SIGIR Conf on Research and Development in Information Retrieval.New York:ACM,2011:545-554
    [14]Lee M,Hsu W,Li L,et al.Consistent top-k queries over Time//Zhou X,Yokota H,Deng K,et al.eds.Database Systems for Advanced Applications.Berlin:Springer,2009,5463:51-65
    [15]U L H,Mamoulis N,Berberich K,et al.Durable top-k search in document archives//Proc of 2010 ACM SIGMOD Int Conf on Management of Data.New York:ACM,2010:555-566
    [16]Robertson S E,Walker S,Jones S,et al.Okapi at TREC-3//Proc of the 3rd Text REtrieval Conf(TREC-3).Gaithersburg:NIST Special Publication,1995:109-126