搜索引擎中索引剪枝的研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

搜索引擎中索引剪枝的研究

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Study on Index Pruning for Web Search Engine
作者：单栋栋
论文级别：博士
学科专业名称：计算机系统结构
中文关键词：倒排索引 ; 静态索引剪枝 ; 动态索引剪枝 ; Top ; K查询处理
英文关键词：Inverted Index ; static index pruning ; dynamic index pruning ; top-k query
英文关键词：porcessing
学位年度：2013
导师：李晓明
学科代码：081201
学位授予单位：北京大学
论文提交日期：2013-05-01

摘要

搜索引擎作为人们获取网络信息的主要入口，正在被越来越多的人使用。不断增长的网页数量和查询请求量使得搜索引擎面临着巨大的性能挑战。通常，搜索引擎每秒需要在数百亿的网页数据上处理成千上万的查询。因此，如何高效地处理查询一直是搜索引擎和信息检索领域中重要的研究问题。
     本文从索引剪枝的角度出发来研究提升查询处理效率的方法。索引剪枝通常分为静态索引剪枝和动态索引剪枝的方法。静态索引剪枝方法主要用在索引构建阶段。它在索引构建时，去除索引中一些对查询不重要的信息来缩短倒排链长度，减小倒排索引的大小，从而提升查询的速度。动态索引剪枝的方法主要用在查询的处理阶段。它在查询的处理时，通过使用相应的索引辅助结构来跳过一些对查询不重要的文档来提升查询的处理速度。本文分别从静态索引剪枝和动态索引剪枝两方面来研究提升查询处理的方法，并提出了一些新的索引结构和算法来支持高效的查询处理。
     本文的工作、创新性和主要贡献主要体现在以下几个方面：
     1.本文提出了一种基本网页结构和重要度的静态索引剪枝方法。在这种剪枝方法中，它一方面利用网页的结构来区别对待在不同结构中信息的重要程度，优先保留重要结构中的信息。另一方面利用网页与网页之间的重要度不同，对重要度高的网页保留更多比例的信息。通过在GOV2数据集上的相关实验，我们发现网页结构和重要度信息对文档的剪枝都有非常大的帮助。结合这两种信息的静态索引剪枝方法在去除80%的倒排项时，P@10指标与使用完整索引的效果相当，存储开销减小了73%，查询处理的速度提升了2倍。
     2.本文提出了一种基于块最大索引的动态索引剪枝算法的处理框架。在此框架下，本文提出了3个新的动态索引剪枝算法，分别为Block-MaxMaxScore (BMM)， Local Block-Max WAND (LBMW)和Local Block-MaxMaxScore (LBMM)。在这个框架下查询处理主要分为三步：1).选择候选文档。本文提出的算法利用块局部最大分数而不是词的全局最大分数来选择候选文档，大大减少了选择候选文档的数量。2).过滤文档。利用文档所在块最大分数来过滤候选文档。本文提出一种更高效的过滤方法，它能充分的利用查询处理局部性，提升过滤的效率。3).候选文档打分。本文提出的算法利用块最大分数来动态估算文档分数上限，从而可以尽快的结束对不重要的候选文档的打分。实验表明，新提出剪枝算法在查询处理速度上比已有的算法都有明显的提升，它比经典的剪枝方法快近1倍。
     3.本文提出了两种在块最大索引中存储文档静态分数与词相关性分数的方法。使得动态索引剪枝算法的打分函数在包含文档静态分数时也能对查询进行高效的剪枝。一种方法是对每个块分别存储块中最大的词相关性分数和文档静态分数；另一种方法是先把词相关性分数与文档静态分数合并，然后存储块中最大合并分数。第二种方法相对于第一种方法能更加精确地估算候选文档的分数，但它存在低估候选文档分数的情况。本文分析了产生文档分数低估的原因并给出了相应的解决方法。同时，本文还讨论了在不同文档重排序下对动态索引剪枝算法的影响。实验表明在打分公式中文档静态分数比例较小时，按URL序重排序的索引查询处理的速度比较快。当文档静态分数的比重增加时，则按文档静态分数序排序的索引的查询处理更有优势。
     4.本文提出一种在区间最大索引中最优化区间划分的方法，并在这种区间划分的方式下，提出了4种动态索引剪枝算法：Space-Max WAND (SMW)，Space-Max MaxScore (SMM)，Space-Max AND (SMA)和Space Max AND withCheck (SMAC)。这种最优化区间划分的方法结合了文档分数估算误差和区间自身的处理开销等因素。它能在同等条件下对索引中的每个倒排链划分出最优的区间。利用区间最大的分数，新提出剪枝算法能更加精确的估算出候选文档的分数，减少找到候选文档的数量，同时对候选文档打分也能更早的结束。实验表明，这种基于最优化区间的方法的剪枝算法比根据倒排链长度来划分区间的方法处理查询的效率高。而基于区间最大索引的剪枝算法比基于块最大索引的剪枝算法效率高。
     5.在上述的研究成果的基础上，重新设计并实现了天网搜索引擎。使得天网搜索引擎在索引量和查询处理的效率都有非常大的提升。
Currently, more and more people use search engines as the main entrance toobtain information on the Web. But the growing number of web pages and queriesmake the performance of search engine a big challenge. Usually, a moden search en-gine needs to process thousands of queries on tens of billions of web pages per sec-ond. Therefore, how to efficiently process queries has been an important researchproblem in the field of search engine and information retrieval.
     In this paper, we study the methods of index pruning to enhance the efficiencyof query processing. Usually, the index pruning methods are divided into static indexpruning and dynamic index pruning. Static index pruning methods are used in indexconstruction phase. They remove the less important information in document toshorten the length of inverted list and reduce the size of the inverted index. As theinverted list is shorted, the speed of query processing can be enhanced. Dynamic in-dex pruning method are used in the query processing phase. They usually use somespecial auxiliary structure to skip less important documents. So the query processingspeed can be enhanced a lot. In this paper, we study efficiently query processing bythe static index pruning and dynamic index pruning methods. We propose some newindex structures and algorithms to support efficient query processing.
     The main work, primary innovations and contribution are as follows:
     1. In this paper, we propose a web page structure and importance based staticindex pruning method. On the one hand, this method use the web pagestructure to distinguish the importance of terms in different structures. Itpreferentially retained terms in the important structure. On the other hand,it treats the importance of web pages differently and reserved more termsfor important web pages. Experiments on GOV2datasets show that the webpage structure and web page importance can help to remove less importantterms and enhance the query effectiveness. The effectiveness of the pruningmethod which combine these two information as good as using full index when removing80%of the terms. At this time, the storage overhead is re-duced by73%and the speed of query processing enhanced2times.
     2. In this paper, we propose a framework for dynamic index pruning methodson the Block-Max index. We also propose three new dynamic index pruningalgorithms, respectively: Block-Max MaxScore (BMM)， Local Block-MaxWAND (LBMW) and Local Block-Max MaxScore (LBMM). In this framework,the query processing can be divided into three steps:1). Choose the candi-date documents. In this paper, we uses the local block maximum scores in-stead of terms global maximum scores to select the candidate documentswhich can reduce the number of candidate document a lot;2). Filter candi-date documents by using block’s maximum score. We presents a more effi-cient filtering method which can take full advantage of the locality of queryprocessing, and to enhance the efficiency of the filter;3). Scoring candidatedocuments. The proposed algorithms use block-max score to estimate theupper bound score of documents dynamically. It can stop scoring when theupper bound score below current threshold. Experiments show that theproposed algorithms can process queries more efficiently and they are onetimes faster than classical pruning methods.
     3. In this paper, we propose two kinds of methods to store block maximumstatic score and maximum relevant score. Using those scores, the dynamicindex pruning algorithms with static score rank function can process queriesmore efficiently. One approach is that it stores the maximum relevant scoreand static score for each block respectively; Another method is that it com-bines the relevant score and static score for each document and store max-imum combined score for each block additionally. Obviously, the secondmethod can estimate the upper bound scores of the candidate documentmore accurately. But using combined method, document’s upper boundscore will be underestimate in some case. We analyze those cases and pro-pose some solution to solve it. At the same time, we also discusses the dy-namic index pruning algorithm using document reordering technology. Ex-periments show that when the static scores proportion in ranking function issmaller, using URL ordering index can process queries faster. As the propor-tion increases, it is more suitable to use static score ordering index.
     4. In this paper, we propose a optimization method to divide intervals for each indexed term in the Space-Max index. We also propose four new dynamicindex pruning algorithms on the Space-Max index, respectively: Space-MaxWAND (SMW), Space-Max MaxScore (SMM), Space-Max AND (SMA) andSpace Max AND with Check (SMAC). The proposed interval divided methoduses two kinds of information: the estimation error of document scores andinterval processing overhead. Under the same conditions, the optimizedmethod can divied an optimal interval for each inverted index. Using thoseinterval’s maximum scores, the new proposed pruning algorithms can bemore accurate to estimate the upper bound score of a document. Thus, theycan find higher quality of candidate documents, and the candidate docu-ments can be stopped scoring earlier. The experiments show that pruningmethods using optimized interval divided are much faster than that usinginterval divided according to the length of inverted list. Meanwhile, theSpace-Max index based pruning methods are more efficient than block-maxindex based methods.
     5. On the basis of the above research, we redesign and re-implement the“TianWang” search engine The proposed methods can make a search en-gine which can index more documents and processing queries more effi-ciency.

引文

[1]李晓明.对中国曾有过静态网页数的一种估计[J].北京大学学报(自然科学版),2003,39(3):394–398.
    [2] Silverstein C, Marais H, Henzinger M, et al. Analysis of a very large web searchengine query log[C]//ACM,1999,33(1):6–12.
    [3] Zhang J, Long X, Suel T. Performance of compressed inverted list caching insearch engines[C]//J. Huai, R. Chen, H.-W. Hon, et al. Proceeding of the17thinternational conference on World Wide Web-WWW’08New York, NewYork, USA: ACM Press,2008:387–396.
    [4] Long X, Suel T. Three-level caching for efficient query processing in large Websearch engines[C]//Proceedings of the14th international conference onWorld Wide Web. New York, NY, USA: ACM,2005:257–266.
    [5] Gan Q, Suel T. Improved techniques for result caching in web searchengines[C]//Proceedings of the18th international conference on World wideweb. New York, NY, USA: ACM,2009:431–440.
    [6] Fagni T, Perego R, Silvestri F, et al. Boosting the performance of Web searchengines: Caching and prefetching query results by exploiting historical usagedata[J]. ACM Trans. Inf. Syst., New York, NY, USA: ACM,2006,24(1):51–78.
    [7] Baeza-Yates R, Gionis A, Junqueira F, et al. The impact of caching on searchengines[C]//Proceedings of the30th annual international ACM SIGIRconference on Research and development in information retrieval. New York,NY, USA: ACM,2007:183–190.
    [8] Cambazoglu B B, Junqueira F P, Plachouras V, et al. A refreshing perspective ofsearch engine caching[C]//Proceedings of the19th international conferenceon World wide web. New York, NY, USA: ACM,2010:181–190.
    [9] Markatos E P. On caching search engine query results[J]. ComputerCommunications,2001,24(2):137–143.
    [10] Dean J. Challenges in building large-scale information retrieval systems:invited talk[C]//Proceedings of the Second ACM International Conference onWeb Search and Data Mining. ACM,2009:1–1.
    [11] Badue C, Ribeiro-neto B, Baeza-Yates R, et al. Distributed query processingusing partitioned inverted files[C]//String Processing and InformationRetrieval,2001. SPIRE2001. Proceedings.Eighth International Symposium on.:10–20.
    [12] Melink S, Raghavan S, Yang B, et al. Building a distributed full-text index forthe web[J]. ACM Trans. Inf. Syst., New York, NY, USA: ACM,2001,19(3):217–241.
    [13] Puppin D, Silvestri F, Laforenza D. Query-driven document partitioning andcollection selection[C]//Proceedings of the1st international conference onScalable information systems. New York, NY, USA: ACM,2006.
    [14] Witten I H, Moffat A, Bell T C. Managing gigabytes: compressing and indexingdocuments and images[M]. Morgan Kaufmann,1999.
    [15] Trotman A. Compressing Inverted Files[J]. Information Retrieval, SpringerNetherlands,2003,6(1):5–19.
    [16] Zukowski M, Heman S, Nes N, et al. Super-Scalar RAM-CPU CacheCompression J22nd International Conference on Data Engineering (ICDE’06),Ieee,2006:59–59.
    [17] Yan H, Ding S, Suel T. Inverted index compression and query processing withoptimized document ordering[C]//Proceedings of the18th internationalconference on World wide web.2009:401–410.
    [18] LongXiaohui, SuelTorsten, Long X, et al. Optimized query execution in largesearch engines with global page ordering[C]//Proceedings of the29thinternational conference on Very large data bases.2003:129–140.
    [19] Carmel D, Cohen D, Fagin R, et al. Static Index Pruning for InformationRetrieval Systems.[C]//SIGIR. New York, NY, USA: ACM,2001:43–50.
    [20] Büttcher S, Clarke C L A. A document-centric approach to static index pruningin text retrieval systems[C]//Proceedings of the15th ACM internationalconference on Information and knowledge management. New York, NY, USA:ACM,2006:182–189.
    [21] Nguyen L T. Static Index Pruning for Information Retrieval System: APosting-Based Approach[C]//LSDS-IR Workshop.2009.
    [22] Skobeltsyn G, Junqueira F, Plachouras V, et al. ResIn: a combination of resultscaching and index pruning for high-performance web search engines[C]//S.-H.Myaeng, D.W. Oard, F. Sebastiani, et al. SIGIR. ACM,2008:131–138.
    [23] Blanco R, Barreiro A. Static Pruning of Terms in Inverted Files[C]//G. Amati, C.Carpineto, G. Romano. ECIR. Springer,2007,4425:64–75.
    [24] Zheng L, Cox I J. Entropy-Based Static Index Pruning[C]//M. Boughanem, C.Berrut, J. Mothe, et al. ECIR. Springer,2009,5478:713–718.
    [25] Zheng L, Cox I J. Document-Oriented Pruning of the Inverted Index inInformation Retrieval Systems[C]//AINA Workshops. IEEE Computer Society,2009:697–702.
    [26] Alting vde I S, Ozcan R, Ulusoy. Exploiting query views for static indexpruning in web search engines[C]//D.W.-L. Cheung, I.-Y. Song, W.W. Chu, et al.CIKM. ACM,2009:1951–1954.
    [27] Garcia S. Search Engine Optimization Using Past Queries[D]. RMIT University,2007.
    [28] Lam H T, Perego R, Silvestri F. On Using Query Logs for Static IndexPruning[C]//Web Intelligence and Intelligent Agent Technology (WI-IAT),2010IEEE/WIC/ACM International Conference on.2010,1:167–170.
    [29] Altingovde I S, Ozcan R, Ulusoy. Static index pruning in web search engines:Combining term and document popularities with query views[J]. ACM Trans.Inf. Syst., New York, NY, USA: ACM,2012,30(1):2:1–2:28.
    [30] Moura E S de, Santos C F dos, Araujo B D santos de, et al. Locality-Basedpruning methods for web search[J]. ACM Transactions on Information Systems,New York: ACM,2008,26(2):1–28.
    [31] De Moura E S, Dos Santos C F, Fernandes D R, et al. Improving Web searchefficiency via a locality based static pruning method[C]//WWW. New York,New York, USA: ACM Press,2005:235.
    [32] Blanco R, Barreiro A. Probabilistic static pruning of inverted files[J]. ACMTransactions on Information Systems, New York: ACM,2010,28(1):1–33.
    [33] Thota S, Carterette B. Within-Document Term-Based Index Pruning withStatistical Hypothesis Testing[G]//P. Clough, C. Foley, C. Gurrin, et al.Advances in Information Retrieval. Springer Berlin/Heidelberg,2011,6611:543–554.
    [34] Blanco R, Barreiro A. Boosting static pruning of inverted files[C]//W. Kraaij, A.P.de Vries, C.L.A. Clarke, et al. SIGIR. ACM,2007:777–778.
    [35] Ntoulas A, Cho J. Pruning policies for two-tiered inverted index withcorrectness guarantee[C]//Proceedings of the30th annual international ACMSIGIR conference on Research and development in information retrieval. NewYork, NY, USA: ACM,2007:191–198.
    [36] Persin M, Zobel J, Sacks-Davis R. Filtered Document Retrieval withFrequency-Sorted Indexes[J]. JASIS,1996,47(10):749–764.
    [37] Anh V N, Moffat A. Pruned query evaluation using pre-computedimpacts[C]//Proceedings of the29th annual international ACM SIGIRconference on Research and development in information retrieval. New York,NY, USA: ACM Press,2006:372–379.
    [38] Anh V N, De Kretser O, Moffat A. Vector-space ranking with effective earlytermination[C]//Proceedings of the24th annual international ACM SIGIRconference on Research and development in information retrieval. New York,NY, USA: ACM Press,2001:35–42.
    [39] Turtle H, Flood J. Query evaluation: strategies and optimizations[J]. Inf.Process. Manage., Tarrytown, NY, USA: Pergamon Press, Inc.,1995,31(6):831–850.
    [40] Broder A Z, Carmel D, Herscovici M, et al. Efficient query evaluation using atwo-level retrieval process[C]//Proceedings of the twelfth internationalconference on Information and knowledge management. New York, NY, USA:ACM,2003:426–434.
    [41] Ding S, Suel T. Faster top-k document retrieval using block-maxindexes[C]//Proceedings of the34th international ACM SIGIR conference onResearch and development in Information. New York, NY, USA: ACM,2011:993–1002.
    [42] Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti. Interval-BasedPruning for Top-k Processing over Compressed Lists[C]//ICDE.2011.
    [43] Buckley C, Lewit A F. Optimization of inverted vectorsearches[C]//Proceedings of the8th annual international ACM SIGIRconference on Research and development in information retrieval.1985:97–110.
    [44] Campinas S, Delbru R, Tummarello G. SkipBlock: Self-indexing for Block-BasedInverted List[C]//P. Clough, C. Foley, C. Gurrin, et al. Advances in InformationRetrieval. Springer Berlin/Heidelberg,2011,6611:555–561.
    [45] Ilyas I F, Beskales G, Soliman M A. A survey of top-k query processingtechniques in relational database systems[J]. ACM Comput. Surv., New York,NY, USA: ACM,2008,40(4):11:1–11:58.
    [46] Fagin R. Optimal aggregation algorithms for middleware[J]. Journal ofComputer and System Sciences,2003,66(4):614–656.
    [47] Strohman T, Croft W B. Efficient document retrieval in mainmemory[C]//Proceedings of the30th annual international ACM SIGIRconference on Research and development in information retrieval. New York,NY, USA: ACM,2007:175–182.
    [48] Anh V N, Moffat A. Pruning strategies for mixed-mode querying[C]//P.S. Yu,V.J. Tsotras, E.A. Fox, et al. CIKM. ACM,2006:190–197.
    [49] Strohman T, Croft W B. Efficient document retrieval in mainmemory[C]//Proceedings of the30th annual international ACM SIGIRconference on Research and development in information retrieval-SIGIR’07New York, NY, USA: ACM Press,2007:175–182.
    [50] Jonassen S, Bratsberg S. Efficient Compressed Inverted Index Skipping forDisjunctive Text-Queries[J]. Advances in Information Retrieval, Springer,2011:530–542.
    [51] Zhang F, Shi S, Yan H, et al. Revisiting globally sorted indexes for efficientdocument retrieval[C]//Proceedings of the third ACM internationalconference on Web search and data mining. New York, NY, USA: ACM Press,2010:371–380.
    [52] Tonellotto N, Macdonald C, Ounis I. Efficient dynamic pruning with proximitysupport[C]//Large-Scale Distributed Systems for Information Retrieval.2010:33–37.
    [53] Zhu M, Shi S, Yu N, et al. Can phrase indexing help to process non-phrasequeries?[C]//Conference on Information and Knowledge Management.2008:679–688.
    [54] Yan H, Shi S, Zhang F, et al. Efficient term proximity search with term-pairindexes[C]//Proceedings of the19th ACM international conference onInformation and knowledge management-CIKM’10New York, NY, USACM Press,2010:1229–1238.
    [55] Cambazoglu B B, Zaragoza H, Chapelle O, et al. Early exit optimizations foradditive machine learned ranking systems[C]//Proceedings of the third ACMinternational conference on Web search and data mining. New York, NY, USA:ACM,2010:411–420.
    [56] Wang L, Lin J, Metzler D. A cascade ranking model for efficient rankedretrieval[C]//Proceedings of the34th international ACM SIGIR conference onResearch and development in Information. New York, NY, USA: ACM,2011:105–114.
    [57]李晓明，闫宏飞，王继民.搜索引擎——原理,技术与系统（第二版）[M].科学出版社,2012.
    [58] Marin M, Gil-Costa V. High-performance distributed invertedfiles[C]//Proceedings of the sixteenth ACM conference on Conference oninformation and knowledge management. New York, NY, USA: ACM,2007:935–938.
    [59] Zobel J, Moffat A. Inverted files for text search engines[J]. ACM Comput. Surv.,New York, NY10036-5701, United States: Association for ComputingMachinery,2006,38(2).
    [60] Salton G, Wong A, Yang C S. A vector space model for automatic indexing[J].Communications of the ACM, ACM,1975,18(11):613–620.
    [61] Robertson S E, Walker S, Beaulieu M. Okapi at TREC-7: automatic ad hoc,filtering, VLC and interactive track[J]. Nist Special Publication SP, Citeseer,1999:253–264.
    [62] Song F, Croft W B. A general language model for informationretrieval[C]//Proceedings of the eighth international conference onInformation and knowledge management.1999:316–321.
    [63] Zobel J, Moffat A, Ramamohanarao K. Inverted files versus signature files fortext indexing[J]. ACM Transactions on Database Systems (TODS), ACM,1998,23(4):453–490.
    [64] Anh V N, Moffat A. Structured Index Organizations for High-Throughput TextQuerying[C]//F. Crestani, P. Ferragina, M. Sanderson. SPIRE. Springer,2006,4209:304–315.
    [65] Yan H, Ding S, Suel T. Compressing term positions in webindexes[C]//Proceedings of the32nd international ACM SIGIR conference onResearch and development in information retrieval-SIGIR’09New York, NewYork, USA: ACM Press,2009:147–154.
    [66] Anh V N, Moffat A. Inverted index compression using word-aligned binarycodes[J]. Information Retrieval, Springer,2005,8(1):151–166.
    [67] Shan D, Zhao W X, He J, et al. Efficient phrase querying with flat positionindex[C]//CIKM.2011:2001–2004.
    [68] Chakrabarti K, Chaudhuri S, Ganti V. Interval-based pruning for top-kprocessing over compressed lists[C]//Data Engineering (ICDE),2011IEEE27thInternational Conference on.2011:709–720.
    [69] BRIN S, PAGE L. The anatomy of a large-scale hypertextual Web searchengine[J]. Computer Networks and ISDN Systems,1998,30(1-7):107–117.
    [70] Salton G, Fox E A, Wu H. Extended Boolean information retrieval[J]. Commun.ACM, New York, NY, USA: ACM,1983,26(11):1022–1036.
    [71] Dean J A, Haahr P G, Sercinoglu O, et al. Multi-stage query processing systemand method for use with tokenspace repository[J]. Google Patents,2004.
    [72] Büttcher S, Clarke C L A, Lushman B. Term proximity scoring for ad-hocretrieval on very large text collections[C]//E.N. Efthimiadis, S.T. Dumais, D.Hawking, et al. Proceedings of the29th annual international ACM SIGIRconference on Research and development in information retrieval-SIGIR’06ACM,2006:621–622.
    [73] Liu T-Y. Learning to rank for information retrieval[M]. Springer,2011,13.
    [74]董守斌,袁华.网络信息检索[M].西安电子科技大学出版社,2010.
    [75] Croft W B, Metzler D, Strohman T. Search engines: Information retrieval inpractice[M]. Addison-Wesley,2010.
    [76]何靖,李晓明.搜索引擎效果评测--基于用户点击日志分析的方法与技术[M].高等教育出版社,2012.
    [77] Fagin R, Kumar R, Sivakumar D. Comparing top k lists[C]//Proceedings of thefourteenth annual ACM-SIAM symposium on Discrete algorithms.2003:28–36
    [78] Hu Y, Xin G, Song R, et al. Title extraction from bodies of HTML documents andits application to web page retrieval[C]//Proceedings of the28th annualinternational ACM SIGIR conference on Research and development ininformation retrieval. New York, NY, USA: ACM,2005:250–257.
    [79] Changuel S, Labroche N, Bouchon-Meunier B. A General Learning Method forAutomatic Title Extraction from HTML Pages[G]//P. Perner. Machine Learningand Data Mining in Pattern Recognition. Springer Berlin Heidelberg,2009,5632:704–718.
    [80]毛先领,何靖,闫宏飞.网页去噪:研究综述[J].计算机研究与发展,2010,47(12):2025–2036.
    [81] Dang V, Croft B W. Query reformulation using anchor text[C]//Proceedings ofthe third ACM international conference on Web search and data mining. NewYork, NY, USA: ACM,2010:41–50.
    [82] Eiron N, McCurley K S. Analysis of anchor text for web search[C]//Proceedingsof the26th annual international ACM SIGIR conference on Research anddevelopment in informaion retrieval. New York, NY, USA: ACM,2003:459–460.
    [83] Kleinberg J M. Authoritative Sources in a Hyperlinked Environment[J]. J. ACM,1999,46(5):604–632.
    [84] Gy ngyi Z, Garcia-Molina H, Pedersen J. Combating web spam withtrustrank[C]//Proceedings of the Thirtieth international conference on Verylarge data bases-Volume30. VLDB Endowment,2004:576–587.
    [85] Zhu M, Shi S, Li M, et al. Effective top-k computation in retrieving structureddocuments with term-proximity support[C]//Proceedings of the sixteenthACM conference on Conference on information and knowledge management.New York, NY, USA: ACM,2007:771–780.
    [86] Strohman T, Turtle H, Croft W B. Optimization strategies for complexqueries[C]//Proceedings of the28th annual international ACM SIGIRconference on Research and development in information retrieval. New York,NY, USA: ACM,2005:219–225.
    [87] Blanco R, Barreiro A. Document Identifier Reassignment ThroughDimensionality Reduction[G]//Advances in Information Retrieval. SpringerBerlin/Heidelberg,2005,3408:375–387.
    [88] CHEN C, HE J, SHAN D, et al. Optimize Document Identifier Assignment forInverted Index Compression[J]. Journal of Computational InformationSystems6,2010,2:339–346.
    [89] Silvestri F. Sorting out the document identifier assignment problem[J].Advances in Information Retrieval,2007:101–112.
    [90] Blanco R, Barreiro A. TSP and cluster-based solutions to the reassignment ofdocument identifiers[J]. Information Retrieval, Springer Netherlands,2006,9(4):499–517.
    [91] Shieh W. Inverted file compression through document identifierreassignment[J]. Information Processing&Management,2003,39(1):117–131.
    [92] Craswell N, Robertson S, Zaragoza H, et al. relevance weighting for queryindependent evidence[C]//Annual ACM Conference on Research andDevelopment in Information Retrieval. New York, NY, USA: ACM Press,2005:416–423.
    [93] Song R, Wen J R, Shi S, et al. Microsoft research asia at web track and terabytetrack of trec2004[C]//Proceedings of the Thirteenth Text REtrievalConference Proceedings (TREC-2004).2004.
    [94] Dimopoulos C, Nepomnyachiy S, Suel T. Optimizing top-k document retrievalstrategies for block-max indexes[C]//Proceedings of the sixth ACMinternational conference on Web search and data mining. New York, NY, USA:ACM,2013:113–122.
    [95] Schenkel R, Broschart A, Hwang S, et al. Efficient text proximitysearch[C]//String Processing and Information Retrieval.2007:287–299.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700