Faceted Search Over Large-scale Heterogeneous Web

Original title (Chinese): 大规模异构Web的方面搜索研究
Author: Zhu Fanwei (朱凡微)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: faceted entity search; large-scale Web search; faceted metadata; expansible search paradigm; efficient PPV algorithm; dynamic facet ranking
Degree year: 2012
Advisors: Ying Jing (应晶); Wu Minghui (吴明晖)
Discipline code: 081203
Degree-granting institution: Zhejiang University
Submission date: 2012-07-01

Abstract

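With the widespread adoption and diversification of Internet applications, information on the Web is growing explosively. This sheer volume of data, however, makes retrieval harder: users find it increasingly difficult both to describe their queries and to locate the results they want. How to support users effectively in retrieving Web information has therefore become an active research topic in Internet search.
     Faceted search, a typical exploratory search technique, combines the navigational guidance of directory-style browsing with the flexibility of keyword search, offering a convenient and efficient way to search large data spaces. Faceted search, however, requires the dataset to have a well-defined faceted classification. For a cross-domain dataset such as the Web, which lacks metadata, constructing such a classification effectively is a major obstacle to applying faceted search. In addition, the heterogeneity and scale of Web data, together with the shifting nature of user intent in Web search, place further demands on any implementation of faceted search.
     This thesis studies and addresses the difficulties of implementing faceted search on the Web. It proposes a complete approach to faceted Web search, covering data preparation, the search model, and the ranking algorithms, and implements FacetedWeb, a prototype faceted Web search system over a real Web dataset that takes named entities as its search targets. The performance and effectiveness of the proposed techniques are evaluated through benchmark comparisons and user studies.
     To address the lack of effective metadata in Web text, which makes a faceted classification hard to construct, the thesis first proposes a structured annotation method for the Web based on named entities. The method converts unstructured Web documents into structured entity tuples, supporting semantic search at entity granularity while also enabling faceted classification and search over the Web. Taking into account the uncertainty inherent in entity recognition and tuple construction on Web data, it then proposes a faceted Web search framework based on user navigation cost. Next, the thesis analyzes the new challenges that the scale and heterogeneity of Web data pose for faceted search, studies in turn the search model and ranking algorithms suited to a faceted Web search system, and proposes the following improvements:
     1) Expansible search model: To handle the uncertainty and shifting of user intent in Web search, and the efficiency challenge posed by large-scale Web data, the thesis proposes an expansible search model. Expansible search first retrieves the entities that co-occur frequently with the query and are highly similar to it to form an initial result set, which greatly improves search efficiency while maintaining high precision. During iterative search, the model not only refines the initial result set but also, following the changes in query intent that the user expresses by selecting facets, dynamically retrieves entities relevant to the new intent and expands the initial result set with them, improving the effectiveness of the results throughout the iterative search.
     2) Fast entity ranking algorithm: To rank entities more efficiently on large-scale Web datasets, the thesis proposes FastPPV, an incremental algorithm for fast entity relevance computation. FastPPV partitions the tours (visiting paths) involved in exact PPV computation into subsets of differing importance, and trades off the efficiency and accuracy of PPV computation by scheduling these subsets. The thesis also proposes an efficient hub-based implementation of FastPPV. It uses the hub length of a tour as the measure of its importance in order to partition the tours effectively, and uses initial PPVs of hub nodes precomputed offline to assemble the PPV increment for any query at any iteration. This allows partial results to be reused across different PPV computations and greatly improves efficiency.
     3) Dynamic facet ranking algorithm: To handle the shifting of user focus during faceted search over the heterogeneous Web, and the possibility that the expansible search model misses some entities, the thesis proposes a dynamic facet ranking algorithm that combines the local and global relevance of facets. Local relevance, that is, the relevance of a facet to the query computed over the current result set, ensures that the facet list reflects the dynamic changes in the user's browsing focus and yields results that match the user's search intent. Global relevance, computed over the entire dataset, uncovers the inherent connections among facets and gives the user a path to entities missing from the initial result set.
     Extensive experiments on the prototype system FacetedWeb confirm the advantages of the expansible search model and of the entity and facet ranking algorithms in both result effectiveness and search efficiency.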
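The incremental, accuracy-aware idea behind item 2) can be illustrated with a small Python sketch. It is not FastPPV itself: the abstract partitions tours by hub length and reuses precomputed hub PPVs, whereas this sketch uses the simpler and well-known partition of tours by plain length. PPV is read here as the personalized PageRank vector; the function name `incremental_ppv`, the teleport probability `alpha`, and the rule that a dangling node jumps back to the query are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # node -> list of out-neighbors


def incremental_ppv(graph: Graph, query: str, alpha: float = 0.15,
                    max_len: int = 50, tol: float = 1e-4
                    ) -> Tuple[Dict[str, float], float]:
    """Approximate the PPV of `query` by adding tours in order of length.

    Returns (scores, bound), where `bound` is the total probability mass
    of the tours that have not been added yet, i.e. an upper limit on the
    L1 error of `scores`.
    """
    scores: Dict[str, float] = {}
    # frontier[v]: decayed probability that a walk of the current length,
    # started at `query`, is at node v.
    frontier: Dict[str, float] = {query: 1.0}
    remaining = 1.0
    for _ in range(max_len + 1):
        # Tours of the current length end here with probability alpha.
        for node, mass in frontier.items():
            scores[node] = scores.get(node, 0.0) + alpha * mass
        remaining = (1.0 - alpha) * sum(frontier.values())
        if remaining <= tol:
            break
        # Extend every tour by one edge, decaying by (1 - alpha).
        nxt: Dict[str, float] = {}
        for node, mass in frontier.items():
            # Assumption: a node without out-edges jumps back to the query.
            out = graph.get(node) or [query]
            share = (1.0 - alpha) * mass / len(out)
            for v in out:
                nxt[v] = nxt.get(v, 0.0) + share
        frontier = nxt
    return scores, remaining
```

Calling the function with a small `max_len` gives a cheap, coarse estimate together with its error bound; raising `max_len` or lowering `tol` tightens the bound at the cost of more work, which is the efficiency and accuracy trade-off the abstract describes.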
As the Web becomes increasingly popular and diverse, we have witnessed explosive growth in Web information. While the large amount of data published and shared on the Web provides an enormous and varied information repository, the scale and diversity of Web data also increase the difficulty of Web information retrieval. How to develop an effective and efficient Web search approach has therefore become a practical research topic.
     Combining free-text search and faceted navigation, faceted search guides a user's search by iteratively providing valid query refinements, so that the user can flexibly navigate through the result set without feeling lost or reaching a "dead end". However, faceted search requires a well-defined faceted classification to organize and classify the dataset, which makes the technique difficult to apply to the largest corpus of all: the Web. In addition, the heterogeneous and unstructured nature of Web data makes the faceted search problem more challenging.
     To tackle these challenges, this thesis studies the main issues in faceted Web search and integrates the proposed techniques into a systematic solution that covers the data model, the search paradigm, and the ranking algorithms. We also develop a faceted search prototype, FacetedWeb, over a real Web corpus.
     We first propose to leverage the named entities in the Web to model unstructured Web data as a structured entity tuple database, which can be built conceptually to support fine-grained faceted search. We take into account real Web uncertainties, such as entity extraction uncertainty and tuple construction uncertainty, to develop a minimum-cost faceted search framework. We then analyze the main techniques for implementing a faceted search system over the large-scale Web corpus and propose a corresponding search model and ranking algorithms, as follows:
     1) Expansible search paradigm: Considering the drawbacks of the traditional refined search paradigm, i.e., search within search, we propose a new expansible search paradigm that captures changing user intent in heterogeneous Web search and improves the efficiency of the search system. The paradigm first returns the most promising results, which appear in the neighborhood of the query node, to construct an initial result set efficiently. It then dynamically re-searches the dataset as the user's intent changes, filtering out results that are no longer of interest and expanding the result set with newly relevant answers.
     2) Fast entity ranking algorithm: To improve the efficiency of entity ranking, we propose FastPPV, an approximate PPV computation method that is incremental and accuracy-aware. The computation is partitioned and scheduled for processing in an "organized" way, so that we can gradually improve the PPV approximation and quantify the accuracy of the estimate at query time. We also develop a hub-based solution that partitions and prioritizes the computation efficiently, so that the sub-structure shared between different tour partitions can be reused to speed up computation.
     3) Dynamic facet ranking algorithm: To ensure that iterative faceted search is aware of query intent and that entities missing from the initial results can be retrieved in subsequent iterations if needed, we propose a dynamic facet ranking approach. Our approach re-ranks the facets by their relevance with respect to both the initial query and the iteratively chosen facets. We propose a hybrid model that combines local relevance and global relevance to achieve the best performance.
     We conduct comprehensive experiments on FacetedWeb, and the results validate the effectiveness and efficiency of the proposed techniques and algorithms.
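The hybrid of local and global relevance in item 3) can be pictured with a minimal Python sketch. The abstract does not give the scoring formulas, so everything concrete here is an assumption: local relevance is taken as the fraction of the current result set carrying a facet, global relevance as the fraction of all entities in the dataset that carry the chosen facets and also carry the candidate facet, and the two are mixed linearly with a hypothetical weight `lam`. The names `facet_frequency` and `rank_facets` are illustrative only.

```python
from typing import Dict, List, Set, Tuple

Entities = Dict[str, Set[str]]  # entity id -> set of facet values


def facet_frequency(entities: Entities, ids: Set[str], facet: str) -> float:
    """Fraction of the entities in `ids` that carry `facet`."""
    if not ids:
        return 0.0
    return sum(1 for e in ids if facet in entities[e]) / len(ids)


def rank_facets(entities: Entities, results: Set[str], chosen: Set[str],
                lam: float = 0.5) -> List[Tuple[str, float]]:
    """Rank candidate facets by lam * local + (1 - lam) * global relevance."""
    # Global pool: every entity in the whole dataset that carries all the
    # facets chosen so far, whether or not it is in the current results.
    pool = {e for e, facets in entities.items() if chosen <= facets}
    candidates = set().union(*entities.values()) - chosen
    scored = []
    for facet in candidates:
        local = facet_frequency(entities, results, facet)  # current results
        glob = facet_frequency(entities, pool, facet)      # entire dataset
        scored.append((facet, lam * local + (1.0 - lam) * glob))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

With `lam = 1.0` only facets present in the current results receive a score, so an entity missed by the initial search stays unreachable. Any `lam < 1.0` lets a facet that is frequent in the wider dataset surface in the list, which is the path to initially missing entities that the abstract refers to.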

