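Research on Adaptive Web Information Acquisition Service Techniques (自适应网络信息获取服务技术研究)

English title: Research on Adaptive Techniques for Web Information
Author: Liu Kangmiao (刘康苗)
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 网络信息获取 ; 自适应技术 ; 信息拉取 ; 信息推送 ; 查询歧义性 ; 个性化建模 ; 分布式索引组织策略
English keywords: Web Information Acquisition ; Adaptive Technique ; Information Pull ; Information Push ; Query Ambiguity ; User Modeling ; Indexing Organization Strategy
Degree year: 2008
Advisors: Chen Chun (陈纯) ; Bu Jiajun (卜佳俊)
Discipline code: 081203
Degree-granting institution: Zhejiang University (浙江大学)
Submission date: 2008-04-01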

Abstract

The development of web technology has greatly enriched the information resources available to users, but the disorder and uneven quality of those resources also make it hard for users to find what they need. A Web Information Acquisition Service (WIAS) uses modern information technology to provide individual users with the Internet information products and services that meet their needs; its two service modes are information pull and information push. Adaptive WIAS techniques dynamically adjust the service's behavior according to user needs, information source characteristics, system load, and other factors, so that high-quality information is delivered efficiently and in a user-friendly way.
     Perceiving user needs accurately and completely is the foundation of WIAS. Because web users are both consumers and producers of web information, their needs can be inferred by analyzing the content they browse, their behavior, and the information they publish. Once the needs are known, the keys to a successful service are selecting the relevant information from the vast body of web resources and presenting it in a more user-friendly way. In addition, users usually expect information promptly, so guaranteeing the performance of the acquisition system is also an important research topic.
     To address these problems, this thesis first proposes an adaptive information pull technique based on measuring query ambiguity. The ambiguity of a user request is quantified, and the way results are presented is chosen adaptively according to that ambiguity. For result selection and result presentation, a multi-feature fusion ranking algorithm and a multi-feature clustering algorithm are proposed, respectively. Both are validated on two representative emerging Internet resources: multimedia information (with images as the example) and frequently updated dynamic resources (with blogs as the example).
     Second, for the two roles in web activity, information publishers and information browsers, the thesis proposes for each an adaptive information push technique based on user modeling. For publishers, blogs, a currently popular personal publishing platform, serve as the research environment: a method is proposed that models users' long-term and short-term interests from their blog posts, and the blogspace is partitioned into communities so that bloggers with similar interests can be recommended as friends. For browsers, the content of the page currently being viewed is taken as evidence of the user's profile, and a contextual advertising technique based on sentiment and topic analysis is proposed, so that the pushed advertisements are not only topically relevant but also consistent with the user sentiment latent in the page content, and are therefore better targeted.
     Next, to meet the performance and scalability requirements of WIAS, the thesis takes the search engine, a typical information pull application, as the entry point and proposes Loc-Glob, a hybrid distributed index organization strategy with good scalability. Performance is then optimized on top of Loc-Glob: the index is redistributed and replicated based on index-term workload and the dynamically changing query stream, which smooths the workload across index servers, and query paths are adaptively optimized based on the real-time system load of the index servers, which improves load balancing.
     Based on this work, a prototype blogspace information acquisition system that adopts the adaptive techniques is designed and implemented. It provides application services such as a blog search engine, blog friend recommendation, and advertisement recommendation, and verifies the feasibility of the adaptive techniques proposed for the two service modes, information pull and information push.
     Finally, the thesis summarizes the work and discusses future directions.
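
The idea of choosing a presentation style from the measured ambiguity of a request can be illustrated with a small sketch. The abstract does not say how the thesis quantifies ambiguity, so everything here is an assumption made for illustration: ambiguity is taken to be the normalized Shannon entropy of hypothetical topic labels attached to the top results, and the 0.6 threshold and the names `ambiguity` and `choose_presentation` are invented.

```python
import math
from collections import Counter


def ambiguity(topic_labels):
    """Normalized Shannon entropy of the topic labels, in [0, 1].

    Assumption: each top-ranked result carries one topic label.
    0.0 means all results share one topic; 1.0 means the observed
    topics are evenly spread.
    """
    counts = Counter(topic_labels)
    if len(counts) <= 1:
        return 0.0
    total = len(topic_labels)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


def choose_presentation(topic_labels, threshold=0.6):
    """Return 'clustered' for ambiguous requests, else 'ranked_list'.

    The threshold is a hypothetical tuning parameter.
    """
    if ambiguity(topic_labels) >= threshold:
        return "clustered"
    return "ranked_list"
```

With this sketch, a request whose top results split evenly between two senses would be shown as clusters, while a request dominated by one sense would be shown as a plain ranked list.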

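The load-based query path optimization mentioned in the abstract can likewise be sketched in a few lines. This is not the Loc-Glob algorithm itself, whose details the abstract does not give. It only assumes that a term's postings are replicated on several index servers and that a router sends each query term to the replica with the lowest current load; `route_query`, `replicas`, and `load` are hypothetical names.

```python
def route_query(terms, replicas, load):
    """Assign each query term to the least-loaded server holding its postings.

    replicas: dict mapping a term to the list of server ids that store it
    load:     dict mapping a server id to its current load (updated in place)
    Returns a dict mapping each routable term to the chosen server id.
    """
    plan = {}
    for term in terms:
        candidates = replicas.get(term, [])
        if not candidates:
            continue  # term is not indexed on any server
        server = min(candidates, key=lambda s: load[s])
        plan[term] = server
        load[server] += 1  # account for the work just assigned
    return plan
```

Updating `load` as terms are assigned keeps a multi-term query from piling onto a single replica; a real system would refresh these figures from server-side measurements instead.
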
