Research and Implementation of a Digital Book Search System Based on User Click-through Data

Original title: 基于用户点击行为的数字图书搜索系统研究与实现
Author: Yuan Chuan (袁川)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Digital Library ; Correlation Graph ; BookRank ; Query Clustering ; Ensemble of Multiple Information Sources
Degree year: 2008
Advisors: Wu Jiangqin (吴江琴) ; Zhuang Yueting (庄越挺)
Discipline code: 081203
Degree-granting institution: Zhejiang University
Submission date: 2008-06-10

Abstract

Since the mid-1990s, many countries, developed and developing alike, have invested heavily in digital libraries, which have grown rapidly and become an important way for people to obtain information and knowledge. Book search is a core supporting service that any digital library must provide. This thesis studies digital book search and the ranking of search results in depth, and implements a working system, so that readers can quickly find the books they need in a massive collection of digital books.
     Traditional digital book search is built on a relational database and judges relevance by simple keyword matching, so it can only return entries that contain the keywords the reader entered. It does not reflect a book's quality or popularity, lacks an effective combined ranking mechanism, and cannot draw on several ranking criteria at once.
     The main work of this thesis is as follows. 1. Using the rich usage logs of the digital library portal, the thesis proposes two random-walk algorithms on click-through data. BookRank is a book scoring algorithm based on the access Correlation Graph: it extracts users' book-click behavior from the access logs, builds a Correlation Graph of the books that users read, and runs a random walk on that graph to rank books by relevance. QueryCluster is a query clustering algorithm based on query-and-read behavior: it extracts queries and book reading records from the access logs, treats what readers go on to read as implicit feedback on the search results, and uses the clustering effect of random walks to group queries. 2. Book rating data is crawled from well-known online bookstores on the Internet and integrated into the ranking system as another important ranking criterion. 3. Building on query clustering, a method for integrating multiple ranking criteria is implemented: for each class of queries, the final ranking combines three information sources, namely the relevance ranking derived from the Correlation Graph, the book ratings from online bookstores, and text similarity. 4. A digital book search system based on these algorithms and techniques was developed and deployed on the CADAL portal (the Chinese-English digital book cooperation program for higher-education institutions). According to feedback from users in actual use, the new system ranks search results more reasonably than the original book search module.
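The idea of ranking books by a random walk over an access correlation graph can be illustrated with a minimal sketch. The abstract does not give the actual BookRank formulation, so everything below is an assumption made for illustration: the graph links two books whenever one session accessed both, edge weights count co-occurring sessions, and the walk is a generic PageRank-style iteration with a damping factor. The function names, the `sessions` data, and the parameter values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_correlation_graph(sessions):
    """Link two books whenever the same session accessed both.

    Edge weight = number of sessions in which the pair co-occurs
    (an assumed weighting, not taken from the thesis).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for books in sessions:
        for a, b in combinations(sorted(set(books)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration on a weighted undirected graph."""
    nodes = sorted(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total edge weight leaving each node, used to normalise transitions.
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        # Every node first receives the uniform "teleport" share.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            for neighbor, weight in graph[node].items():
                share = scores[node] * weight / out_weight[node]
                new_scores[neighbor] += damping * share
        scores = new_scores
    return scores


# Hypothetical access sessions: each list holds the books one visitor opened.
sessions = [
    ["book_a", "book_b", "book_c"],
    ["book_a", "book_b"],
    ["book_b", "book_d"],
]
graph = build_correlation_graph(sessions)
scores = random_walk_scores(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy data the most heavily co-accessed book ends up first in `ranking`. A real deployment would need to decide how sessions are delimited in the logs and how the resulting score is combined with text similarity and external ratings, neither of which this sketch addresses.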

