Research and Improvement on Model of Personalized Search Engine

Original title: 个性化搜索引擎模型的研究与改进
Author: Li Lianjiang (李连江)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: search engine; ranking algorithm; satisfaction; user interest; personalization
Degree year: 2008
Advisor: Zhang Jianpei (张健沛)
Discipline code: 081202
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Submission date: 2008-01-10

Abstract

Search engines let people quickly find the content they need in a large volume of information. Compared with the early, single-function search engines, today's search engines have developed considerably. However, existing search engine technology is still not intelligent enough: it cannot pick out, from a large number of search results, the ones the user is genuinely interested in. This is the problem this thesis sets out to study and improve.
     To address users' need for personalized search services, the author presents the design of a page-ranking algorithm for a personalized search engine. A Web data mining approach is used to judge from user actions whether the user is "interested" in a page. Building on a study and analysis of existing search engine ranking techniques, clustering is used to classify web pages. A keyword/user-interest table is built to store user interest information, with a page-type table as support. After reviewing Chinese and international work on personalized search, the thesis proposes a weight formula suited to personalized ranking, which analyzes the interest information stored in the user-interest table to produce a ranking that matches the user's interests. Based on this ranking algorithm, the thesis also builds a personalized search engine model and analyzes and designs the implementation of each part. A personalization analysis module and a page-type analysis module are added to the model in order to strengthen the engine's personalization analysis, make search results better fit users' needs, and raise user satisfaction with the personalized search engine.
     Finally, through experiments comparing against a traditional search engine, the author verifies that the search engine model using the personalized ranking algorithm achieves higher user satisfaction. Possible problems are analyzed and directions for further research are pointed out.
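The abstract names the main components (interest detection from user actions, a user-interest table keyed by keyword, a page-type table, and a weight formula for re-ranking) but does not give the formula itself. The following Python sketch illustrates how such components could fit together. Everything in it is an assumption made for illustration and is not taken from the thesis: the dwell-time threshold, the `ALPHA` boost factor, the class and function names, and the multiplicative weighting.

```python
from collections import defaultdict

# Hypothetical parameters; the thesis's actual thresholds and weight
# formula are not stated in the abstract.
DWELL_SECONDS_THRESHOLD = 30.0
ALPHA = 0.5


def is_interested(dwell_seconds, saved_or_bookmarked):
    """Guess interest from user actions (assumed signals: dwell time, saving)."""
    return saved_or_bookmarked or dwell_seconds >= DWELL_SECONDS_THRESHOLD


class UserInterestTable:
    """Maps a query keyword to counts of page types the user showed interest in."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, keyword, page_type):
        """Store one observation of interest in a page of the given type."""
        self._counts[keyword][page_type] += 1

    def interest(self, keyword, page_type):
        """Return the share of recorded interest for this keyword and page type."""
        by_type = self._counts.get(keyword)
        if not by_type:
            return 0.0
        total = sum(by_type.values())
        return by_type.get(page_type, 0) / total


def personalized_rank(keyword, results, page_types, interests, alpha=ALPHA):
    """Re-rank (url, base_score) pairs, boosting page types the user prefers.

    page_types stands in for the page-type table: a dict from URL to type.
    """
    def score(item):
        url, base_score = item
        page_type = page_types.get(url, "unknown")
        return base_score * (1.0 + alpha * interests.interest(keyword, page_type))

    return sorted(results, key=score, reverse=True)
```

In this sketch a page the base ranking places second can move to first once the user has shown interest in its page type for the same keyword, while a user with no recorded interests sees the base order unchanged.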

