Research on a Web Page Authority Ranking and Classification Algorithm Based on Link Reputation Analysis

English title: Web Authority Sort Classification Algorithm Based on the Analysis of Link Credibility
Author: Zhao Hang (赵航)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: text classification; link analysis; link reputation; PageRank; category search
Degree year: 2012
Advisor: Yang Tianqi (杨天奇)
Discipline code: 081203
Degree-granting institution: Jinan University (暨南大学)
Chair of the defense committee: Bao Susu (鲍苏苏)

Abstract

As the Internet has become widespread, the number of web pages has grown exponentially, and users face considerable difficulty finding information with existing search engines. There are two main reasons. First, search results mix topics: search engines do not group results by topic, which makes it harder for users to find information of the topic type they need. Second, the quality of the returned pages is uneven (results include spam pages and spam advertisements), which makes it harder for users to filter out high-quality information. To address these problems, this thesis does the following work.
     First, to resolve the mixing of topics in search results, the thesis labels web pages with topic categories, so that users can restrict a search to the category they need and locate the desired information faster and more accurately.
     Second, to improve the accuracy of web page text classification, the thesis proposes a feature weighting algorithm based on feature-noise weighting. By reducing the influence of feature noise caused by non-standard word usage, the method improves the accuracy and robustness of web page text classification.
     Third, to address the uneven quality of retrieved pages, the thesis introduces the merchant reputation model of the market economy into the evaluation and ranking of web page authority. By mining historical link reputation evaluations, it builds an evaluation model combined with the PageRank algorithm to adjust the ranking of pages, which improves the quality of the top-ranked search results and gives page authors an incentive to focus on creating high-quality pages.
     Finally, a system model is built on these ideas to demonstrate their feasibility.
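
The reputation-adjusted ranking summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reputation model is combined with PageRank, so the scheme here is an assumption made only for illustration: each link carries a hypothetical reputation weight, and a page passes its rank to its out-links in proportion to those weights. With every weight equal to 1.0 this reduces to standard PageRank. The function name `pagerank`, the `weights` parameter, and the sample graph are all illustrative and are not taken from the thesis.

```python
def pagerank(links, weights=None, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank over a dict {page: [out-linked pages]}.

    weights is an optional dict {(src, dst): reputation} (hypothetical);
    a missing entry defaults to 1.0, which gives standard PageRank.
    """
    pages = sorted(set(links) | {d for dsts in links.values() for d in dsts})
    n = len(pages)
    weights = weights or {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives the same "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            outs = links.get(src, [])
            w = [weights.get((src, dst), 1.0) for dst in outs]
            total = sum(w)
            if total == 0.0:
                # Dangling page (or all links have zero reputation):
                # spread its rank uniformly over all pages.
                for p in pages:
                    new[p] += damping * rank[src] / n
            else:
                # Share rank among out-links in proportion to reputation.
                for dst, wi in zip(outs, w):
                    new[dst] += damping * rank[src] * wi / total
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Assumed toy graph: "a" links to "b" and "c", and both link back to "a".
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
plain = pagerank(links)
# Assumed reputation: the link a -> b has a poor historical evaluation.
adjusted = pagerank(links, weights={("a", "b"): 0.1})
print(plain)
print(adjusted)
```

In this toy graph, "b" and "c" tie under standard PageRank, while the low-reputation link moves "b" below "c" in the adjusted ranking. Weighting the transitions rather than rescaling the final scores keeps the ranks summing to 1, which is one reason to prefer it in a sketch like this.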
