Webpage Ranking Optimization with an Improved Algorithm Based on PageRank and HITS

English title: An improved algorithm for page rank optimization based on PageRank and HITS algorithms
Authors: Ku Shan (库珊); Liu Zhao (刘钊)
Keywords: PageRank algorithm; HITS algorithm; link structure; webpage ranking; algorithm improvement
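Chinese journal title: YEKJ
English journal title: Journal of Wuhan University of Science and Technology
Affiliations: College of Computer Science and Technology, Wuhan University of Science and Technology; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology
Publication date: 2019-03-19 16:01
Publisher: Journal of Wuhan University of Science and Technology
Year: 2019
Issue: v.42; No.185
Funding: National Natural Science Foundation of China (51874217)
Language: Chinese
Page: YEKJ201902013
Page count: 6
CN: 02
ISSN: 42-1608/N
Classification number: 78-83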

Abstract

To address shortcomings of the traditional webpage ranking algorithms PageRank and HITS, such as topic drift and low retrieval efficiency, this paper proposes an improved algorithm, PHIA (PageRank and HITS Improved Algorithm). PHIA inherits the HITS procedure for obtaining the root set and the base set, uses the PageRank values of all web pages in the root set as the initial iteration values of Hub and Authority, and finally obtains the stationary distribution of the page ranking by computing the eigenvector of a stochastic matrix based on a Markov chain. Retrieval results for randomly chosen keywords show that, compared with the traditional PageRank and HITS algorithms, PHIA converges faster and improves the accuracy of webpage ranking to some extent.
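The abstract does not give PHIA's update rules, so the following is only a minimal sketch of the two ingredients it names: PageRank computed as the stationary distribution of a random-surfer Markov chain (by power iteration), and HITS hub/authority iterations seeded with those PageRank values instead of uniform values. The function names `pagerank` and `hits_seeded`, the damping factor 0.85, the tolerances, and the four-page toy graph are all assumptions made for illustration. The sketch also treats the toy graph as the page set already collected from the root set, which is a simplification.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: page -> list of pages it links to.
# Every page must appear as a key, even if it has no out-links.
Graph = Dict[str, List[str]]


def pagerank(graph: Graph, damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 200) -> Dict[str, float]:
    """Power iteration on the random-surfer Markov chain.

    The returned vector approximates the chain's stationary distribution,
    i.e. the principal eigenvector of its stochastic transition matrix.
    """
    nodes = sorted(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in graph[u]:
                new[v] += damping * rank[u] / len(graph[u])
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores so they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores.values())
    if total <= 0:
        return dict(scores)
    return {k: v / total for k, v in scores.items()}


def hits_seeded(graph: Graph, seed: Dict[str, float], tol: float = 1e-10,
                max_iter: int = 200) -> Tuple[Dict[str, float], Dict[str, float], int]:
    """HITS iterations whose hub and authority vectors start from `seed`.

    Returns (hub, authority, number of iterations performed).
    """
    nodes = sorted(graph)
    hub = dict(seed)
    auth = dict(seed)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Authority of v: sum of hub scores of the pages linking to v.
        new_auth = {v: 0.0 for v in nodes}
        for u in nodes:
            for v in graph[u]:
                new_auth[v] += hub[u]
        # Hub of u: sum of authority scores of the pages u links to.
        new_hub = {u: sum(new_auth[v] for v in graph[u]) for u in nodes}
        new_auth = _normalize(new_auth)
        new_hub = _normalize(new_hub)
        delta = sum(abs(new_auth[u] - auth[u]) + abs(new_hub[u] - hub[u])
                    for u in nodes)
        hub, auth = new_hub, new_auth
        if delta < tol:
            break
    return hub, auth, iterations


# Toy graph (assumption): "c" is linked to by three pages, "a" links to two.
toy_graph: Graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
seed_values = pagerank(toy_graph)
hub_scores, auth_scores, n_iter = hits_seeded(toy_graph, seed_values)
print("authority ranking:", sorted(auth_scores, key=auth_scores.get, reverse=True))
print("hub ranking:", sorted(hub_scores, key=hub_scores.get, reverse=True))
print("HITS iterations:", n_iter)
```

Passing a uniform dictionary as `seed` instead of `seed_values` gives plain HITS, which makes it easy to compare iteration counts between the two starting points on a given graph. The sketch makes no claim about how large that difference is for PHIA itself.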

References

[1] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117.
[2] Kleinberg J M. Authoritative sources in a hyperlinked environment[J]. Journal of the ACM, 1999, 46(5): 604-632.
[3] Tang Maojie, Zhao Peng, Wang Yuping. Improved ARC algorithm based on IRR information[J]. China Sciencepaper, 2014, 9(4): 425-428. (in Chinese)
[4] Lempel R, Moran S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect[J]. Computer Networks, 2000, 33: 387-401.
[5] Richardson M, Domingos P. The intelligent surfer: probabilistic combination of link and content information in PageRank[C]//Proceedings of International Conference on Neural Information Processing Systems: Natural and Synthetic. MIT Press, Cambridge, MA, USA, 2001: 1441-1448.
[6] Qi Xiangming, Sun Wenxin. Research on a PageRank algorithm with multi-feature factor fusion[J]. Computer Engineering and Applications, 2017, 53(7): 97-103. (in Chinese)
[7] Chen Jianxia, Huang Ri, Ma Zhongbao. Optimization and implementation of the Lucene ranking algorithm based on PageRank[J]. Computer Engineering & Science, 2012, 34(10): 123-127. (in Chinese)
[8] Yu Jinping, Zhu Guixiang, Mei Hongbiao. Research and improvement of the HITS algorithm based on Web link analysis[J]. Computer Engineering and Applications, 2013, 49(21): 42-45. (in Chinese)
[9] Sehgal U, Kaur K, Kumar P. Notice of violation of IEEE publication principles "the anatomy of a large-scale hyper textual Web search engine"[C]//Proceedings of 2009 Second International Conference on Computer and Electrical Engineering. Dubai, United Arab Emirates, 2009: 11101849.
[10] Lofgren P, Banerjee S, Goel A. Personalized PageRank estimation and search: a bidirectional approach[J]. Computer Science, 2015, arXiv:1507.05999.
[11] Xiang D, Wen X Q, Wang L T. Low-power scan-based built-in self-test based on weighted pseudorandom test pattern generation and reseeding[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017, 25(3): 942-953.
