Analysis and Improvement of the HITS Algorithm in Web Link Structure Mining
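Abstract
In recent years, the rapid spread and development of Internet/Web technology has given people a wealth of information resources. At the same time, the Web's massive data volume, complexity, highly dynamic nature, and diverse user base make those resources considerably harder to discover. Combining data mining techniques with the Web, that is, Web data mining, has therefore become an important way to address this problem.
     With traditional information retrieval technology already mature, one can start from the characteristics of Web data itself, fully mine the Web's enormous hyperlink resources, search through hyperlinks, and build an effective Web information retrieval model to find the information needed. However, traditional hyperlink-based page ranking algorithms discover authoritative pages purely through link analysis (that is, Web structure mining) and do not consider the actual content of pages. They therefore suffer from the so-called "topic drift" problem: the results often include pages that are densely interlinked but whose content deviates from the query topic.
     This thesis studies HITS, the classic Web structure mining algorithm. HITS considers only the hyperlinks between Web pages and ignores page content, which leads to shortcomings such as "topic drift" and multiple-reinforcement relationships between topics. To address these, the thesis proposes G-HITS, an improved HITS algorithm that combines hyperlink analysis with content-relevance analysis. G-HITS analyzes the content of different Web pages and assigns different weights to the links between them, which mitigates the shortcomings of HITS to some extent and finds authoritative pages more effectively. Finally, experiments demonstrate the effectiveness of G-HITS.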
Recently, the rapid popularization and development of Internet and Web technology has supplied people with abundant information. However, the Web's huge volume of data, its complexity, its highly dynamic nature, and the diversity of its users have made Web resources difficult to exploit. Locating valuable information on the Web has therefore become an important issue in Web data mining. With traditional information retrieval methods already mature, we mine the Web's huge link resources according to their characteristics, search through hyperlinks, and build a Web information retrieval model to find the information we need.
     The current method of locating authoritative web pages is based on hyperlink ranking algorithms. However, such methods may cause the topic drift problem, in which the results often have high link density but are irrelevant to the search topic.
     By studying the classical Web structure mining algorithm HITS, and considering that HITS analyzes only the hyperlinks among web pages and ignores their content, which results in topic drift, we propose an improved HITS algorithm, G-HITS, that combines hyperlink analysis and content analysis. The new algorithm improves on HITS by analyzing the content of web pages and giving hyperlinks different weights. Experiments show that the new algorithm is effective.
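     The idea of weighting links by content can be illustrated with a small sketch. The abstract does not give the G-HITS weighting formula, so the code below is only an assumption-laden illustration of weighted HITS in general: the function names, the toy graph, the `relevance` scores, and the choice of multiplying the relevance of the two linked pages to get a link weight are all hypothetical and are not taken from the thesis.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    return {n: (s / norm if norm else s) for n, s in scores.items()}


def weighted_hits(edges, weight, iterations=20):
    """Run HITS where each link (src, dst) contributes weight[(src, dst)]."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += weight[(src, dst)] * hub[src]
        auth = normalize(auth)
        # Hub update: weighted sum of authority scores of pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += weight[(src, dst)] * auth[dst]
        hub = normalize(hub)
    return auth, hub


# Hypothetical toy graph: "a1" is on-topic with two in-links, while "x" is
# off-topic but sits in a denser cluster with three in-links.
edges = [("h1", "a1"), ("h2", "a1"), ("s1", "x"), ("s2", "x"), ("s3", "x")]

# Hypothetical content relevance of each page to the query, in [0, 1].
relevance = {"h1": 0.9, "h2": 0.9, "a1": 0.9,
             "s1": 0.1, "s2": 0.1, "s3": 0.1, "x": 0.1}

# Plain HITS treats every link equally.
uniform = {edge: 1.0 for edge in edges}
# Assumed weighting scheme (not from the thesis): product of page relevances.
by_content = {(u, v): relevance[u] * relevance[v] for u, v in edges}

plain_auth, _ = weighted_hits(edges, uniform)
content_auth, _ = weighted_hits(edges, by_content)

top_plain = max(plain_auth, key=plain_auth.get)
top_content = max(content_auth, key=content_auth.get)
```

     In this toy graph, uniform weights make the densely linked off-topic page `x` the top authority, which is the topic drift effect described above, while the content-based weights make the on-topic page `a1` the top authority.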
