Research on a New Algorithm for Web Structure Mining

Original title: 一种新的Web结构挖掘算法的研究
Author: Liu Wangfeng (刘王峰)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Web Structure Mining ; PageRank ; HITS ; Time Weight ; ANWSMA
Degree year: 2010
Advisor: Zheng Youcai (郑有才)
Discipline code: 081202
Degree-granting institution: Xidian University
Thesis submission date: 2010-01-01

Abstract

Web data mining combines data mining techniques with research on Internet applications, and it has become a major research direction within data mining. Web structure mining is an important branch of Web data mining; its classical algorithms are HITS and PageRank. Both algorithms have achieved a measure of success, but both also have shortcomings, such as topic drift.
     Building on an in-depth study and analysis of the classical Web structure mining algorithms HITS and PageRank, and targeting some of their shortcomings, this thesis proposes a new algorithm, ANWSMA, that integrates hyperlinks, hyperlink weights, and time weights. The algorithm first obtains a directed graph using the base-set construction idea from HITS. It then replaces the damping factor of PageRank with a time weight and, at the same time, assigns different hyperlink weights according to the differing importance of the linked pages. With these it computes a rank value for each page, and finally it sorts the pages and outputs the result.
     Finally, testing and analysis, including simulation experiments and comparison with the classical algorithms, verify that ANWSMA is reasonable and effective.
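
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's ANWSMA implementation: the abstract gives no formulas, so every detail below is an assumption made for illustration. The function names `build_base_set` and `rank_pages` are hypothetical, the time weight of each page is taken as a given input in (0, 1) rather than derived from page dates, and the hyperlink weight is assumed to be proportional to the in-degree of the target page, an idea borrowed from weighted PageRank variants.

```python
def build_base_set(root_set, out_links, max_in_links=50):
    """HITS-style expansion: the root set, every page it links to, and up to
    `max_in_links` pages that link to each root page."""
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, ()))
        pointing_in = sorted(u for u, targets in out_links.items() if page in targets)
        base.update(pointing_in[:max_in_links])
    return base


def rank_pages(pages, out_links, time_weight, iterations=50):
    """PageRank-style iteration on the subgraph induced by `pages`.

    Assumption: time_weight[p], a value in (0, 1), is used where classical
    PageRank uses its single damping factor.
    Assumption: the weight of link u -> v is the in-degree of v divided by
    the total in-degree of all pages that u links to.
    """
    pages = sorted(pages)
    page_set = set(pages)
    n = len(pages)
    # Keep only links that stay inside the subgraph; drop self-links and duplicates.
    links = {u: sorted({v for v in out_links.get(u, ()) if v in page_set and v != u})
             for u in pages}
    in_degree = {p: 0 for p in pages}
    for u in pages:
        for v in links[u]:
            in_degree[v] += 1
    link_weight = {}
    for u in pages:
        total = sum(in_degree[v] for v in links[u])
        for v in links[u]:
            link_weight[(u, v)] = in_degree[v] / total
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - time_weight[p]) / n for p in pages}
        for u in pages:
            for v in links[u]:
                new_rank[v] += time_weight[v] * rank[u] * link_weight[(u, v)]
        # Renormalize so the rank values sum to 1.
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical toy graph: page -> pages it links to.
demo_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
demo_base = build_base_set({"a"}, demo_links)
demo_time = {p: 0.85 for p in demo_base}  # assumed time weights
demo_rank = rank_pages(demo_base, demo_links, demo_time)
for page in sorted(demo_rank, key=demo_rank.get, reverse=True):
    print(page, round(demo_rank[page], 4))
```

When every page has the same time weight, the iteration reduces to a weighted PageRank with that value as the damping factor. How the thesis actually derives time weights and hyperlink weights is not stated in the abstract.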

