一种抑制恶意网页的web权威结点挖掘算法研究

英文题名：The Research of Web Authority Nodes Mining to Restrain the Malicious Web Page
作者：罗江锋
论文级别：硕士
学科专业名称：管理科学与工程
中文关键词：恶意网页 ; Markov链 ; 负权惩罚
英文关键词：malicious web pages ; Markov chain ; negative link weight penalty
学位年度：2008
导师：张维明
学科代码：1201
学位授予单位：国防科学技术大学
论文提交日期：2008-12-01

摘要

随着网络资源的爆炸式增长,web资源的海量性与复杂性使得对web资源的管理变得越来越困难。如今,大量含有非法广告、病毒程序、木马程序的恶意网页已经充斥着web网络,这些恶意网页根据搜索引擎的局限,采取作弊手段,常常在我们的搜索结果中占据较高的排名。目前对恶意网页的处理主要是通过病毒检测软件防止网页中恶意代码的运行,或是在用户通过搜索引擎定位到某个网页时,对恶意网页提示安全警告。这些方法都完全依赖于反病毒软件或网页过滤技术,存在一定的局限性。而基于链接分析的恶意网页抑制方法只是在屏蔽恶意网页的同时,删除恶意网页的所有链接,或对恶意网页的链接进行识别追踪和过滤,没有将恶意网页信息充分的应用到搜索引擎的网页排序算法中。
     文本主要研究存在恶意网页情况下的web权威网页结点的挖掘问题。文中首先介绍了web挖掘的一般理论和权威网页结点挖掘算法的研究现状。针对现有算法的不足,通过合理假设,不但滤除被发现的恶意网页结点,还将恶意网页结点的先验信息应用在web权威结点的排序算法中。一方面,在模型建立时充分考虑恶意网页的影响,建立了一种新的、考虑恶意网页结点影响的web资源随机浏览模型。通过模型对问题的抽象,将web权威网页结点的挖掘问题转化成一个Markov链状态空间平稳状态分布的求解问题,给模型的算法实现打下理论基础;另一方面,在算法实现过程中,提出了一种通过引入负权对指向恶意网页结点链接进行惩罚的web网页结点排序算法,通过惩罚机制来抑制一般网页对恶意网页的链接,达到抑制恶意网页的目的。
     理论分析和实验均表明,链接到恶意网页的行为将受到惩罚,与恶意网页链接越紧密,链接的恶意网页数量越大,其权威值降低越多;而不链接到恶意网页的页面权威值将得到一定的增加。这种奖惩机制,将有效抑制一般网页对恶意网页的链接,从链接分析的角度实现了对恶意网页的有效抑制。此外,本文还对算法进行了改进和推广,明确了算法的应用范围。并且在仿真实验中详细讨论了图的生成模型和仿真数据的生成过程,增加了实验数据和实验结果的可信度。
As the large number and complex structure of the web resource, it is difficult for us to manage the web pages orderly. There are more and more malicious web pages mixed in the web resource. More seriously, as the limitations of the search engine, malicious pages are always returned as authority resource nodes though some illegal ways. Many anti-virus tools have been used to restrain the malicious pages by preventing the running of malicious codes which hide in the pages, or give a safety warning when the user prepare to open it. Those methods make the anti-virus task totally depend on the anti-virus software, or some content-identifying technique. It doesn’t work well. Then, some new methods have been used from the view of linkage analysis. As long as the malicious content is identified, it is common to simply filter out the malicious pages and its linkage. They don’t distinguish the linkage to malicious pages from others during the page’s rank.
     In this paper, we mainly discuss the link-based authority web pages mining under the environment of malicious pages. After the introduction of the status quo of the graph mining and its general theory, and based on some reasonable assumptions, this paper mainly researches on the impact of the malicious web pages on user’s surfing action and present a new surfing action model. The new model take the prior information of malicious pages into account and, more importantly, convert the problem of the authority web pages mining into the solution of a Markov chain’s steady-state distribution. Under the new surfing model, we put forward a new page rank algorithm with negative link weight penalty to restrain the linkage to malicious pages, in which the web pages which link to malicious pages are punished. Subsidiary nodes are introduced to ensure the correctness and effectiveness of the algorithm under different conditions.
     Both theoretic analysis and simulation result show authority values of the nodes linking to malicious ones will be reduced, and the more linkages and linkage weight value are, the more authority value will be reduced. But page’s authority of those pages without links to malicious page will be increased. It effectively restrains the linkage to malicious nodes from the perspective of link analysis. All the simulation dates is generated by a web graph model, which is credible.
     Finally, this paper gives some improved algorithm and takes a statement of conditions under which the algorithm can be used.

引文

[1] SHETTY J, AD IB I J. Discovering important nodes through graph entropy: the case of Enron Email database [C ]. ACM SIGKDD International Conference on Knowledge Discovery and Data mining. Chicago: ACM Press, 2005: 74 281.
    [2]熊金,刘悦,白硕.基于结构的Email挖掘算法EHITS.计算机应用研究.第25卷第4期.2008年4月.
    [3] S. Borgatti. Identifying sets of key players in a network. In Computational, Mathematical and Organizational Theory. 2005.
    [4] Lee C Y. An Algorithm for Path Connections and Its Applications [J]. IEEE Trans. On Electron. Comput.. 1961, EC-10: 346-365.
    [5]杨元法,庄明.网络中最短距离的递归算法.计算机工程.第31卷第13期. 2005年7月.
    [6] Bo Yang, William K. Cheung, and Jiming Liu. community mining from signed social network. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 10, OCTOBER 2007.
    [7] D. Chakrabarti. Graph Mining Laws, Generators, and Algorithms. ACM Computing Surveys, Vol. 38, March 2006.
    [8] Tamas Horvath, Jan Ramon, Stefan Wrobel. Frequent Subgraph Mining in Outerplanar Graphs. ACM SIGKDD. 2006.
    [9] Hanghang Tong Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. Machine Learning Department, School of Computer Science, Carnegie Mellon University. May 2006.
    [10] Ingrid Fischer, Thorsten Meinl. Graph Based Molecular Data Mining - An Overview. Computer Science Department, University of Erlangen-Nuremberg. 2004 IEEE.
    [11] O. Mason, M. Verwoerd. Graph theory and networks in Biology. The Institution of Engineering and Technology 2007.
    [12] MOODY, J. 2001. Race, school integration, and friendship segregation in America. Amer. J. Sociol. 107, 3,679–716.
    [13] FLAKE, G.W., LAWRENCE, S. GILES, C. L.. Efficient identification of Web communities. In Conference of the ACM Special Interest Group on Knowledge Discovery and Data Mining. ACM Press, New York. 2000.
    [14] ERD?OS, P. AND R′ENYI, A. 1960. On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Acadamy of Science 5, 17–61.
    [15] TRAVERS, J. AND MILGRAM, S. 1969. An experimental study of the SmallWorld problem. Sociometry 32, 4, 425–443.
    [16] FALOUTSOS, M., FALOUTSOS, P., AND FALOUTSOS, C. 1999. On power-law relationships of the Internet topology.In Conference of the ACM Special Interest Group on Data Communications (SIGCOMM). ACM Press,New York, NY, 251–262
    [17] ALBERT, R., JEONG, H., AND BARAB′ASI, A.-L. 1999. Diameter of the World-Wide Web. Nature 401, 130–131.
    [18] BRODER, A. Z., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J.2000. Graph structure in the web: experiments and models. In International World Wide Web Conference.ACM Press, New York, NY.
    [19] AIELLO, W., CHUNG, F., AND LU, L. 2000. A random graph model for massive graphs. In ACM Symposium on Theory of Computing. ACM Press, New York, NY, 171–180.
    [20] EVERITT, B. S. 1974. Cluster Analysis. John Wiley, New York, NY.
    [21] GIRVAN, M. AND NEWMAN,M. E. J. 2002. Community structure in social and biological networks. In Proceedings of the National Academy of Sciences. Vol. 99. National Academy of Sciences, Washington DC
    [22] KARYPIS, G. AND KUMAR, V. 1998. Multilevel algorithms for multi-constraint graph partitioning. Tech. Rep. 98-019, University of Minnesota.
    [23] Andrei Broder, Ravi Kumar. Graph structure in the web. http://www9.org/w9cdrom/ 160/160.html. 2003.
    [24] R.Cooley, B. Mobasher, J.Srivastava. Web Mining:Information and Pattern Discovery on the World Wide Web. Department of Computer Science and Engineering University of Minnesota Minneapolis.
    [25] Raymond Kosala, Hendrik Blockeel. Web Mining Research:A Survey.ACM SIGKDD,July 2000.
    [26] J. Cho, H. Garcia-Molina. Synchronizing a database to Improve Freshness. To appear in 2000 ACM International Conference on Management of Data (SIGMOD). May 2000.
    [27] Andrei Broder, Ravi Kumar. Graph structure in the web. http://www9.org/w9cdrom/160/160.html. 2003.
    [28] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for cyber communities, Proc. 8th WWW , Apr 1999.
    [29] J. Carriere, and R. Kazman. WebQuery: Searching and visualizing the Web through connectivity, Proc. 6th , 1997.
    [30] S. Brin, and L. Page. The anatomy of a large scale hypertextual web search engine. Proc. 7th WWW. 1998.
    [31] Albert, Jeong, Barabasi. Diameter of the World Wide Web. Nature 401:130-131, Sep 1999.
    [32] Aristides Gionis. Mining the graph structures of the web. Yahoo! Research.
    [33] Bernardo A. Huberman, Peter L. T. Pirolli. Strong regularities in World Wide Web. SCIENCE VOL. 280 APRIL 1998.
    [34] Http://www.kreny.com/pagerank_cn.htm.
    [35] Wenpu Xing, A.G., Weighted PageRank Algorithm. Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR’04). 2004 IEEE.
    [36]杨彬,康慕宁.基于概念的权重PageRank改进算法.情报杂志,2006.
    [37]吴春,旭郭磊.Web结构挖掘的PageRank算法改进.情报杂志,2005.
    [38] JON M. KLEINBERG. Authoritative sources in a hyperlinked environment. Journal of the ACM, Vol. 46, No. 5, September 1999.
    [39]宋建康,张礼平.Web结构挖掘算法探讨.华东理工大学学报,2003.
    [40]杨炳儒,李岩,陈新中.Web结构挖掘.计算机工程,2003年20期.
    [41]万华,牛军钰.链接信息在Web检索中的应用.计算机工程,2002.
    [42]张娜,张化祥.基于超链接和内容相关度的检索算法.计算机应用,
    [43]黄英铭.Web结构挖掘及HITS算法分析.计算机与现代化,2007年7期.
    [44]仲婷,金浩,冯茜芦,潘金贵.一种基于结构分析的改进HITS算法.广西师范大学学报,2005.
    [45] R. Lempel, S. Moran .The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. Department of Computer Science. 2000.
    [46]何晓阳,吴治蓉.SALSA算法技术剖析.情报技术,2004.
    [47] Dell Zhang, Yisheng Dong, An Efficient Algorithm to Rank Web Resources, in Computer Networks 2000. p. 449-455.
    [48] G. Salton, Automatic Text Processing,Addison-Wesley, Boston, 1989.
    [49]田铮,秦超英,随机过程与应用. 2007:科学出版社.
    [50] Alberto Medina, Anukool Lakhina, Ibrahim Matta, John Byers. BRITE: Universal Topology Generation from a User’s Perspective. Computer Science Department Boston University. April12, 2001.
    [51] Ravi Kuma.“Extracting large-scale knowledge bases from the web”in the VLDB journal. 1999, pp.639-650. [Online]. Available: citeseer.ist.psu.edu / kumar99extracting.html.
    [52] Jiangfeng Luo, Cheng Zhu, Weiming Zhang, Zhong Liu, Jincai Huang. Restrain the Linkage to Malicious Web Pages though Negative Link Weight. 2008 IEEE International Symposium on Kmowledge Acquisition.
    [53] AMARAL, L. A. N., SCALA, A., BARTH′EL′EMY, M., AND STANLEY, H. E.2000. Classes of small-world networks. Proceedings of the National Academy of Sciences 97, 21, 11149–11152.
    [54]检测和过滤网络搜索引擎返回结果中包含的指向恶意网页的链接的方法.王凤仙.专利:200510086981.7. 2007年5月.
    [55]恶意网页的剖析与对策.张福增,赵永升,孔繁芸,宋丽华.福建电脑. 2004年第7期.
    [56]基于链接的网络木马追踪技术.陶然,李志勇,王越,张昊,杜华.专利:20061015233.7. 2006年9月.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700