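Classifier Design for Webpage Trojan Detection Based on DOM Modeling

English title: Classification for Webpage Trojan Detection Based on DOM Modeling
Author: Fan Yu (范宇)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords (translated from Chinese): webpage trojan detection; decision tree classifier; WIM-DOM model; sequential feature
English keywords: webpage-trojan detection; decision tree; WIM-DOM; sequential feature
Degree year: 2010
Advisor: Ji Liping (季丽萍)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2010-12-01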

Abstract

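As Internet applications have become widespread, web pages have become one of the primary means by which people obtain and publish information. While providing information, a large number of websites also expose users to considerable security risks. Statistics show that trojans have replaced viruses as the dominant threat today, and more than 90% of trojan infections are spread through web pages. How to effectively prevent the spread of webpage trojans, so that users are not infected by malicious code while using web-based applications, has therefore become a pressing problem.
     Unlike traditional trojans, webpage trojans spread faster and more widely and pose a greater threat. In addition, webpage trojans are written in script code, which is easier to encode and encrypt, so they have more variants. Detection models built for traditional trojans are therefore not suitable for webpage trojan detection. Yet existing webpage trojan detection techniques still rely mainly on static feature matching, which is widely used for traditional trojan detection. This approach lags in handling unknown samples, and its matching efficiency keeps falling as the trojan signature database grows rapidly. We therefore need an efficient detection technique that targets the web page itself and blocks the threat before the trojan intrudes on the host through the web page.
     To address these problems, this work first analyzes in depth the attack principles of webpage trojans and summarizes the characteristics of webpage trojan attacks and the difficulties of detecting them. It then studies the DOM structure of web pages and the process by which browsers parse them, and on this basis proposes WIM-DOM, a webpage code inspection model based on the DOM structure. Finally, building on WIM-DOM modeling, it constructs decision-tree-based webpage trojan classifiers. The main research work and contributions of this thesis are as follows: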
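     (1) Studies the attack principles of webpage trojans and the DOM structure and parsing principles of web pages, and summarizes the characteristics of webpage trojan attacks and how they manifest in the attributes and structure of DOM elements.
     (2) Proposes WIM-DOM, a webpage code inspection model based on the DOM structure. Targeting the concealment and locality of webpage trojan attacks, the model uses the DOM structure to map a web page's source file into a sequence of DOM elements. The model both strengthens the attribute features of DOM elements and preserves the hierarchical structure among elements, which helps local features play a role in webpage trojan detection and lays the foundation for classifier design.
     (3) Building on WIM-DOM modeling, designs two decision-tree-based webpage trojan classifiers. The classifier WIM-DOM(I) is the first to propose using the attribute information of DOM elements as classification features. WIM-DOM(II) is the first to extract sequential features of DOM elements based on the sequential patterns of webpage trojan attacks, in order to raise the detection rate for webpage trojans with multi-step attack behavior, and it uses statistical information to reduce the effect of differences among web pages themselves on classification.
     (4) Designs classification experiments that verify the advantages of the WIM-DOM classifiers in terms of both accuracy and efficiency.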
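
The mapping in item (2), from page source to a sequence of DOM elements that keeps attributes and hierarchy, might look roughly like the sketch below. It is not the thesis's WIM-DOM implementation, which this record does not describe. The names `DomSequenceBuilder` and `dom_sequence`, the `(depth, tag, attrs)` record layout, and the use of Python's standard `html.parser` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomSequenceBuilder(HTMLParser):
    """Hypothetical helper: flattens HTML into (depth, tag, attrs) records
    in document order, so attributes and nesting level are both kept."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append((self.depth, tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record it, depth unchanged.
        self.sequence.append((self.depth, tag, dict(attrs)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def dom_sequence(html):
    """Return the list of (depth, tag, attrs) records for an HTML string."""
    builder = DomSequenceBuilder()
    builder.feed(html)
    builder.close()
    return builder.sequence


if __name__ == "__main__":
    page = ('<html><body><iframe src="http://example.invalid/x" width="0" '
            'height="0"></iframe><br><p>hi</p></body></html>')
    for record in dom_sequence(page):
        print(record)
```

Keeping the depth next to each element is one simple way to retain hierarchy in a flat sequence. A zero-sized `iframe`, as in the sample page, is the kind of local attribute evidence such a record would expose to a classifier.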

Abstract (author's English version)

With the rapid development of Internet applications, websites have gradually become one of the most important ways to access and publish information. At the same time, websites have brought users many new security risks. According to statistics, trojan horses have replaced viruses as the main threat, and more than 90% of them are propagated through webpages. Therefore, research on how to protect users from webpage trojans in web-based applications is attracting more and more attention.
     Unlike traditional trojans, webpage trojans spread faster and more widely and pose a more serious threat. Moreover, webpage trojans are written in script code, which is easier to encode and encrypt. As such, traditional detection models cannot be adapted for webpage trojan detection. However, webpage trojan detection today still relies mainly on static feature matching borrowed from traditional trojan horse detection. It is unresponsive to unknown samples, and its efficiency declines seriously as the feature databases grow. Other research has proposed detecting trojans by monitoring the dynamic behavior of the host. Unfortunately, such detection takes place after infection. Therefore, novel webpage trojan detection methods that target the interpretation of the webpage are urgently needed. Such methods could detect threats before the trojans infect the local host through webpages.
     To overcome the above problems, we first made a detailed survey of the principles of webpage trojan attacks and the DOM structure of webpages, based on which we proposed a novel webpage inspection model based on the DOM structure (WIM-DOM). We then designed a decision-tree-based classifier that takes the WIM-DOM model as its input. Compared with previous work, we make the following contributions:
     First, we propose a novel webpage inspection model based on the DOM structure, called WIM-DOM. The model uses the inherent DOM structure to map the source document into a sequence of DOM elements, which reflects two characteristics of webpage trojan attacks: concealment and locality. The model enhances the attributes of DOM elements and preserves the hierarchy among neighboring nodes as well. As such, local features can contribute more to the classification than in other methods.
     Second, we design a classifier based on WIM-DOM for webpage trojan detection. In the classifier, we propose using the attributes of DOM elements as the main classification features, together with some statistics that reduce the influence of the diversity of webpages. In addition, we are the first to use sequential patterns of DOM elements for webpage trojan detection, which proves effective in improving detection performance on malicious samples with multi-step attack behavior.
     Finally, we designed several comparative experiments for the WIM-DOM classification from two aspects: accuracy and efficiency.
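
The sequential features and normalizing statistics mentioned above can be pictured with the sketch below. It is an assumption-laden stand-in: the record does not say how the thesis extracts sequential patterns, so plain adjacent-pair (bigram) counting is used here instead. The function name `bigram_features` is hypothetical, and dividing by the number of pairs is just one possible way to reduce the effect of page length.

```python
from collections import Counter


def bigram_features(tags):
    """Relative frequency of each adjacent tag pair in a tag sequence.

    Assumption: adjacent pairs stand in for the thesis's sequential
    patterns, which are not specified in this record.
    """
    pairs = list(zip(tags, tags[1:]))
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    # Dividing by the pair count makes long and short pages comparable.
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    sample = ["script", "iframe", "script", "iframe"]
    for pair, freq in sorted(bigram_features(sample).items()):
        print(pair, round(freq, 3))
```

A feature vector built this way could be passed to any decision tree learner. The choice of learner and of pattern length is left open here because the record does not specify them.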

