Research on Malware Detection and Classification Techniques

English title: Research into the Detection and Classification of Malware
Author: Zhao Hengli (赵恒立)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Malicious code; Combination feature; PE file parsing; Dynamic behavior analysis; Windows debugger; Windows API; Support vector machine; Feature attribute weighting; Malware classification
Degree year: 2009
Advisor: Zheng Ning (郑宁)
Discipline code: 081203
Degree-granting institution: Hangzhou Dianzi University (杭州电子科技大学)
Submission date: 2009-12-01

Abstract

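The explosive growth of malicious code and its use of metamorphic and polymorphic techniques mean that traditional signature-based detection can no longer meet new security requirements. Starting from the practical needs of anti-virus work, this thesis proposes a method for automatically detecting and classifying malicious code. An automated malware integrated analysis system (AMIAS) extracts combined static and dynamic behavior features, and a support vector machine (SVM) is then used to build a two-class classifier that detects malicious samples. AMIAS also generates a behavior analysis report for each sample; by parsing the behavior analysis reports of malicious code in a known virus database, the behavior patterns of virus families are extracted, and an SVM is then used to build a multi-class malware classification model. The proposed detection method overcomes the shortcomings of purely static or purely dynamic detection and enables rapid detection of massive numbers of samples. The classification method assigns each sample to its malware family according to its behavior and can guide subsequent malware removal.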
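     This thesis studies the following four aspects. First, it proposes a method for defining combined dynamic and static features for malware detection. By studying the static and dynamic behavior information of malicious code, a 55-dimensional combined detection feature vector is defined. Its 20 static features are obtained by analyzing differences in PE structure information between malicious and benign code. Dynamic behavior analysis can identify unknown malicious code; based on extensive study of the Win32 API calls made by malicious code, a 35-dimensional dynamic behavior feature vector is defined. Each dimension represents one dynamic behavior event, and each event is derived from the corresponding Win32 API functions and their call parameters.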
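As a rough illustration of how such a combined vector could be assembled, the Python sketch below concatenates 20 static values with 35 binary behavior-event flags derived from an API call log. The three event rules, the `(api_name, argument)` log format, and all function names are hypothetical stand-ins; the thesis's actual 35 events and 20 PE features are not listed in this abstract.

```python
from typing import Dict, List, Sequence, Tuple

N_STATIC = 20   # number of PE-structure features stated in the abstract
N_DYNAMIC = 35  # number of behavior-event features stated in the abstract

# Hypothetical rules: (API name, substring expected in the argument) -> event index.
# Only three illustrative events are listed; the real definition has 35.
EVENT_RULES: Dict[Tuple[str, str], int] = {
    ("RegSetValueExA", "\\currentversion\\run"): 0,  # autostart registry write
    ("CreateFileA", "\\system32\\"): 1,              # file written to system directory
    ("CreateRemoteThread", ""): 2,                   # thread created in another process
}


def dynamic_features(api_calls: Sequence[Tuple[str, str]]) -> List[int]:
    """Set one flag per behavior event observed in the API call log."""
    vec = [0] * N_DYNAMIC
    for name, arg in api_calls:
        for (api, needle), idx in EVENT_RULES.items():
            # An empty needle matches any argument.
            if name == api and needle in arg.lower():
                vec[idx] = 1
    return vec


def combined_features(static: Sequence[float],
                      api_calls: Sequence[Tuple[str, str]]) -> List[float]:
    """Concatenate static and dynamic parts into one 55-dimensional vector."""
    if len(static) != N_STATIC:
        raise ValueError("expected %d static features" % N_STATIC)
    return list(static) + dynamic_features(api_calls)
```

Keeping each event as a rule over an API name and its arguments mirrors the abstract's statement that events are derived from Win32 API calls and their parameters; richer rules (call sequences, several APIs per event) would fit the same structure.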
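     Second, an automated malware integrated analysis system (AMIAS) is implemented on top of virtual machine control technology. AMIAS has two main functions: it extracts the feature values corresponding to the items in the combined feature definition, and it generates a behavior analysis report for every analyzed sample. AMIAS is an automated online processing system that can meet the demand of anti-malware work for analyzing massive numbers of samples.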
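One way to picture a single automated analysis round is as a fixed sequence of virtual machine control steps. The sketch below only builds command lines for VMware's `vmrun` utility and does not execute them; the use of `vmrun`, the guest paths, the `analyzer.exe` monitor, the guest credentials, and the snapshot name are all assumptions made for illustration, not details taken from AMIAS.

```python
from typing import List


def analysis_commands(vmx: str, sample: str, report_out: str,
                      user: str = "analyst", password: str = "secret",
                      snapshot: str = "clean") -> List[List[str]]:
    """Return the vmrun command lines for one analysis round (not executed)."""
    base = ["vmrun", "-T", "ws"]
    auth = base + ["-gu", user, "-gp", password]  # guest login for in-guest operations
    # Hypothetical locations inside the guest.
    guest_sample = "C:\\analysis\\sample.exe"
    guest_report = "C:\\analysis\\report.txt"
    guest_monitor = "C:\\analysis\\analyzer.exe"
    return [
        base + ["revertToSnapshot", vmx, snapshot],                    # clean state
        base + ["start", vmx, "nogui"],                                # boot the guest
        auth + ["copyFileFromHostToGuest", vmx, sample, guest_sample], # inject sample
        auth + ["runProgramInGuest", vmx, guest_monitor,
                guest_sample, guest_report],                           # run under monitor
        auth + ["copyFileFromGuestToHost", vmx, guest_report, report_out],
        base + ["stop", vmx, "hard"],                                  # power off
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order; reverting to a clean snapshot before every sample is what lets the loop run unattended over a large sample set.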
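     Third, a new SVM-based malware detection method is proposed: on the basis of the combined feature definition, a two-class SVM classifier is built for malware detection. The experimental dataset contains 9917 malicious samples and 6591 benign samples. In the initial experiment, different training sets are built according to the different sources of the data and used to train SVM classifiers. Based on statistics of the number of effective features in the samples misclassified in the initial experiment, the experiment is improved by setting a threshold on the number of effective features; the improved results show that with a threshold of 6 both detection performance and sample utilization are high. A comparative experiment is also designed to verify the effectiveness of using the combined feature definition together with the SVM model; the comparison shows that the detection rate improves considerably at the cost of a small increase in the false positive rate. For detection errors on gray samples, a method for quantifying the importance of feature attributes is introduced; weighting the feature attribute values effectively reduces the detection error on gray samples.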
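The preprocessing steps around the classifier can be sketched in a few lines of Python. The sketch assumes that an "effective feature" is a non-zero entry of the feature vector and that samples at or above the threshold are kept; both are interpretations, since the abstract does not define them. The weights are taken as given, because the abstract does not say how they are computed, and the SVM training itself is not shown.

```python
from typing import List, Sequence, Tuple


def effective_count(x: Sequence[float]) -> int:
    """Number of non-zero features (assumed meaning of 'effective feature')."""
    return sum(1 for v in x if v != 0)


def filter_by_threshold(samples: Sequence[Sequence[float]], labels: Sequence[int],
                        threshold: int = 6
                        ) -> Tuple[List[Sequence[float]], List[int], float]:
    """Keep samples with at least `threshold` effective features.

    Also returns the sample utilization: kept samples / all samples.
    """
    kept = [(x, y) for x, y in zip(samples, labels)
            if effective_count(x) >= threshold]
    utilization = len(kept) / len(samples) if samples else 0.0
    return [x for x, _ in kept], [y for _, y in kept], utilization


def apply_weights(x: Sequence[float], weights: Sequence[float]) -> List[float]:
    """New feature value = feature weight * original value."""
    if len(x) != len(weights):
        raise ValueError("feature and weight vectors differ in length")
    return [v * w for v, w in zip(x, weights)]


def detection_and_false_positive_rate(y_true: Sequence[int],
                                      y_pred: Sequence[int]) -> Tuple[float, float]:
    """Label 1 = malicious, 0 = benign. Returns (detection rate, false positive rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    detection = tp / (tp + fn) if tp + fn else 0.0
    false_positive = fp / (fp + tn) if fp + tn else 0.0
    return detection, false_positive
```

Raising the threshold discards more sparse samples, so utilization falls as the threshold grows; the abstract's reported choice of 6 reflects that trade-off between detection performance and utilization.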
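     Fourth, the behavior-based malware classification method is improved: malware is classified indirectly by classifying its behavior analysis reports. Based on a definition of malware behavior information units, feature words are extracted from the behavior analysis reports and clustered as a preprocessing step; a mapping function then maps each report to a vector in the feature-word space, and finally a multi-class SVM classifier is trained to classify malware automatically. Experiments show that feature extraction based on behavior information units effectively improves the accuracy and efficiency of automated malware classification.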
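The report-to-vector step can be illustrated with a small Python sketch. It assumes a hypothetical report format with one behavior unit per line, written as an operation followed by its target, and it uses a very simple stand-in for the clustering step (collapsing digit runs so that near-identical targets share one feature word). The thesis's actual unit definition and clustering method are not given in this abstract, and the multi-class SVM is not shown.

```python
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def extract_units(report: str) -> List[Tuple[str, str]]:
    """Parse 'operation target' lines into lower-cased (operation, target) units."""
    units = []
    for line in report.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            units.append((parts[0].lower(), parts[1].lower()))
    return units


def cluster_key(op: str, target: str) -> str:
    """Stand-in for clustering: digit runs are collapsed to '#'."""
    normalized = re.sub(r"\d+", "#", target)
    return op + ":" + normalized


def build_vocabulary(reports: Sequence[str]) -> Dict[str, int]:
    """Assign one vector index to each clustered feature word."""
    vocab: Dict[str, int] = {}
    for report in reports:
        for op, target in extract_units(report):
            vocab.setdefault(cluster_key(op, target), len(vocab))
    return vocab


def to_vector(report: str, vocab: Dict[str, int]) -> List[int]:
    """Mapping function: report -> count vector over the feature-word vocabulary."""
    counts = Counter(cluster_key(op, t) for op, t in extract_units(report))
    vec = [0] * len(vocab)
    for key, count in counts.items():
        if key in vocab:  # words unseen during vocabulary building are ignored
            vec[vocab[key]] = count
    return vec
```

Clustering before vectorizing shrinks the vocabulary, which is one plausible reason the abstract reports gains in both accuracy and efficiency: variants that differ only in incidental names end up with the same vector.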
With the explosive growth of malware, which often uses polymorphic and metamorphic techniques, traditional signature-based detection methods can no longer meet security requirements. From the perspective of practical anti-virus needs, this thesis proposes an automated method for detecting and classifying malicious code. The automated malware integrated analysis system (AMIAS) extracts static and dynamic behavior features, and a support vector machine (SVM) is then used to detect malware. AMIAS also generates a behavior analysis report for each sample. We learn the behavior patterns of each malware family in a known malware database and build a multi-class SVM classifier to classify newly detected malicious samples. Our method overcomes the shortcomings of purely static or purely dynamic detection and can rapidly process massive numbers of malware samples. The classification results can guide the removal of malware.
     This thesis focuses on four aspects. First, we propose a definition of combined static and dynamic behavior features. By studying the static and dynamic behavior of known malware, we define a 55-dimensional combined feature vector. The 20 static features are extracted from differences in PE file structure between benign and malicious code. Dynamic behavior analysis can detect unknown malicious code, so behavior features form the main body of the combined feature vector. Based on extensive study of how malware uses the Win32 API, we define 35 behavior features. Each feature represents one kind of dynamic behavior event, and every event is derived from the corresponding Win32 API function calls and their parameters.
     Second, we implement the automated malware integrated analysis system (AMIAS) using virtual machine control technology. AMIAS has two functions: it extracts the value of each feature defined in the feature space, and it generates a behavior analysis report for each sample. AMIAS is an automated online processing system that addresses the need to analyze massive numbers of malware samples.
     Third, we propose a new malware detection method based on SVM. Using the combined feature definition, we construct an SVM model for malware detection. The experimental dataset contains 9917 malware samples and 6591 benign samples. In the initial experiment, we build different training sets according to the sources of the data and use them to train SVM classifiers. Based on statistics of the number of effective features in the misclassified samples of the initial experiment, we improve the experiment by setting a threshold on that number; the results show that with a threshold of 6, both the detection rate and the sample utilization are high. We also design comparative experiments to verify the effect of using the combined features together with the SVM model, and the results show that the joint method performs better. For gray samples, we improve the model by introducing a method that quantifies feature importance: each new feature value is the product of the feature's weight and its original value. Experiments show that the improved model performs better on gray samples.
     Fourth, we improve the classification of malware behavior reports and accomplish malware classification indirectly through report classification. Based on malware behavior units, we extract feature words from each behavior report, define a mapping function that maps the report to a vector, and finally train a multi-class SVM classifier to classify malware automatically. Compared with similar methods, experimental results show that our method effectively improves the accuracy and efficiency of malware classification.

