Malware Family Classification Based on Text Embedding Feature Representation

Title: Malware family classification based on text embedding feature representation
Original title: 基于文本嵌入特征表示的恶意软件家族分类
Authors: ZHANG Tao (张涛); WANG Jun-Feng (王俊峰)
Affiliation: College of Computer Science, Sichuan University
Keywords: Malware; Classification; Text Embedding; Doc2Vec
Journal code: SCDX
Journal: Journal of Sichuan University (Natural Science Edition)
Publication date: 2019-05-13 15:24
Publisher: Journal of Sichuan University (Natural Science Edition)
Year: 2019
Volume: 56
Issue: 03
Funding: National Key R&D Program of China (2018YFB0804503, 2016QY06X1205); Joint Fund of the Ministry of Education for Equipment Pre-research (6141A02011607, 6141A02033304); Key R&D Program of Sichuan Province (18ZDYF3867, 2017GZDZX0002)
Language: Chinese
Article ID: SCDX201903011
Page count: 9
CN: 51-1595/N
Classification number: 71-79

Abstract

Automation, efficiency, and fine granularity are the main challenges currently facing malware detection and classification. The successful application of deep learning to image processing, speech recognition, and natural language processing has, to some extent, relieved the heavy manpower and time costs of traditional analysis methods. This paper therefore proposes mal2vec, an automatic, efficient, and fine-grained malware analysis method. It treats each malware sample as a text rich in behavioral semantic information, whose content is the API sequence recorded during the sample's dynamic execution, and trains the classical neural probabilistic model Doc2Vec on the resulting text collection. Experimental results show a clear improvement over the classification results of Rieck et al. [1]. In particular, unlike other deep learning methods, this method can extract intermediate results of model training and represent them explicitly. This explicit intermediate representation is interpretable, which makes it possible to analyze the behavior patterns of malware families at a fine-grained level.
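
The "malware as text" framing in the abstract can be illustrated with a minimal sketch. This is not the paper's mal2vec pipeline: a plain API-call count vector stands in for the learned Doc2Vec embedding, and classification is a simple 1-nearest-neighbour lookup by cosine similarity. The sample names, family labels, API traces, and the helper names `embed`, `cosine`, and `classify` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical dynamic-execution traces: each sample is a "document" whose
# "words" are API call names. Names, labels, and calls are invented here.
TRACES = {
    "sample_a": ("family_x", ["CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"]),
    "sample_b": ("family_x", ["CreateFileW", "WriteFile", "WriteFile", "CloseHandle"]),
    "sample_c": ("family_y", ["socket", "connect", "send", "recv", "closesocket"]),
    "sample_d": ("family_y", ["socket", "connect", "send", "send", "closesocket"]),
}


def embed(tokens):
    """Stand-in embedding: sparse API-call counts (not a learned Doc2Vec vector)."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def classify(tokens, labelled):
    """Return the family of the most similar labelled trace (1-nearest neighbour)."""
    query = embed(tokens)
    best = max(labelled.values(), key=lambda item: cosine(query, embed(item[1])))
    return best[0]


predicted = classify(["socket", "connect", "recv", "closesocket"], TRACES)
print(predicted)
```

In a setup closer to the one the abstract describes, `embed` would presumably be replaced by a vector inferred from a Doc2Vec model trained on the full trace collection, while the document construction step would stay the same.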

References

[1] Rieck K, Trinius P, Willems C, et al. Automatic analysis of malware behavior using machine learning [J]. J Comput Secur, 2011, 19: 639.
[2] 肖锦琦, 王俊峰. Malware clustering method based on fuzzy hash feature representation (基于模糊哈希特征表示的恶意软件聚类方法) [J]. Journal of Sichuan University: Natural Science Edition, 2018, 55: 469.
[3] Karampatziakis N, Stokes J W, Thomas A, et al. Using file relationships in malware classification [C]//Proceedings of the 9th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Heraklion, Crete, Greece: Springer, 2013.
[4] Dahl G E, Stokes J W, Deng L, et al. Large-scale malware classification using random projections and neural networks [C]//Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Vancouver, BC, Canada: IEEE, 2013.
[5] Kolter J Z, Maloof M A. Learning to detect and classify malicious executables in the wild [J]. J Mach Learn Res, 2006, 7: 2721.
[6] Schultz M G, Eskin E, Zadok E, et al. Data mining methods for detection of new malicious executables [C]//Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P 2001). Oakland, CA, USA: IEEE, 2001.
[7] Kolter J Z, Maloof M A. Learning to detect malicious executables in the wild [C]//Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, WA, USA: ACM, 2004.
[8] 魏琴芳, 李林乐, 张峰, et al. A serial joint detection method for malicious software links on Android phones (一种安卓系统手机恶意软件链接串行联合检测方法) [J]. Journal of Chongqing University of Posts and Telecommunications: Natural Science Edition, 2017, 29: 6.
[9] Raff E, Barker J, Sylvester J, et al. Malware detection by eating a whole exe [J]. Comput Res Repo, 2017, 2017: 1710.
[10] Nataraj L, Yegneswaran V, Porras P, et al. A comparative assessment of malware classification using binary texture analysis and dynamic analysis [C]//Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence. Chicago, Illinois, USA: ACM, 2011.
[11] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc, 2013.
[12] Le Q, Mikolov T. Distributed representations of sentences and documents [C]//Proceedings of the 31st International Conference on Machine Learning. Beijing, China: JMLR.org, 2014.
[13] 郭文, 王俊峰. Research on a dynamic generic unpacking method for Windows malicious code (Windows恶意代码动态通用脱壳方法研究) [J]. Journal of Sichuan University: Natural Science Edition, 2018, 55: 283.
[14] Campr M, Ježek K. Comparing semantic models for evaluating automatic document summarization [C]//Proceedings of the International Conference on Text, Speech, and Dialogue. [s.l.]: Springer, Cham, 2015.
[15] Liang G, Pang J, Dai C. A behavior-based malware variant classification technique [J]. Int J Inform Educ Tech, 2016, 6: 291.

Cite this article as:
Chinese: 张涛, 王俊峰. 基于文本嵌入特征表示的恶意软件家族分类 [J]. 四川大学学报: 自然科学版, 2019, 56: 441.
English: Zhang T, Wang J F. Malware family classification based on text embedding feature representation [J]. J Sichuan Univ: Nat Sci Ed, 2019, 56: 441.
