Malware clustering based on graph convolutional networks

Authors: LIU Kai; FANG Yong; ZHANG Lei; ZUO Zheng; LIU Liang
Affiliations: College of Electronics and Information Engineering, Sichuan University; College of Cybersecurity, Sichuan University
Keywords: malicious code; graph convolutional network (GCN); clustering; API call graph; convolutional neural network (CNN)
Journal abbreviation: SCDX
Journal: Journal of Sichuan University (Natural Science Edition)
Publication date: 2019-07-08 10:21
Publisher: Journal of Sichuan University (Natural Science Edition)
Year: 2019
Volume: 56
Funding: National Key Research and Development Program of China (2017YFB0802904)
Language: Chinese
Page: SCDX201904012
Page count: 7
Issue: 04
CN: 51-1595/N
Classification code: 80-86

Abstract

Many new malware samples are produced by attackers modifying existing malware, so analyzing the family homology of malware helps in studying its evolutionary trends and tracing its origin. Starting from the API call graphs of malware and using graph convolutional networks (GCN), this paper designs a model for malware similarity calculation and family clustering. First, a disassembly tool is used to extract the API calls of each malware sample, and the API functions are annotated with attributes. Then, key API functions are selected according to their contribution to malware families, and an API call graph is constructed for each sample. A GCN combined with a convolutional neural network (CNN) serves as the similarity model, taking API call graphs as input and computing the similarity between malware samples. Finally, the DBSCAN clustering algorithm groups the samples into families. Experimental results show that the proposed method achieves a clustering accuracy of 87.3% and can effectively cluster malware by family.
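The final clustering step can be illustrated with a minimal sketch of DBSCAN run on a precomputed pairwise similarity matrix. This is not the authors' implementation: the conversion of similarity to distance as `1 - similarity`, the function name `dbscan_from_similarity`, the parameter values `eps` and `min_pts`, and the toy similarity matrix are all assumptions made for illustration. In the paper the similarities would come from the GCN and CNN model.

```python
from collections import deque

NOISE = -1  # label for samples that belong to no cluster


def dbscan_from_similarity(sim, eps, min_pts):
    """Run DBSCAN on a symmetric similarity matrix with values in [0, 1].

    Distance is taken as 1 - similarity (an assumption of this sketch).
    Returns one label per sample; NOISE marks unclustered samples.
    """
    n = len(sim)

    def neighbors(i):
        # All samples (including i itself) within distance eps of sample i.
        return [j for j in range(n) if 1.0 - sim[i][j] <= eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return labels


# Hypothetical similarity scores for five samples: the first three are
# mutually similar, the last two resemble nothing else.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.3],
    [0.2, 0.1, 0.1, 0.3, 1.0],
]
labels = dbscan_from_similarity(sim, eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, -1, -1]
```

With these assumed values the first three samples form one family and the remaining two are reported as noise. A density-based method suits this setting because the number of malware families does not need to be known in advance.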
