Improved Method of Computer Virus Signature Automatic Extraction Based on N-Gram

Original title: 基于N-Gram的计算机病毒特征码自动提取的改进方法
Authors: YANG Yan (杨燕); JIANG Guo-ping (蒋国平)
Affiliations: School of Computer Science and Technology, Nanjing University of Posts and Telecommunications; School of Automation, Nanjing University of Posts and Telecommunications
Keywords: N-Gram; virus signature; signature concentration; data dictionary
Journal: Computer Science (计算机科学; journal code JSJA)
Publication date: 2017-11-15
Publisher: Computer Science (计算机科学)
Year: 2017
Issue: v.44
Language: Chinese
Page: JSJA2017S2072
Page count: 5
CN: S2
ISSN: 50-1075/TP
Classification number: 348-351+371

Abstract

As computer technology develops and spreads, the damage caused by computer viruses grows increasingly serious. The traditional N-Gram algorithm has difficulty extracting features of different lengths, so effective features are missed; it also produces a very large feature set, which wastes storage space. To address these problems, an improved N-Gram-based method for automatic signature extraction is proposed. Building on the original N-Gram feature extraction algorithm, the method introduces variable-length N-Gram features, extracts effective features of different lengths, and generates variable-length virus signatures. Taking the correlation of feature frequencies into account, it uses signature concentration to perform directed filtering of the N-Gram features and generates a data dictionary, which saves storage space. Experimental results show that, compared with algorithms that use only fixed-length N-Grams, the method effectively reduces the false positive rate of automatic signature extraction.
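The pipeline the abstract describes (variable-length N-Gram extraction, filtering by a concentration score, and a resulting dictionary) might be sketched roughly as follows. This record does not give the paper's definition of signature concentration, so the score used here is an assumption made purely for illustration: the fraction of malicious samples containing an N-Gram minus the fraction of benign samples containing it. The function names, the byte-level granularity, the length range, and the threshold are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(data: bytes, n: int) -> set:
    """Return the distinct byte N-Grams of length n found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def document_frequency(samples, lengths) -> Counter:
    """Count, for each N-Gram, how many samples contain it at least once."""
    counts = Counter()
    for sample in samples:
        grams = set()
        for n in lengths:  # variable-length: collect every requested length
            grams |= ngrams(sample, n)
        counts.update(grams)
    return counts


def build_dictionary(malicious, benign, lengths=(2, 3, 4), threshold=0.5):
    """Keep the N-Grams whose assumed concentration score meets the threshold.

    ASSUMPTION: the score is (share of malicious samples containing the
    N-Gram) minus (share of benign samples containing it). The paper's own
    definition of signature concentration may differ.
    """
    if not malicious or not benign:
        raise ValueError("both sample sets must be non-empty")
    mal_df = document_frequency(malicious, lengths)
    ben_df = document_frequency(benign, lengths)
    dictionary = {}
    for gram, count in mal_df.items():
        score = count / len(malicious) - ben_df.get(gram, 0) / len(benign)
        if score >= threshold:
            dictionary[gram] = score
    return dictionary
```

Storing only the N-Grams that pass the threshold, rather than every N-Gram of every length, is what keeps the dictionary small in this sketch. Whether the paper filters in the same way cannot be confirmed from the abstract alone.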
