Malicious Office Document Detection Technology Based on Entropy Time Series

Original title (Chinese): 基于熵时间序列的恶意Office文档检测技术
Authors: ZHOU An-min (周安民); HU Lei (户磊); LIU Lu-ping (刘露平); JIA Peng (贾鹏); LIU Liang (刘亮)
Affiliation: College of Electronics and Information, Sichuan University
Keywords: entropy time series; power spectrum; machine learning; malicious document detection
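Journal code: SDDX
Journal: Journal of Shandong University (Natural Science)
Institution: College of Electronics and Information, Sichuan University
Publication date: 2019-04-03 08:56
Publisher: Journal of Shandong University (Natural Science)
Year: 2019
Volume: 54
Funding: National Key Basic Research and Development Program of China (2017YFB0802900)
Language: Chinese
Record ID: SDDX201905001
Page count: 7
Issue: 05
CN: 37-1389/N
Classification code: 5-11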

Abstract

To detect malicious Office documents (*.docx, *.rtf) more accurately, this paper proposes a detection method based on the entropy time series of a document. The method converts the binary-level differences between malicious and benign documents into differences between the power spectra of their file entropy time series, and then applies three machine learning methods, IBk, Random Committee (RC), and Random Forest (RF), to learn from and classify the data. Experimental results show that the accuracy reaches 92.14% for docx documents, which are based on XML and compression, and 98.20% for Rich Text Format (rtf) files.
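The pipeline named in the abstract (a byte-level entropy time series, then its power spectrum as classifier input) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window and step sizes, the direct-DFT periodogram, and the synthetic input bytes are all assumptions, since the abstract does not say how the series is windowed or how the spectrum is estimated.

```python
import cmath
import math
import random
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte chunk, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chunk).values())


def entropy_series(data: bytes, window: int = 256, step: int = 128) -> List[float]:
    """Slide a window over the raw bytes and record the entropy at each offset.

    The window and step values are illustrative assumptions.
    """
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


def power_spectrum(series: List[float]) -> List[float]:
    """One-sided periodogram of the mean-removed series, via a direct DFT."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Synthetic stand-in for a document: a low-entropy text-like region
# followed by a high-entropy region (as compressed or encrypted data would be).
rng = random.Random(0)
text_like = b"plain text payload " * 200
packed_like = bytes(rng.randrange(256) for _ in range(4096))

series = entropy_series(text_like + packed_like)
features = power_spectrum(series)
```

In a setup like the one the abstract describes, each document would contribute one such spectrum as a feature vector for the classifiers. Files of different sizes yield series of different lengths, so a real pipeline would also need some way to bring the spectra to a common length, which this sketch leaves out.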

References

[1] SMUTZ C, STAVROU A. Malicious PDF detection using metadata and structural features[C]//Computer Security Applications Conference. Florida: ACM, 2012: 239-248.
[2] SCHRECK T, BERGER S, GOBEL J. BISSAM: automatic vulnerability identification of office documents[M]//Detection of Intrusions and Malware, and Vulnerability Assessment. [s.l.]: Springer, 2013: 204-213.
[3] CHANG C C, LIN C J. LIBSVM: a library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 1-27.
[4] NISSIM N, COHEN A, GLEZER C, et al. Detection of malicious PDF files and directions for enhancements: a state-of-the-art survey[J]. Computers and Security, 2015, 49: 246-266.
[5] MOSKOVITCH R, NISSIM N, ELOVICI Y. Malicious code detection using active learning[C]//Privacy, Security, and Trust in KDD. Berlin: Springer, 2009: 74-91.
[6] HERBRICH R, GRAEPEL T, CAMPBELL C. Bayes point machines[J]. Journal of Machine Learning Research, 2001, 1(1): 245-278.
[7] BAYSA D, LOW R M, STAMP M. Structural entropy and metamorphic malware[J]. Journal of Computer Virology and Hacking Techniques, 2013, 9(4): 179-192.
[8] YAN Chenghua, CHENG Jin, FAN Panxing. Research on the structure characteristics of network traffic information based on information entropy[J]. Journal of Information Network Security, 2014(3): 28-31. (in Chinese)
[9] LYDA R, HAMROCK J. Using entropy analysis to find encrypted and packed malware[J]. IEEE Security and Privacy, 2007, 5(2): 40-45.
[10] LIU Rong, LIU Heng. Speech endpoint detection algorithm based on power spectral entropy at low SNR[J]. Computer Engineering and Applications, 2009, 45(33): 122-124. (in Chinese)
[11] MUKHERJEE A. Bit error rate analysis using converged Welch's method for energy detection spectrum sensing in cognitive radio[J]. Journal of Engineering Science and Technology Review, 2016, 9(5): 117-120.
[12] NISSIM N, MOSKOVITCH R, BARAD O, et al. ALDROID: efficient update of Android anti-virus software using designated active learning methods[J]. Knowledge and Information Systems, 2016, 49(3): 1-39.
[13] NISSIM N, COHEN A, ELOVICI Y. Boosting the detection of malicious documents using designated active learning methods[C]//IEEE 14th International Conference on Machine Learning and Applications. USA: IEEE, 2015: 760-765.
