Abstract
To detect malicious Office documents (*.docx, *.rtf) more accurately, a detection method based on the document entropy time series is proposed. The method converts the differences between the binary content of malicious and benign documents into differences between the power spectra of their file entropy time series, and then applies three machine learning methods, IBK, random committee (RC), and random forest (RF), to learn from and classify the data. Experimental results show that the detection accuracy reaches 92.14% for docx documents, which are based on compressed XML, and 98.20% for rich text format (rtf) files.
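The feature extraction step summarized in the abstract (file bytes, then an entropy time series, then a power spectrum) can be illustrated with a minimal sketch. This is not the authors' implementation: the 256-byte non-overlapping window, the plain periodogram computed with a naive DFT, and the function names `shannon_entropy`, `entropy_series`, and `power_spectrum` are all assumptions made for illustration, since the abstract does not specify the window size or the spectral estimator.

```python
import cmath
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total) for n in Counter(block).values())


def entropy_series(data: bytes, window: int = 256) -> list:
    """Entropy of consecutive non-overlapping windows, read as a time series.

    The window size is an assumed value; a trailing partial window is kept.
    """
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


def power_spectrum(series: list) -> list:
    """Periodogram of the mean-removed series via a naive O(n^2) DFT.

    Returns the power at frequency bins 0 .. n // 2.
    """
    n = len(series)
    if n == 0:
        return []
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


if __name__ == "__main__":
    # Synthetic input: two constant (low-entropy) windows and one uniform window.
    sample = b"\x00" * 512 + bytes(range(256))
    series = entropy_series(sample)
    print(series)
    print(power_spectrum(series))
```

In a full pipeline, the resulting spectrum values would serve as the feature vector passed to a classifier; for long files, a fast Fourier transform or an averaged estimator would replace the naive DFT used here for self-containment.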