Journal of Shandong University (Natural Science) ›› 2019, Vol. 54 ›› Issue (5): 1-7. doi: 10.6040/j.issn.1671-9352.2.2018.072

Malicious Office document detection technology based on entropy time series

An-min ZHOU, Lei HU, Lu-ping LIU*, Peng JIA, Liang LIU

  1. College of Electronics and Information, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2018-09-20 Online: 2019-05-20 Published: 2019-05-09
  • Contact: Lu-ping LIU E-mail: 1515742050@qq.com; 529282048@qq.com
  • About the author: An-min ZHOU (born 1963), male, research fellow; research interests: security defense and management. E-mail: 1515742050@qq.com
  • Supported by:
    National Key Basic Research Development Program (2017YFB0802900)

Abstract:

To detect malicious Office documents (*.docx, *.rtf) more accurately, this paper proposes a detection method based on the entropy time series of a document. The method converts the binary-level differences between malicious and benign documents into differences between the power spectra of their file entropy time series, and then applies three machine learning methods, IBK, Random Committee (RC), and Random Forest (RF), to learn from and classify the data. Experimental results show that accuracy reaches 92.14% for docx documents, which are built on XML and compression, and 98.20% for rich text format (rtf) files.

Key words: entropy time series, power spectrum, machine learning, malicious document detection
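
The feature extraction described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the window size, the step, and the use of a plain periodogram (rather than any particular spectral estimator) are assumptions, and the function names are hypothetical. The classification stage with IBK, Random Committee, and Random Forest is not shown.

```python
import cmath
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chunk).values())


def entropy_series(data: bytes, window: int = 256, step: int = 128) -> list[float]:
    """Slide a fixed-size window over the file and record each block's entropy.

    The window and step values are illustrative assumptions.
    """
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


def power_spectrum(series: list[float]) -> list[float]:
    """Periodogram of the mean-removed series using a naive DFT.

    The naive DFT is quadratic in the series length, which is acceptable
    for a short demonstration but not for large files.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


if __name__ == "__main__":
    # Synthetic input: alternating low-entropy and high-entropy regions.
    low = bytes(512)              # all zero bytes
    high = bytes(range(256)) * 2  # every byte value equally frequent
    sample = (low + high) * 4
    series = entropy_series(sample)
    print(series[:8])
    print(power_spectrum(series)[:4])
```

In a full pipeline, the spectrum values for each file would serve as the feature vector passed to the classifiers; how the authors normalize or truncate the spectrum is not stated on this page.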

CLC number: TP39

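Figure 1: Structure of a docx document

Figure 2: Entropy time series

Figure 3: Experimental procedure

Figure 4: Power spectrum comparison

Figure 5: Accuracy test results

Figure 6: Recall test results

Figure 7: F-measure test results

Figure 8: Comparison of detection capability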
