您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(理学版)》

《山东大学学报(理学版)》 ›› 2019, Vol. 54 ›› Issue (5): 37-43.doi: 10.6040/j.issn.1671-9352.2.2018.136

•   • 上一篇    下一篇

基于连续特征的未知协议消息聚类算法

卢政宇(),李光松,申莹珠,张彬   

  1. 信息工程大学, 河南 郑州 450001
  • 收稿日期:2018-09-20 出版日期:2019-05-20 发布日期:2019-05-09
  • 作者简介:卢政宇(1993—),男,硕士研究生,研究方向为密码协议分析. E-mail:441840145@qq.com
  • 基金资助:
    国家重点研发计划资助项目(2016YFB080010);国家重点研发计划资助项目(2016YFB0800100);国家自然科学基金创新研究群体项目(61521003)

Unknown protocol message clustering algorithm based on continuous features

Zheng-yu LU(),Guang-song LI,Ying-zhu SHEN,Bin ZHANG   

  1. Information Engineering University, Zhengzhou 450001, Henan, China
  • Received:2018-09-20 Online:2019-05-20 Published:2019-05-09
  • Supported by:
    国家重点研发计划资助项目(2016YFB080010);国家重点研发计划资助项目(2016YFB0800100);国家自然科学基金创新研究群体项目(61521003)

摘要:

对未知协议消息序列进行聚类处理是分析协议格式的基础。从字符串匹配的角度出发,利用协议格式字段的连续性,在传统K-均值算法基础上提出一种基于连续特征的未知协议消息聚类算法。首先基于协议格式字段连续性对待测数据集进行粗聚类,提取出K-均值算法的初始聚类中心,再使用消息距离及收敛函数改进的迭代算法对数据进行迭代处理实现消息的进一步聚类。实验表明,提出的新方法与传统K-均值算法相比,在聚类准确度上提升了17.58%,迭代次数上减少了约58.27%,与EM算法、DBSCAN算法相比在聚类准确率与时间上均有明显提升。

关键词: 未知协议, 协议逆向分析, K-均值算法, 初始聚类中心

Abstract:

Clustering of unknown protocol message sequences is the basis for analyzing the protocol format. Based on the traditional K-means algorithm, this paper proposed a continuous-features-based unknown feature message clustering algorithm, which combines the continuity of protocol format field with string matching method. Firstly, coarsely cluster the measured data sets based on the continuity of the protocol format field, extract the initial clustering center of the K-means algorithm, and then iteratively process the data using the message distance and the iterative algorithm with improved convergence function to further cluster the message. Experiments show that compared with the traditional K-means algorithm, the proposed new method improves the accuracy of clustering by 17.58% and reduces the number of iterations by about 58.27%. Compared with EM algorithm and DBSCAN algorithm, the clustering accuracy and time are significantly improved.

Key words: unknown protocol, protocol reverse analysis, K-means algorithm, initial clustering centers

中图分类号: 

  • TP181

图1

初始聚类中心准确率"

图2

迭代次数降低比率"

图3

聚类准确度对比"

图4

时间消耗"

表1

噪声分离"

算法名称NWKK-meansEMDBSCAN
噪声分离功能

图5

NWK算法噪声分离结果"

1 吴礼发, 洪征, 潘璠. 网络协议逆向分析及应用[M]. 北京: 国防工业出版社, 2016.
WU Lifa , HONG Zheng , PAN Pan . Network protocol reverse analysis and application[M]. Beijing: National Defense Industry Press, 2016.
2 MACQUEEN J. Some methods for classification and analysis of multivariate observations[J]. Proc of Berkeley Symposium on Mathematical Statistics & Probability. Berkley: California Press, 1965: 281-297.
3 NEEDLEMAN S B , WUNSCH C D . A general method applicable to the search for similarities in the amino acid sequence of two proteins[J]. Journal of Molecular Biology, 1970, 48 (3): 443- 453.
doi: 10.1016/0022-2836(70)90057-4
4 CUI W, KANNAN J, WANG H J. Discoverer: automatic protocol reverse engineering from network traces[C]//Proceedings of 16th USENIX Security Symposium, Boston: Usenix Security, 2007.
5 HUANG Q, LEE P P C, ZHANG Z B. Exploiting intra-packet dependency for fine-grained protocol format inference[C]//2015 IFIP Networking Conference Toulouse: IEEE, 2015: 1-9.
6 SIJA B D , GOO Y H , SHIM K S , et al. A survey of automatic protocol reverse engineering approaches, methods, and tools on the inputs and outputs view[J]. Security and Communication Networks, 2018, 2018: 1- 17.
7 何超, 刘方, 曾曦, 等. 针对未知协议消息序列的聚类分析实现[J]. 通信技术, 2017, 50 (2): 277- 286.
doi: 10.3969/j.issn.1002-0802.2017.02.014
HE Chao , LIU Fang , ZENG Xi , et al. Clustering analysis of unknown protocol message sequence[J]. Communications Technology, 2017, 50 (2): 277- 286.
doi: 10.3969/j.issn.1002-0802.2017.02.014
8 GUSFIELD Dan . Algorithms on stings, trees, and sequences[J]. ACM SIGACT News, 1997, 28 (4): 41- 60.
doi: 10.1145/270563
9 YANG Q , WU X D . 10 challenging problems in data mining research[J]. International Journal of Information Technology & Decision Making, 2006, 5 (4): 597- 604.
10 SMYTH P . Clustering sequences with hidden markov models[J]. NIPS, 1997, 9: 648- 654.
11 OREBAUGH A , RAMIREZ G , BURKE J , et al. Wireshark & Ethereal network protocol analyzer toolkit[M]. Cambridge: Elsevier, 2007.
12 JOHNSON L . Weavings Women Doing Theology in Oceania[M]. Fiji: Wcavers, 2003.
13 王维彬, 钟润添. 一种基于贪心EM算法学习GMM的聚类算法[J]. 计算机仿真, 2007, 24 (2): 65- 68.
doi: 10.3969/j.issn.1006-9348.2007.02.018
WANG Weibin , ZHONG Runtian . A clustering algorithm based on greedy EM algorithm learning GMM[J]. Computer Simulation, 2007, 24 (2): 65- 68.
doi: 10.3969/j.issn.1006-9348.2007.02.018
14 ESTER M, KRIEGEL HP, XU X. A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise[C]//International Conference on Knowledge Discovery and Data Mining. Palo Alto: AAAI Press, 1996: 226-231.
15 BAI L , CHENG X Q , LIANG J Y , et al. Fast density clustering strategies based on the K-means algorithm[J]. Pattern Recognition, 2017, 71: 375- 386.
doi: 10.1016/j.patcog.2017.06.023
16 TÎRNĂUCĂ C , GÓMEZ-PÉREZ D , BALCÁZAR J L , et al. Global optimality in k-means clustering[J]. Information Sciences, 2018, 439/440: 79- 94.
doi: 10.1016/j.ins.2018.02.001
17 NAZNIN F , SARKER R , ESSAM D . Progressive alignment method using genetic algorithm for multiple sequence alignment[J]. IEEE Transactions on Evolutionary Computation, 2012, 16 (5): 615- 631.
doi: 10.1109/TEVC.2011.2162849
18 LEE C T , PENG S L . A pairwise alignment algorithm for long sequences of high similarity[M]. Singapore: Springer, 2017: 279- 287.
19 WIDE MAWI Working Group[EB/OL].[2018-5-20]. http://mawi.wide.ad.jp/.
[1] 刘国涛,张燕平,徐晨初. 一种优化覆盖中心的三支决策模型[J]. 山东大学学报(理学版), 2017, 52(3): 105-110.
[2] 及歆荣,侯翠琴,侯义斌,赵斌. 基于筛选机制的L1核学习机分布式训练方法[J]. 山东大学学报(理学版), 2016, 51(9): 137-144.
[3] 吴修国 曾广周. 目标描述逻辑研究[J]. J4, 2009, 44(11): 68-74.
[4] 许传轲 陈月辉 赵亚欧. 基于改进伪氨基酸组成的蛋白质相互作用预测[J]. J4, 2009, 44(9): 17-21.
[5] 修明 陈保会 史开泉. F-粗规律与它的属性控制[J]. J4, 2008, 43(12): 66-72.
[6] 张 凌,邱育锋,任雪芳, . 基于单向S-粗集的知识堆垒与知识垛识别[J]. J4, 2008, 43(10): 12-17 .
[7] 付海艳,卢昌荆,史开泉 . (F,F-)-规律推理与规律挖掘[J]. J4, 2007, 42(7): 54-57 .
[8] 张 凌,邱育峰,任雪芳 . 粗规律生成与它的分离[J]. J4, 2008, 43(3): 58-63 .
[9] 付海艳,史开泉 . 知识过滤与属性f-迁移依赖[J]. J4, 2007, 42(10): 54-58 .
[10] 翟俊海, 张垚, 王熙照. 相容粗糙模糊集模型[J]. 山东大学学报(理学版), 2014, 49(08): 73-79.
[11] 张聪, 于洪. 一种三支决策软增量聚类算法[J]. 山东大学学报(理学版), 2014, 49(08): 40-47.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 罗斯特,卢丽倩,崔若飞,周伟伟,李增勇*. Monte-Carlo仿真酒精特征波长光子在皮肤中的传输规律及光纤探头设计[J]. J4, 2013, 48(1): 46 -50 .
[2] 史开泉. 信息规律智能融合与软信息图像智能生成[J]. 山东大学学报(理学版), 2014, 49(04): 1 -17 .
[3] 裴胜玉,周永权*. 一种基于混沌变异的多目标粒子群优化算法[J]. J4, 2010, 45(7): 18 -23 .
[4] 赵君1,赵晶2,樊廷俊1*,袁文鹏1,3,张铮1,丛日山1. 水溶性海星皂苷的分离纯化及其抗肿瘤活性研究[J]. J4, 2013, 48(1): 30 -35 .
[5] 杨永伟1,2,贺鹏飞2,李毅君2,3. BL-代数的严格滤子[J]. 山东大学学报(理学版), 2014, 49(03): 63 -67 .
[6] 孟祥波1,张立东1,杜子平2. 均值-方差标准下带跳的保险公司投资与再保险策略[J]. 山东大学学报(理学版), 2014, 49(05): 36 -40 .
[7] 廖明哲. 哥德巴赫的两个猜想[J]. J4, 2013, 48(2): 1 -14 .
[8] 汤晓宏1,胡文效2*,魏彦锋2,蒋锡龙2,张晶莹2,. 葡萄酒野生酿酒酵母的筛选及其生物特性的研究[J]. 山东大学学报(理学版), 2014, 49(03): 12 -17 .
[9] 王开荣,高佩婷. 建立在DY法上的两类混合共轭梯度法[J]. 山东大学学报(理学版), 2016, 51(6): 16 -23 .
[10] 张明明,秦永彬. 基于前序关系的非确定型有穷自动机极小化算法[J]. J4, 2010, 45(7): 34 -38 .