
Journal of Shandong University (Natural Science) ›› 2017, Vol. 52 ›› Issue (7): 73-79. doi: 10.6040/j.issn.1671-9352.1.2016.041


Spam message identification based on multi-feature fusion

LI Run-chuan1,2,3, ZAN Hong-ying1, SHEN Sheng-ya4, BI Yin-long1, ZHANG Zhong-jun5

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450000, Henan, China;
  2. Cooperative Innovation Center of Internet Healthcare, Zhengzhou University, Zhengzhou 450000, Henan, China;
  3. Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou 450000, Henan, China;
  4. School of Foreign Language, Zhengzhou University, Zhengzhou 450000, Henan, China;
  5. School of Computer Science and Technology, Zhoukou Normal University, Zhoukou 466001, Henan, China
  • Received: 2016-11-25 Online: 2017-07-20 Published: 2017-07-07
  • About the author: LI Run-chuan (born 1991), male, master's student; research interests: machine learning, natural language processing, and smart healthcare. E-mail: runchuanli@foxmail.com
  • Supported by:
    National Social Science Foundation of China (14BYY096); National Natural Science Foundation of China (61402419); National High Technology Research and Development Program of China (863 Program) (2012AA011101); National Key Basic Research Program of China (973 Program) (2014CB340504)

Abstract: Spam messages have increasingly become a serious problem affecting people's daily lives. A short message is a short text with sparse features, and spam messages in particular often have irregular structure and content in order to evade filtering mechanisms, so traditional text feature extraction methods are not fully applicable to short-message classification. This paper extracts features from two perspectives, the structure and the semantics of short messages, builds a semantic feature word list, and uses a multi-feature fusion method to represent short-message text as vectors. To address the noise and data imbalance present in the short-message dataset, the paper compares the performance of the NB, SVM, DT, LR, MLP, and RF classifiers. Experiments show that the RF classification algorithm can effectively reduce the impact of noise and data imbalance. The method was validated on the dataset provided for the CCF 2015 China Creative Competition task "Spam Message Based on Text Content Recognition" and achieved good results.

Key words: spam message, data imbalance, random forest, multi-feature fusion

CLC number:

  • TP391
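
The multi-feature fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word list `SEMANTIC_LEXICON`, the four structural features, and the function names are all hypothetical, chosen only to show how structure-level and semantics-level features might be concatenated into one vector per message.

```python
import re

# Hypothetical semantic feature word list; the paper's actual list is not shown on this page.
SEMANTIC_LEXICON = ["free", "winner", "discount", "invoice", "click"]

# Simple pattern for links; an assumption, not the paper's rule.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def structural_features(text):
    """Structure-level features (illustrative choices, not the paper's feature set)."""
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    return [
        float(n),                              # message length in characters
        digits / n if n else 0.0,              # share of digit characters
        symbols / n if n else 0.0,             # share of punctuation and symbols
        1.0 if URL_RE.search(text) else 0.0,   # whether a link is present
    ]


def semantic_features(text, lexicon=SEMANTIC_LEXICON):
    """Occurrence count of each word from the semantic feature word list."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in lexicon]


def fuse_features(text):
    """Concatenate structural and semantic features into a single vector."""
    return structural_features(text) + semantic_features(text)


spam_vec = fuse_features("WINNER! Click www.example.com for a free discount, call 5551234")
ham_vec = fuse_features("See you at lunch tomorrow")
```

Vectors built this way could then be passed to any of the classifiers the abstract compares, for example a random forest; handling of class imbalance would be a separate choice made at the classifier stage.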