Journal of Shandong University (Natural Science), 2017, Vol. 52, Issue (7): 73-79. doi: 10.6040/j.issn.1671-9352.1.2016.041

Spam messages identification based on multi-feature fusion

LI Run-chuan1,2,3, ZAN Hong-ying1, SHEN Sheng-ya4, BI Yin-long1, ZHANG Zhong-jun5   

    1. School of Information Engineering, Zhengzhou University, Zhengzhou 450000, Henan, China;
    2. Cooperative Innovation Center of Internet Healthcare, Zhengzhou University, Zhengzhou 450000, Henan, China;
    3. Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou 450000, Henan, China;
    4. School of Foreign Language, Zhengzhou University, Zhengzhou 450000, Henan, China;
    5. School of Computer Science and Technology of Zhoukou Normal University, Zhoukou 466001, Henan, China
  • Received: 2016-11-25 Online: 2017-07-20 Published: 2017-07-07

Abstract: Spam messages have become an increasingly serious problem affecting people's daily lives. SMS texts are short and sparse, and spam messages in particular often use non-standard structure and content to evade filtering mechanisms, so traditional text feature extraction methods do not fully apply to their classification. This paper extracts feature items from two aspects of a short message, its structure and its semantics, builds a semantic feature list, and uses a multi-feature fusion method to represent SMS text quantitatively. To address the noise and data imbalance present in messages, the paper compares the performance of the NB, SVM, DT, LR, MLP and RF classifiers. The experiments show that the RF classification algorithm can effectively reduce the impact of noise and data imbalance. Experiments on the data set provided by Spam Message Based on Text Content Recognition in the CCF 2015 China Creative Competition show that the method works well.

Key words: data imbalance, multi-feature fusion, spam message, random forest
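
The multi-feature fusion idea in the abstract (structural cues plus counts from a semantic feature list, combined into one numeric representation of a message) might be sketched as follows. This is only an illustration under stated assumptions: the specific structural cues, the English terms in `SEMANTIC_TERMS`, and the function names are all hypothetical and are not taken from the paper, whose data set is Chinese SMS text.

```python
import re

# Hypothetical semantic feature list; the paper's actual list is not reproduced here.
SEMANTIC_TERMS = ["free", "winner", "discount", "click", "prize"]

URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
LONG_DIGITS_RE = re.compile(r"\d{7,}")  # assumed proxy for phone or account numbers


def structural_features(text):
    """Structure-level cues: length, digit ratio, URL flag, long-digit-run flag, symbol count."""
    length = len(text)
    digits = sum(ch.isdigit() for ch in text)
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    return [
        float(length),
        digits / length if length else 0.0,
        1.0 if URL_RE.search(text) else 0.0,
        1.0 if LONG_DIGITS_RE.search(text) else 0.0,
        float(symbols),
    ]


def semantic_features(text, terms=SEMANTIC_TERMS):
    """Occurrence count of each term from the (assumed) semantic feature list."""
    lowered = text.lower()
    return [float(lowered.count(term)) for term in terms]


def fuse_features(text):
    """Fuse both feature groups by simple concatenation into one vector."""
    return structural_features(text) + semantic_features(text)


if __name__ == "__main__":
    print(fuse_features("WINNER! Call 13800001111 now"))
```

Vectors built this way could then be passed to any of the classifiers the abstract compares, for example a random forest implementation from a machine learning library. Concatenation is the simplest form of fusion; the paper may weight or select features differently.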

CLC Number: TP391