Journal of Shandong University (Natural Science) ›› 2017, Vol. 52 ›› Issue (7): 73-79. doi: 10.6040/j.issn.1671-9352.1.2016.041
李润川1,2,3,昝红英1,申圣亚4,毕银龙1,张中军5
LI Run-chuan1,2,3, ZAN Hong-ying1, SHEN Sheng-ya4, BI Yin-long1, ZHANG Zhong-jun5
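Abstract: Spam SMS has become an increasingly serious problem in people's daily lives. SMS messages are short texts with sparse features, and spam messages in particular often have irregular structure and content in order to evade filtering mechanisms, so traditional text feature extraction methods are not fully suited to SMS classification. This paper extracts features from two perspectives, the structure and the semantics of a message, builds a semantic feature lexicon, and represents SMS text as vectors using a multi-feature fusion method. To address the noise and class imbalance present in the SMS dataset, the performance of naive Bayes (NB), support vector machine (SVM), decision tree (DT), logistic regression (LR), multilayer perceptron (MLP), and random forest (RF) classifiers is compared. Experiments show that the RF classification algorithm effectively reduces the impact of noise and data imbalance. The method was validated on the dataset provided for the CCF 2015 "中国好创意" competition task "垃圾短信基于文本内容识别" (content-based spam SMS recognition) and achieved good results.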
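The multi-feature fusion idea in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not list the actual structural features or the contents of the semantic feature lexicon, so the four structural cues, the `SEMANTIC_LEXICON` categories and words, and all function names below are assumptions made for this example.

```python
import re
from typing import Dict, List

# Hypothetical semantic lexicon (category -> trigger words).
# The paper's real lexicon is not given in the abstract.
SEMANTIC_LEXICON: Dict[str, List[str]] = {
    "promotion": ["优惠", "打折", "免费"],
    "finance": ["贷款", "发票", "利息"],
    "contact": ["咨询", "热线", "详询"],
}


def structural_features(text: str) -> List[float]:
    """Assumed structure-level cues: length, digit ratio, symbol ratio,
    and whether the message contains a run of 7 or more digits."""
    n = max(len(text), 1)  # avoid division by zero on empty input
    digits = sum(ch.isdigit() for ch in text)
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    has_long_number = 1.0 if re.search(r"\d{7,}", text) else 0.0
    return [float(len(text)), digits / n, symbols / n, has_long_number]


def semantic_features(text: str) -> List[float]:
    """One count per lexicon category: occurrences of its trigger words."""
    return [
        float(sum(text.count(word) for word in words))
        for words in SEMANTIC_LEXICON.values()
    ]


def vectorize(text: str) -> List[float]:
    """Fuse both feature groups by concatenating them into one vector."""
    return structural_features(text) + semantic_features(text)


if __name__ == "__main__":
    for message in ["免费贷款，详询13800000000", "今晚一起吃饭吗"]:
        print(message, vectorize(message))
```

The fused vectors could then be passed to any of the classifiers compared in the paper. Concatenation is the simplest fusion choice and is used here only for illustration.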