Journal of Shandong University (Natural Science)

J4 ›› 2011, Vol. 46 ›› Issue (5): 71-76.

• SEWM 2011 Conference •

Research on spam detection techniques based on clustering

JIANG Sheng-yi1, PANG Guan-song2, ZHANG Jian-jun3

  1. School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510420, Guangdong, China;
  2. School of Management, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China;
  3. College of Science, Naval University of Engineering, Wuhan 430033, Hubei, China
  • Received: 2010-12-06 Published: 2011-05-25
  • About the author: JIANG Sheng-yi (1963- ), male, professor, master's supervisor, Ph.D.; main research interests are data mining and natural language processing. Email: jiangshengyi@163.com
  • Funding:

    National Natural Science Foundation of China (61070061); Natural Science Foundation of Guangdong Province (9151026005000002); Guangdong Province High-Level Talent Project; Graduate Innovation Team Project of Guangdong University of Foreign Studies (10GWCXTD-08)

Abstract:

As the volume of spam email keeps rising, detecting it effectively has become an important problem. To overcome the shortcomings of k-nearest neighbor (kNN) classification in spam detection, an improved kNN detection approach based on clustering is proposed. First, a single-pass clustering algorithm based on the minimum-distance principle divides the training emails into hyperspheres of nearly equal size, each containing texts from one or more categories. Second, the clusters (hyperspheres) are labeled by majority voting, that is, each cluster takes the category to which most of its texts belong, and the detection model consists of the labeled clusters. Finally, incoming emails are classified automatically using the nearest-neighbor idea. Experimental results show that the proposed approach substantially reduces the number of email similarity computations and performs better than TiMBL, Naïve Bayesian, and Stacking. Furthermore, the detection model can be updated incrementally, which gives the approach a degree of practical value in real-world applications.

Key words: spam detection; kNN text categorization; single pass clustering; incremental modeling
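
The three steps summarized in the abstract (single-pass clustering, majority-vote labeling, nearest-neighbor classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the distance measure, the threshold, or the feature representation, so the Euclidean `distance`, the `radius` parameter, the centroid-based cluster representation, and the names `Cluster` and `ClusterSpamModel` are all assumptions made for illustration. The sketch also assigns an email to the single nearest cluster, which is only one possible reading of "nearest-neighbor idea".

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Cluster:
    """One hypersphere: a running vector sum plus per-class member counts."""

    def __init__(self, vector, label):
        self.total = list(vector)       # sum of member vectors
        self.size = 1
        self.votes = Counter([label])   # class counts for majority voting

    @property
    def centroid(self):
        return [t / self.size for t in self.total]

    @property
    def label(self):
        # Majority vote: the class with the most members labels the cluster.
        return self.votes.most_common(1)[0][0]

    def add(self, vector, label):
        self.total = [t + v for t, v in zip(self.total, vector)]
        self.size += 1
        self.votes[label] += 1


class ClusterSpamModel:
    """Hypothetical model: labeled clusters built in one pass over the data."""

    def __init__(self, radius):
        self.radius = radius            # assumed threshold bounding cluster size
        self.clusters = []

    def _nearest(self, vector):
        return min(self.clusters, key=lambda c: distance(c.centroid, vector))

    def update(self, vector, label):
        """Join the nearest cluster if it is within radius, else open a new one."""
        if self.clusters:
            nearest = self._nearest(vector)
            if distance(nearest.centroid, vector) <= self.radius:
                nearest.add(vector, label)
                return
        self.clusters.append(Cluster(vector, label))

    def fit(self, vectors, labels):
        for vector, label in zip(vectors, labels):
            self.update(vector, label)
        return self

    def predict(self, vector):
        """Classify by the label of the nearest cluster (one distance per cluster)."""
        return self._nearest(vector).label


# Toy two-dimensional "email vectors"; real features would come from the text.
train = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.0, 0.1]]
labels = ["ham", "ham", "spam", "spam", "spam"]
model = ClusterSpamModel(radius=0.5).fit(train, labels)
print(len(model.clusters), model.predict([0.95, 0.9]))
```

Because `predict` compares an email against cluster centroids rather than against every training email, the number of distance computations scales with the number of clusters, which is the kind of saving the abstract describes. Calling `update` on a new labeled email extends the model without retraining, which mirrors the incremental modeling mentioned in the key words.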
