
Journal of Shandong University (Natural Science), 2017, Vol. 52, Issue (7): 44-51. doi: 10.6040/j.issn.1671-9352.1.2016.PC6


A semi-supervised spam review classification method based on heuristic rules

ZHANG Peng1, WANG Su-ge1,2*, LI De-yu1,2, WANG Jie1

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China;
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, Shanxi, China
  • Received: 2016-11-25 Online: 2017-07-20 Published: 2017-07-07
  • Corresponding author: WANG Su-ge (born 1964), female, professor, doctoral supervisor; research interest: text mining. E-mail: wsg@sxu.edu.cn
  • About the author: ZHANG Peng (born 1988), male, Ph.D. candidate; research interest: sentiment analysis. E-mail: zhpeng@sxu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61573231, 61632011, 61672331); Shanxi Province Science and Technology Basic Conditions Platform Program (2015091001-0102)

Abstract: The Internet has become part of everyday life, and e-commerce platforms such as group-buying sites, online stores, and online consumption services are now among the most popular ways to shop. Almost every e-commerce platform allows and encourages users to review products or services after purchase, and these user-generated reviews are highly valuable to both potential consumers and merchants. As a result, spam reviews, including advertisements and fake reviews, are deliberately mixed into user reviews to promote products through false publicity or to damage the reputation of other merchants. Against this background, spam review detection and analysis studies how to filter out spam reviews effectively so that genuine reviews retain their value. This paper addresses the spam review identification task set by COAE2015 (COAE2015-TASK4) and, using the corpus resources it provides, proposes a semi-supervised spam review classification method based on heuristic rules. Experimental results show that the proposed method identifies spam reviews effectively while maintaining its accuracy on genuine reviews.

Key words: spam review classification, heuristic rules, semi-supervised learning

CLC Number:

  • TP391
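
The abstract names the ingredients of the approach (heuristic rules plus semi-supervised learning) without giving details, so the following is only a generic sketch of how the two can be combined. It is not the authors' method. The rule patterns, the English tokenizer, the naive Bayes style scorer, the threshold, and every function name are assumptions made for illustration; the COAE2015 corpus is Chinese and would need a proper word segmenter.

```python
import math
import re
from collections import Counter

# Hypothetical heuristic rules: surface patterns that often mark advertising spam.
SPAM_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"https?://", r"\d{7,}", r"discount code")
]


def rule_label(text):
    """Return 'spam' if any rule fires, otherwise None (the rules abstain)."""
    return "spam" if any(p.search(text) for p in SPAM_PATTERNS) else None


def tokens(text):
    """Toy English tokenizer (assumption: stands in for a real segmenter)."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled):
    """Count token occurrences per class from (text, label) pairs."""
    counts = {"spam": Counter(), "normal": Counter()}
    for text, label in labeled:
        counts[label].update(tokens(text))
    return counts


def spam_log_odds(counts, text):
    """Sum of per-token log ratios with add-one smoothing; > 0 leans spam."""
    vocab_size = len(set(counts["spam"]) | set(counts["normal"])) or 1
    n_spam = sum(counts["spam"].values())
    n_norm = sum(counts["normal"].values())
    total = 0.0
    for tok in tokens(text):
        p_spam = (counts["spam"][tok] + 1) / (n_spam + vocab_size)
        p_norm = (counts["normal"][tok] + 1) / (n_norm + vocab_size)
        total += math.log(p_spam / p_norm)
    return total


def self_train(seed, unlabeled, threshold=1.0, max_rounds=5):
    """Rules label what they can; confident predictions are added each round."""
    labeled = list(seed)
    pool = []
    for text in unlabeled:
        label = rule_label(text)
        if label:
            labeled.append((text, label))
        else:
            pool.append(text)

    for _ in range(max_rounds):
        counts = train(labeled)
        remaining = []
        for text in pool:
            s = spam_log_odds(counts, text)
            if s >= threshold:
                labeled.append((text, "spam"))
            elif s <= -threshold:
                labeled.append((text, "normal"))
            else:
                remaining.append(text)  # not confident yet, keep unlabeled
        if len(remaining) == len(pool):
            break  # no new pseudo-labels, stop early
        pool = remaining
    return train(labeled), labeled, pool
```

In this sketch the rules only ever assert the spam class and abstain otherwise, so the normal class has to come from a small hand-labeled seed set. The threshold controls how cautious the self-training loop is: raising it adds fewer pseudo-labels per round but reduces the risk of reinforcing early mistakes.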