Journal of Shandong University (Natural Science), 2019, Vol. 54, Issue 7: 57-67. doi: 10.6040/j.issn.1671-9352.1.2018.077

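Review spam detection based on the two-level stacking classification model

Xiang-wen LIAO1,2,3,*, Yang XU1,2,3, Jing-jing WEI4, Ding-da YANG1,2,3, Guo-long CHEN1,2,3

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, Fujian, China
  2. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, Fujian, China
  3. Digital Fujian Institute of Financial Big Data, Fuzhou 350116, Fujian, China
  4. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, Fujian, China
  • Received: 2018-10-17  Published: 2019-07-20  Released: 2019-06-27
  • Corresponding author: Xiang-wen LIAO  E-mail: liaoxw@fzu.edu.cn
  • About the author: Xiang-wen LIAO (born 1980), male, PhD, associate professor. His research interests are information retrieval, opinion mining and sentiment analysis, and natural language processing. E-mail: liaoxw@fzu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61772135); National Natural Science Foundation of China (U1605251); Natural Science Foundation of Fujian Province (2017J01755); Open Fund of the CAS Key Laboratory of Network Data Science and Technology (CASNDST201708); Open Fund of the CAS Key Laboratory of Network Data Science and Technology (CASNDST201606); Director's Fund of the Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education (2017KF01)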

Abstract:

Existing methods for review spam detection incur high complexity when extracting user behavior relationships and when extracting features with neural networks. In addition, online reviews are short texts whose non-standard writing makes textual features hard to extract during training, and existing methods give insufficient consideration to imbalanced data distributions. To address these problems, we propose a review spam detection method based on a two-level stacking classification model. First, a matrix built from triplets represents the relationships between users, and principal component analysis yields a low-dimensional representation of these relationships, which characterizes the behavioral differences among users in the review data and reduces computational complexity. Then, paragraph vector representations of the reviews, together with computed discrete features (including text similarity and information entropy), address the difficulty of extracting textual features. Finally, the three representations are concatenated into an overall feature representation that fuses textual and behavioral features. An ensemble learning approach is used to construct a two-level stacking classification model that classifies the reviews, with the aim of improving detection performance on imbalanced datasets. Experiments on the Yelp 2013 review dataset show that the proposed method improves F1 by 1.7 to 5.2 percentage points over the best existing baseline, with the largest gains on the imbalanced datasets.

Key words: review spam detection, feature fusion, ensemble learning, principal component analysis
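
The first step described in the abstract (a relation matrix built from triplets, then reduced by principal component analysis) can be illustrated with a small sketch. It is not the authors' implementation: the (user, product, rating) triplet layout, the function names `build_matrix` and `first_principal_component`, and the use of power iteration to keep only the top component are all assumptions made for illustration, and the paper's own matrix construction is not shown on this page.

```python
import math

def build_matrix(triplets, users, products):
    """Fill a user-by-product matrix from (user, product, rating) triplets (assumed layout)."""
    row = {u: i for i, u in enumerate(users)}
    col = {p: j for j, p in enumerate(products)}
    matrix = [[0.0] * len(products) for _ in users]
    for user, product, rating in triplets:
        matrix[row[user]][col[product]] = float(rating)
    return matrix

def first_principal_component(rows, iters=200):
    """Return (direction, scores) for the top principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # asymmetric start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero variance: nothing to project
            break
        v = [x / norm for x in w]
    # One-dimensional representation of each user.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data: three users rating two products.
triplets = [("u1", "p1", 1), ("u1", "p2", 1), ("u2", "p1", 2),
            ("u2", "p2", 2), ("u3", "p1", 3), ("u3", "p2", 3)]
matrix = build_matrix(triplets, ["u1", "u2", "u3"], ["p1", "p2"])
direction, scores = first_principal_component(matrix)
```

Keeping more than one component, as the 150-dimensional target in the parameter table implies, would in practice call for a full eigendecomposition or SVD from a numerical library rather than this single-component loop.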

CLC number: TP391

Figure 1. Review spam detection based on the two-level stacking classification model

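Table 1. Discrete features

| Category | Feature |
|---|---|
| User behavior discrete features | Number of followers of the user (FC) |
| | Maximum number of reviews the user posted in one day (MN) |
| | Proportion of the user's extreme reviews, i.e. 1-star or 5-star (ER) |
| | Information entropy of the user's rating distribution (RE) |
| Review text discrete features | Average rating difference between this review and other reviews of the same product (AR) |
| | Whether the review is extreme, i.e. 1-star or 5-star (IE) |
| | Maximum cosine similarity between this review and reviews of the same product (MC) |
| | Number of likes the review received (AN) |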
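
Several of the discrete features above are simple statistics, and the following sketch shows one way they might be computed. The function names, the base-2 logarithm for the entropy, and the whitespace bag-of-words tokenization for the cosine similarity are assumptions; the page does not specify how the authors computed them.

```python
import math
from collections import Counter

def rating_entropy(ratings):
    """RE: Shannon entropy (base 2, an assumption) of a user's star ratings."""
    n = len(ratings)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(ratings).values())

def extreme_ratio(ratings):
    """ER: share of a user's ratings that are 1 or 5 stars."""
    return sum(r in (1, 5) for r in ratings) / len(ratings)

def max_reviews_per_day(dates):
    """MN: largest number of reviews the user posted on a single day."""
    return max(Counter(dates).values())

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def max_cosine_similarity(review, others):
    """MC: highest similarity between a review and other reviews of the same product."""
    bag = Counter(review.lower().split())
    return max((cosine(bag, Counter(o.lower().split())) for o in others),
               default=0.0)

# Hypothetical reviewer who only ever posts 5-star ratings.
only_five = [5, 5, 5, 5]
er, re_value = extreme_ratio(only_five), rating_entropy(only_five)
```

A reviewer whose ratings are all extreme and identical gets ER = 1.0 and RE = 0.0 under these definitions, which is the kind of behavioral signal these two features are meant to expose.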

Figure 2. Construction of the classification model based on ensemble learning

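Table 2. Review spam detection algorithm based on the two-level stacking classification model

Input: review data set X = {x1, x2, …, xn}, preset parameters
Output: detection result set Y = {y1, y2, …, yn}
1: Initialize the model parameters;
2: To obtain the low-dimensional user relation vector T, apply principal component analysis and singular value decomposition (SVD) using Eq. (3);
3: Use Doc2vec to obtain the paragraph vector representation Di of each review's context;
4: Compute the discrete feature set Li from user interaction behavior;
5: Concatenate the features, F = concatenate{T, D, L}, to obtain the overall feature representation;
6: Perform model selection over the base models and construct the two-level stacking classifier;
7: Run cross-validated prediction using Eq. (9); the results serve as the new feature representation;
8: Train the fusion model using Eq. (15) and classify new data;
9: Output the result Y = {y1, y2, …, yn}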
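
Steps 6 to 8 follow the usual stacking pattern: base models produce out-of-fold predictions, and a second-level model is trained on those predictions. The sketch below illustrates that pattern only. The two toy base learners (`Centroid`, `MeanThreshold`) and the logistic-regression meta model are stand-ins chosen to keep the example dependency-free; they are not the base models or the fusion model of the paper, and all names here are hypothetical.

```python
import math

class Centroid:
    """Toy base model: relative distance to the two class centroids."""
    def fit(self, X, y):
        self.c = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.c[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self
    def score(self, x):
        d0 = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, self.c[0])))
        d1 = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, self.c[1])))
        return d0 / (d0 + d1) if d0 + d1 else 0.5

class MeanThreshold:
    """Toy base model: threshold on the sum of the features."""
    def fit(self, X, y):
        m = {}
        for label in (0, 1):
            sums = [sum(x) for x, t in zip(X, y) if t == label]
            m[label] = sum(sums) / len(sums)
        self.t = (m[0] + m[1]) / 2
        self.sign = 1.0 if m[1] >= m[0] else -1.0
        return self
    def score(self, x):
        return 1.0 if self.sign * (sum(x) - self.t) > 0 else 0.0

class Logistic:
    """Second-level model: logistic regression trained by plain SGD."""
    def fit(self, Z, y, lr=0.5, epochs=500):
        self.w, self.b = [0.0] * len(Z[0]), 0.0
        for _ in range(epochs):
            for z, t in zip(Z, y):
                g = self.prob(z) - t  # gradient of the log loss
                self.w = [w - lr * g * zi for w, zi in zip(self.w, z)]
                self.b -= lr * g
        return self
    def prob(self, z):
        s = sum(w * zi for w, zi in zip(self.w, z)) + self.b
        return 1.0 / (1.0 + math.exp(-s))

def fit_stacking(X, y, base_factories, k=5):
    """Level 1: out-of-fold base scores; level 2: meta model on those scores."""
    n = len(X)
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    oof = [[0.0] * len(base_factories) for _ in range(n)]
    for fold in folds:
        held = set(fold)
        X_tr = [X[i] for i in range(n) if i not in held]
        y_tr = [y[i] for i in range(n) if i not in held]
        for j, make in enumerate(base_factories):
            model = make().fit(X_tr, y_tr)
            for i in fold:  # predict only on rows the model did not see
                oof[i][j] = model.score(X[i])
    meta = Logistic().fit(oof, y)
    bases = [make().fit(X, y) for make in base_factories]  # refit on all data
    return bases, meta, oof

def predict(bases, meta, x):
    """Classify one feature vector: 1 = spam, 0 = genuine."""
    return int(meta.prob([b.score(x) for b in bases]) >= 0.5)

# Hypothetical toy features; the classes are interleaved so every fold sees both.
X = [[0.0, 0.2], [3.0, 2.8], [0.3, 0.1], [2.7, 3.1], [0.1, 0.4],
     [3.2, 2.9], [0.2, 0.0], [2.9, 3.3], [0.4, 0.3], [3.1, 3.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
bases, meta, oof = fit_stacking(X, y, [Centroid, MeanThreshold], k=5)
```

The contiguous folds used here are a simplification; with real, unordered review data a shuffled or stratified split would be the safer choice, so that every training split contains both classes.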

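Table 3. Dataset statistics

| Item | Hotel | Restaurant |
|---|---|---|
| Spam reviews | 802 | 8,368 |
| Non-spam reviews | 4,876 | 50,149 |
| Share of spam reviews | 14.1% | 14.3% |
| Total reviews | 5,678 | 58,517 |
| Total reviewers | 5,124 | 35,593 |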
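
The spam shares in the table follow directly from the counts, and the results tables compare a 50:50 split against the natural distribution (ND). The sketch below recomputes the shares and shows random undersampling as one common way to obtain a balanced subset. Whether the authors built their 50:50 sets this way is not stated on this page, so `undersample_balanced` is an illustrative assumption.

```python
import random

def spam_share(spam, non_spam):
    """Fraction of reviews labelled as spam."""
    return spam / (spam + non_spam)

def undersample_balanced(labels, seed=0):
    """Return sorted indices of a 50:50 subset by undersampling the larger class."""
    rng = random.Random(seed)
    spam = [i for i, t in enumerate(labels) if t == 1]
    genuine = [i for i, t in enumerate(labels) if t == 0]
    small, large = sorted((spam, genuine), key=len)
    return sorted(small + rng.sample(large, len(small)))

hotel_share = round(100 * spam_share(802, 4876), 1)             # hotel counts from the table
restaurant_share = round(100 * spam_share(8368, 50149), 1)      # restaurant counts from the table

# Hypothetical labels: 2 spam reviews among 10.
balanced_idx = undersample_balanced([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
```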

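Table 4. Experimental parameter settings

| Parameter type | Tunable parameter | Value |
|---|---|---|
| PCA | Target dimension of the first reduction | 150 |
| | Target dimension of the second reduction | (M+N)/2 |
| Doc2vec | Paragraph vector dimension | 80 |
| | Word vector training model | c-bow |
| Stacking classifier | Number of cross-validation folds | 5 |
| XGBoost classifier | Learning rate | 0.005 |
| | Maximum depth | 5 |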

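Table 5. Evaluation results for each model (all values in %; ND = natural distribution)

| Method | Data distribution | Hotel Precision | Hotel Recall | Hotel F1 | Hotel Accuracy | Restaurant Precision | Restaurant Recall | Restaurant F1 | Restaurant Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| M_BF+BIGRAM | 50:50 | 82.8 | 86.9 | 84.8 | 85.1 | 82.8 | 88.5 | 85.6 | 83.3 |
| | ND | 46.5 | 82.5 | 59.4 | 84.9 | 48.2 | 87.9 | 62.3 | 78.5 |
| HAAT | 50:50 | 61.3 | 54.7 | 57.8 | 64.4 | 69.4 | 59.0 | 63.8 | 66.5 |
| | ND | 32.7 | 53.1 | 40.8 | 56.4 | 35.9 | 78.9 | 48.1 | 68.3 |
| SPEAGLE | 50:50 | 75.7 | 83.0 | 79.1 | 81.0 | 80.5 | 83.2 | 81.8 | 82.5 |
| | ND | 26.5 | 56.0 | 36.0 | 80.4 | 48.2 | 70.5 | 58.6 | 82.0 |
| Rescal+BIGRAM | 50:50 | 84.2 | 89.9 | 87.0 | 86.5 | 86.8 | 91.8 | 89.2 | 89.9 |
| | ND | 48.2 | 85.0 | 61.5 | 85.9 | 58.2 | 90.3 | 70.8 | 87.8 |
| Our_Method | 50:50 | 87.3 | 90.7 | 88.9 | 88.8 | 88.7 | 93.2 | 90.9 | 90.7 |
| | ND | 52.0 | 90.0 | 65.9 | 86.6 | 64.6 | 92.4 | 76.0 | 88.3 |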
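
The F1 column can be checked against precision and recall with the standard formula F1 = 2PR / (P + R). The sketch below recomputes two Our_Method cells and the F1 gain over Rescal+BIGRAM in each of the four settings. Treating Rescal+BIGRAM as the reference baseline is an assumption, made because it has the highest F1 among the baselines in the table; the gains computed this way appear to match the 1.7 to 5.2 range quoted in the abstract.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Our_Method, hotel dataset (precision and recall taken from the table, in %).
hotel_balanced_f1 = f1(87.3, 90.7)   # table reports 88.9
hotel_nd_f1 = f1(52.0, 90.0)         # table reports 65.9

# F1 of Our_Method and Rescal+BIGRAM per setting, copied from the table.
ours = {"hotel 50:50": 88.9, "hotel ND": 65.9,
        "restaurant 50:50": 90.9, "restaurant ND": 76.0}
rescal_bigram = {"hotel 50:50": 87.0, "hotel ND": 61.5,
                 "restaurant 50:50": 89.2, "restaurant ND": 70.8}

# Absolute F1 gain in percentage points for each setting.
gains = {k: round(ours[k] - rescal_bigram[k], 1) for k in ours}
```

Because the reported precision and recall are already rounded to one decimal, a recomputed F1 can differ from the printed value by up to about 0.1.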

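Table 6. Detection results for different feature extraction methods (all values in %; ND = natural distribution)

| Method | Data distribution | Hotel Precision | Hotel Recall | Hotel F1 | Hotel Accuracy | Restaurant Precision | Restaurant Recall | Restaurant F1 | Restaurant Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Rescal | 50:50 | 83.4 | 87.0 | 84.5 | 84.1 | 82.8 | 88.5 | 85.6 | 83.3 |
| | ND | 44.3 | 87.7 | 58.8 | 83.6 | 53.9 | 88.3 | 66.9 | 83.4 |
| Our_Method(SVM) | 50:50 | 85.4 | 86.0 | 85.7 | 85.7 | 86.8 | 93.1 | 89.8 | 89.5 |
| | ND | 48.6 | 86.3 | 62.1 | 85.0 | 61.3 | 92.4 | 73.7 | 87.3 |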

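Table 7. Typical discrete feature values for two classes of reviews

| Review | FC | MN | ER | RE | AR | IE | MC | AN |
|---|---|---|---|---|---|---|---|---|
| A | 19.0 | 3.0 | 0.14 | 0.79 | 1.8 | 0.0 | 0.02 | 5.0 |
| B | 0.0 | 1.0 | 1.0 | 0.0 | 2.6 | 1.0 | 0.09 | 0.0 |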

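Table 8. Effect of including the discrete features (all values in %; ND = natural distribution)

| Method | Metric | Hotel (50:50) | Hotel (ND) | Restaurant (50:50) | Restaurant (ND) |
|---|---|---|---|---|---|
| Without discrete interaction features | F1 | 87.2 | 58.8 | 90.3 | 73.1 |
| | Accuracy (A) | 86.2 | 80.1 | 89.2 | 86.6 |
| With discrete interaction features (all features) | F1 | 88.9 | 65.9 | 90.9 | 76.0 |
| | Accuracy (A) | 88.8 | 86.6 | 90.7 | 88.3 |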

Figure 3. Comparison of F1 values for different classifier models

Figure 4. Effect of dataset distribution on F1
