Review spam detection based on the two-level stacking classification model

doi:10.6040/j.issn.1671-9352.1.2018.077

Abstract

Review spam detection faces two problems. First, existing methods incur high time and space complexity when extracting user behavior relationships and training neural networks. Second, the non-standard writing style of e-commerce reviews makes contextual features indistinct, and most experiments do not consider the effect of data imbalance. We therefore propose a review spam detection method based on a two-level stacking classification model. The relationship between users and products is represented as a triplet, and principal component analysis is applied to obtain low-dimensional feature representations that characterize user behavior at reduced complexity. The extracted paragraph vector representations, information entropy, and text similarity are then represented as discrete features to compensate for the indistinct contextual features. Finally, the three are concatenated into overall features that combine text and behavioral information, and these features are fed to the two-level stacking classification model to improve performance on unbalanced data. Experiments on the Yelp 2013 dataset show that the F₁ value of the proposed method is 1.7 to 5.2 percentage points higher than that of the state-of-the-art method, and that classification performance improves significantly on the unbalanced dataset.

Key words: review detection, feature fusion, ensemble learning, principal component analysis

CLC Number:

TP391

Xiang-wen LIAO, Yang XU, Jing-jing WEI, Ding-da YANG, Guo-long CHEN. Review spam detection based on the two-level stacking classification model[J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2019, 54(7): 57-67.




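Category	Features
Discrete user-behavior features	Number of followers of the user (FC); maximum number of reviews posted by the user in one day (MN); ratio of the user's extreme reviews, i.e. 1 star or 5 stars (ER); information entropy of the user's rating distribution (RE)
Discrete review-text features	Average rating difference between this review and the other reviews of the same product (AR); whether the review is extreme, i.e. 1 star or 5 stars (IE); maximum cosine similarity between this review and the other reviews of the same product (MC); number of likes the review received (AN)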
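
As a rough illustration of how several of the discrete features in the table above could be computed, the following sketch derives RE, ER, MC and AR from toy inputs. The function names, the base-2 logarithm, the whitespace bag-of-words cosine similarity, and the mean-absolute-difference reading of AR are assumptions made for this example; the source does not specify them.

```python
import math
from collections import Counter


def rating_entropy(ratings):
    """RE: entropy of a user's rating distribution (base 2 is an assumption)."""
    n = len(ratings)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(ratings).values())


def extreme_ratio(ratings):
    """ER: share of a user's reviews that are 1-star or 5-star."""
    return sum(r in (1, 5) for r in ratings) / len(ratings)


def cosine(text_a, text_b):
    """Cosine similarity on whitespace-token counts (assumed vectorization)."""
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def max_cosine(review, other_reviews):
    """MC: highest similarity to any other review of the same product."""
    return max((cosine(review, o) for o in other_reviews), default=0.0)


def avg_rating_diff(rating, other_ratings):
    """AR: mean absolute rating gap to the product's other reviews (assumed)."""
    return sum(abs(rating - r) for r in other_ratings) / len(other_ratings)
```

A user who only ever posts the same extreme rating gets ER = 1 and RE = 0 under these definitions, for example `extreme_ratio([5, 5, 5])` and `rating_entropy([5, 5, 5])`.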

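Item	Hotel	Restaurant
Number of spam reviews	802	8 368
Number of non-spam reviews	4 876	50 149
Share of spam reviews	14.1%	14.3%
Total number of reviews	5 678	58 517
Total number of reviewers	5 124	35 593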
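
The spam shares in the table follow directly from the review counts, which the short sketch below reproduces. It also shows one possible way to build a balanced 50:50 subset by randomly undersampling the larger class; that construction, the function names, and the fixed seed are assumptions, since this page does not describe how the balanced split was produced.

```python
import random


def spam_share(n_spam, n_non_spam):
    """Percentage of spam reviews among all reviews."""
    return 100.0 * n_spam / (n_spam + n_non_spam)


def balanced_sample(spam, non_spam, seed=0):
    """Assumed 50:50 construction: randomly undersample both classes
    to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(spam), len(non_spam))
    return rng.sample(spam, n), rng.sample(non_spam, n)


# Counts copied from the dataset table.
hotel_share = spam_share(802, 4876)
restaurant_share = spam_share(8368, 50149)
```

Rounded to one decimal place, `hotel_share` and `restaurant_share` give the 14.1% and 14.3% listed above.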

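Parameter type	Tunable parameter	Value
PCA	Target dimension of the first reduction	150
PCA	Target dimension of the second reduction	(M+N)/2
Doc2vec	Paragraph vector dimension	80
Doc2vec	Word vector training model	CBOW
Stacking classifier	Number of cross-validation folds	5
XGBoost classifier	Learning rate	0.005
XGBoost classifier	Maximum depth	5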
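
To make the two-level stacking idea with five cross-validation folds more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the nearest-centroid scorer is a hypothetical stand-in for the real base and meta classifiers (the table lists XGBoost), and `fit_centroid_scorer`, `stack_fit`, and `col_groups` are names invented for this example. Level 1 produces out-of-fold scores so that the level-2 learner never sees a score computed on a sample the base learner was trained on.

```python
import math


def centroid(rows):
    """Column-wise mean of a non-empty list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroid_scorer(X, y, cols):
    """Hypothetical learner: score in [0, 1], higher means closer to class 1."""
    pick = lambda row: [row[c] for c in cols]
    c0 = centroid([pick(r) for r, t in zip(X, y) if t == 0])
    c1 = centroid([pick(r) for r, t in zip(X, y) if t == 1])

    def score(row):
        d0, d1 = math.dist(pick(row), c0), math.dist(pick(row), c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

    return score


def stack_fit(X, y, col_groups, k=5):
    """Two-level stacking: one base learner per feature group, then a meta learner."""
    n = len(X)
    meta = [[0.0] * len(col_groups) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for g, cols in enumerate(col_groups):
            scorer = fit_centroid_scorer(X_tr, y_tr, cols)
            for i in held_out:
                meta[i][g] = scorer(X[i])  # out-of-fold level-1 score
    # Refit base learners on all data; train the meta learner on out-of-fold scores.
    base = [fit_centroid_scorer(X, y, cols) for cols in col_groups]
    meta_scorer = fit_centroid_scorer(meta, y, list(range(len(col_groups))))

    def predict(row):
        z = [b(row) for b in base]
        return 1 if meta_scorer(z) > 0.5 else 0

    return predict


# Toy data: label 1 clusters near (5, 5), label 0 near (0, 0).
y_toy = [i % 2 for i in range(20)]
X_toy = [[t * 5 + (i % 3) * 0.1, t * 5 - (i % 4) * 0.1] for i, t in enumerate(y_toy)]
predict = stack_fit(X_toy, y_toy, col_groups=[[0], [1]], k=5)
```

In a real pipeline each feature group would presumably be one of the feature blocks (behavioral, text, discrete) and the learners would be proper classifiers; the fold logic stays the same.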

Method	Data distribution	Hotel: Precision	Hotel: Recall	Hotel: F₁	Hotel: Accuracy	Restaurant: Precision	Restaurant: Recall	Restaurant: F₁	Restaurant: Accuracy
M_BF+BIGRAM	50:50	82.8	86.9	84.8	85.1	82.8	88.5	85.6	83.3
M_BF+BIGRAM	ND	46.5	82.5	59.4	84.9	48.2	87.9	62.3	78.5
HAAT	50:50	61.3	54.7	57.8	64.4	69.4	59.0	63.8	66.5
HAAT	ND	32.7	53.1	40.8	56.4	35.9	78.9	48.1	68.3
SPEAGLE	50:50	75.7	83.0	79.1	81.0	80.5	83.2	81.8	82.5
SPEAGLE	ND	26.5	56.0	36.0	80.4	48.2	70.5	58.6	82.0
Rescal+BIGRAM	50:50	84.2	89.9	87.0	86.5	86.8	91.8	89.2	89.9
Rescal+BIGRAM	ND	48.2	85.0	61.5	85.9	58.2	90.3	70.8	87.8
Our_Method	50:50	87.3	90.7	88.9	88.8	88.7	93.2	90.9	90.7
Our_Method	ND	52.0	90.0	65.9	86.6	64.6	92.4	76.0	88.3
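
As a quick consistency check on the table, F₁ can be recomputed from precision and recall as 2PR/(P+R), and the gap to the strongest baseline row (Rescal+BIGRAM) can be taken per setting. The sketch below does both for the Our_Method rows; the variable names are made up for this example, and all numbers are copied from the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of Our_Method under the ND setting.
ours_nd = {"hotel": (52.0, 90.0), "restaurant": (64.6, 92.4)}
f1_nd = {name: round(f1(p, r), 1) for name, (p, r) in ours_nd.items()}

# Reported F1 in the order: hotel 50:50, hotel ND, restaurant 50:50, restaurant ND.
ours = [88.9, 65.9, 90.9, 76.0]
rescal_bigram = [87.0, 61.5, 89.2, 70.8]
margins = [round(a - b, 1) for a, b in zip(ours, rescal_bigram)]
```

The recomputed ND values agree with the reported 65.9 and 76.0, and the margins range from 1.7 to 5.2 points, with the larger gains in the ND rows.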

Method	Data distribution	Hotel: Precision	Hotel: Recall	Hotel: F₁	Hotel: Accuracy	Restaurant: Precision	Restaurant: Recall	Restaurant: F₁	Restaurant: Accuracy
Rescal	50:50	83.4	87.0	84.5	84.1	82.8	88.5	85.6	83.3
Rescal	ND	44.3	87.7	58.8	83.6	53.9	88.3	66.9	83.4
Our_Method(SVM)	50:50	85.4	86.0	85.7	85.7	86.8	93.1	89.8	89.5
Our_Method(SVM)	ND	48.6	86.3	62.1	85.0	61.3	92.4	73.7	87.3

Discrete interaction feature data
Sample	FC	MN	ER	RE	AR	IE	MC	AN
A	19.0	3.0	0.14	0.79	1.8	0.0	0.02	5.0
B	0.0	1.0	1.0	0.0	2.6	1.0	0.09	0.0

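Method	Metric	Hotel (50:50)	Hotel (ND)	Restaurant (50:50)	Restaurant (ND)
Without discrete interaction features	F₁	87.2	58.8	90.3	73.1
Without discrete interaction features	Accuracy	86.2	80.1	89.2	86.6
With discrete interaction features (all features)	F₁	88.9	65.9	90.9	76.0
With discrete interaction features (all features)	Accuracy	88.8	86.6	90.7	88.3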
