JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2019, Vol. 54 ›› Issue (7): 57-67. doi: 10.6040/j.issn.1671-9352.1.2018.077

Review spam detection based on the two-level stacking classification model

Xiang-wen LIAO1,2,3,*, Yang XU1,2,3, Jing-jing WEI4, Ding-da YANG1,2,3, Guo-long CHEN1,2,3

    1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, Fujian, China
    2. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, Fujian, China
    3. Digital Fujian Institute of Financial Big Data, Fuzhou 350116, Fujian, China
    4. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, Fujian, China
  • Received:2018-10-17 Online:2019-07-20 Published:2019-06-27
  • Contact: Xiang-wen LIAO


Existing methods for review spam detection face two problems. First, extracting user behavior relationships and training neural networks incur high time and space complexity. Second, the non-standard writing style of e-commerce reviews makes contextual features indistinct, and most experiments do not consider the effect of data imbalance. We therefore propose a review spam detection method based on a two-level stacking classification model. In this method, the relationship between users and products is represented as a triplet. To characterize user behavior while reducing complexity, low-dimensional feature representations are obtained by principal component analysis. Next, the extracted paragraph vector representations, information entropy, and text similarity are represented as discrete features to compensate for the indistinct contextual features. Finally, the three representations are concatenated into an overall feature that combines text and behavioral features, and this feature is used as the input of the two-level stacking classification model to improve performance on unbalanced datasets. Experiments on the Yelp 2013 dataset show that the F1 value of the proposed method is 1.7% to 5.2% higher than that of the state-of-the-art method. In addition, classification performance improves significantly on the unbalanced dataset.
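As a rough illustration of how information entropy and text similarity might be turned into discrete features, the following Python sketch may help. The abstract does not give the paper's exact definitions, so everything here is an assumption: word-level Shannon entropy, Jaccard similarity over word sets, whitespace tokenization, and the bucket thresholds passed to the hypothetical `discretize` helper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def information_entropy(text):
    """Shannon entropy, in bits, of the word distribution of one review."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def jaccard_similarity(text_a, text_b):
    """Jaccard overlap of the two reviews' word sets, in [0, 1] (assumed similarity measure)."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(review, other_reviews):
    """Highest similarity between a review and any other review (0.0 if there are none)."""
    return max((jaccard_similarity(review, o) for o in other_reviews), default=0.0)


def discretize(value, thresholds):
    """Map a continuous value to a bucket index, given ascending thresholds (hypothetical cut points)."""
    return sum(value >= t for t in thresholds)
```

For example, `discretize(max_similarity(review, others), [0.25, 0.75])` would yield a bucket index of 0, 1, or 2 that could stand in for one discrete feature; the cut points are illustrative only and are not taken from the paper.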

Key words: review detection, feature fusion, ensemble learning, principal component analysis

CLC Number: TP391


Review spam detection based on two-layer stacking model

Table 1

Discrete feature



Construction of classification model based on ensemble learning

Table 2

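Review spam detection algorithm based on two-layer stacking classification model

Input: review dataset X{x1, x2, …, xn}, preset parameters
Output: review detection result set Y{y1, y2, …, yn}
5: Concatenate the features F=concatenate{T, D, L} to obtain the overall feature representation;
9: Output the result Y{y1, y2, …, yn}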

Table 3

Dataset statistics

Number of spam reviews        802      8 368
Number of non-spam reviews    4 876    50 149
Total number of reviews       5 678    58 517
Total number of reviewers     5 124    35 593

Table 4

Experimental parameter setting


Table 5

Model evaluation results


Table 6

Comparison of different feature extraction methods


Table 7

Typical discrete features of the two types of reviews


Table 8

Comparison of discrete feature effects



Comparison of F1 values of different classifier models


Impact of datasets with different distributions on F1 values

