JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2018, Vol. 53, Issue (3): 36-45. doi: 10.6040/j.issn.1671-9352.1.2017.093


Text feature extraction method for sentiment analysis based on order-preserving submatrix and frequent sequential pattern mining

CHEN Xin1,2, XUE Yun1,3*, LU Xin1, LI Wan-li1, ZHAO Hong-ya2, HU Xiao-hui1   

  1. School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, Guangdong, China;
  2. Shenzhen Polytechnic, Shenzhen 518055, Guangdong, China;
  3. Guangdong Provincial Engineering Technology Research Center for Data Science, Guangzhou 510006, Guangdong, China
  • Received: 2017-07-04; Online: 2018-03-20; Published: 2018-03-13

Abstract: Feature extraction is a key step in text sentiment analysis and a main factor affecting the classification result. To handle the varied expressions found in online reviews, a synonym-based TF-IDF (term frequency-inverse document frequency) weight vector is built using semantic similarity. Then, because online reviews differ in length, the local patterns of the feature vectors are identified with the OPSM (order-preserving submatrix) biclustering algorithm. We improve the PrefixSpan algorithm to detect frequent classification phrase features, which contain word-order information. In addition, factors such as the separation between words are used to improve the ability to discriminate sentiment orientation. Finally, the proposed method is applied to a sentiment analysis experiment on product reviews, and the results show that the proposed text feature extraction achieves better performance.

Key words: feature extraction, biclustering, frequent phrase feature, sentiment analysis

CLC Number: 

  • TP391
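
The abstract describes mining frequent phrase features that preserve word order with a PrefixSpan-based method. The following is a minimal sketch of the standard PrefixSpan idea for tokenized reviews, not the authors' improved algorithm: it omits their modifications, including any constraint on the separation between words. The function name `prefixspan`, the toy `reviews` data, and the `min_support` value are assumptions made for illustration only.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern (tuple of words): support} for every word
    subsequence that occurs in at least `min_support` sequences.

    Plain PrefixSpan for sequences of single items; gaps of any length
    are allowed between consecutive pattern words.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds one (sequence index, start offset) pair per
        # sequence that contains `prefix`; the suffix from the offset
        # onward is the projected database entry for that sequence.
        counts = defaultdict(set)
        for seq_id, start in projected:
            for word in sequences[seq_id][start:]:
                counts[word].add(seq_id)

        for word, seq_ids in counts.items():
            if len(seq_ids) < min_support:
                continue
            pattern = prefix + (word,)
            results[pattern] = len(seq_ids)

            # Project each sequence on the first occurrence of `word`
            # at or after its current offset.
            new_projected = []
            for seq_id, start in projected:
                try:
                    pos = sequences[seq_id].index(word, start)
                except ValueError:
                    continue  # this sequence does not contain `word`
                new_projected.append((seq_id, pos + 1))
            grow(pattern, new_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy tokenized reviews (hypothetical data, for illustration only).
reviews = [
    ["screen", "is", "very", "good"],
    ["screen", "very", "good"],
    ["battery", "is", "not", "good"],
]
patterns = prefixspan(reviews, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(" ".join(pattern), support)
```

In this toy run the ordered phrase "screen very good" is found in two reviews even though one of them has an intervening word, which is the kind of word-order feature the abstract refers to. A constraint on the distance between matched words would be a natural place to add the separation factor mentioned there, but how the authors do so is not specified in this abstract.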