Journal of Shandong University (Natural Science), 2018, Vol. 53, Issue 3: 36-45. doi: 10.6040/j.issn.1671-9352.1.2017.093

Text feature extraction method for sentiment analysis based on order-preserving submatrix and frequent sequential pattern mining

CHEN Xin1,2, XUE Yun1,3*, LU Xin1, LI Wan-li1, ZHAO Hong-ya2, HU Xiao-hui1

  1. School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, Guangdong, China;
  2. Shenzhen Polytechnic, Shenzhen 518055, Guangdong, China;
  3. Guangdong Provincial Engineering Technology Research Center for Data Science, Guangzhou 510006, Guangdong, China
  • Received: 2017-07-04 Online: 2018-03-20 Published: 2018-03-13
  • Corresponding author: XUE Yun (born 1975), male, Ph.D., professor; research interests: natural language processing, sentiment analysis, and personalized recommendation. E-mail: xueyun@scnu.edu.cn
  • About the author: CHEN Xin (born 1992), male, master's student; research interests: natural language processing, sentiment analysis, and personalized recommendation. E-mail: chenxin@m.scnu.edu.cn
  • Funding:
    National Statistical Science Research Project (2016LY98); Guangdong Provincial Science and Technology Plan Projects (2016A010101020, 2016A010101021, 2016A010101022); Basic Research Project of the Shenzhen Science and Technology Innovation Commission (JCYJ20160527172144272); projects of the Guangdong Provincial Engineering Technology Research Center for Data Science (2016KF09, 2016KFl0); Research Project of Guangdong Polytechnic of Science and Technology (XJSC2016206); Graduate Innovation Program of South China Normal University (2015lkxm37)

Abstract: Feature extraction is one of the key steps in text sentiment analysis and a main factor affecting its results. To handle the varied forms of expression in online review sentences, a synonym TF-IDF (term frequency-inverse document frequency) weight vector is computed with the help of semantic similarity. Because review sentences differ in length, the OPSM (order-preserving submatrix) biclustering algorithm is then used to mine local patterns in the weight vectors. An improved PrefixSpan algorithm mines class-specific frequent phrase features, which make effective use of word order information; constraints such as the gap between words further improve the ability of the frequent phrases to discriminate sentiment orientation. Finally, the method is applied to a product review corpus in sentiment analysis experiments, and the results show that the extracted text features give a considerable improvement.

Key words: feature extraction, biclustering, frequent phrase feature, sentiment analysis
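As a rough illustration of the synonym TF-IDF weighting mentioned in the abstract, the sketch below merges synonyms into one canonical term before computing weights. It is not the authors' implementation: the abstract says synonyms are obtained from semantic similarity, whereas here the mapping `synonym_of` is a hand-written, hypothetical stand-in, and the unsmoothed idf `log(N / df)` is an assumed choice. The function name `synonym_tfidf` is also hypothetical.

```python
import math
from collections import Counter

def synonym_tfidf(docs, synonym_of):
    """Return one {term: weight} dict per document, after merging synonyms."""
    # Map each token to its canonical synonym; unmapped tokens stay as they are.
    merged = [[synonym_of.get(tok, tok) for tok in doc] for doc in docs]
    n_docs = len(merged)
    # Document frequency: number of documents containing each canonical term.
    df = Counter(term for doc in merged for term in set(doc))
    vectors = []
    for doc in merged:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Hypothetical toy reviews: "display"/"screen" and "nice"/"great" are treated as synonyms.
docs = [["screen", "great"], ["display", "nice"], ["battery", "bad"]]
vectors = synonym_tfidf(docs, {"display": "screen", "nice": "great"})
```

With the mapping applied, the first two reviews receive identical weight vectors even though they share no surface words, which is the effect the synonym step is meant to achieve.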

CLC number: TP391
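The abstract describes mining frequent phrases with a limit on the gap between words. The details of the improved PrefixSpan algorithm are not given on this page, so the following is only a minimal pattern-growth sketch of the general idea, under assumptions: a pattern is an ordered tuple of words, at most `max_gap` other words may sit between consecutive pattern words, and support is the number of sentences containing the pattern. The names `mine_gapped_patterns`, `max_gap`, and `max_len` are hypothetical. To obtain class-specific phrases, one could run it separately on the sentences of each sentiment class.

```python
from collections import defaultdict

def mine_gapped_patterns(sequences, min_support, max_gap, max_len=3):
    """Return {pattern tuple: support} for gap-constrained frequent patterns."""
    results = {}

    def grow(pattern, ends):
        # ends: {sequence index: set of positions where `pattern` can end}
        extensions = defaultdict(lambda: defaultdict(set))
        for sid, positions in ends.items():
            seq = sequences[sid]
            for pos in positions:
                # The next word may skip at most `max_gap` words after `pos`.
                for nxt in range(pos + 1, min(len(seq), pos + 2 + max_gap)):
                    extensions[seq[nxt]][sid].add(nxt)
        for item, new_ends in extensions.items():
            if len(new_ends) >= min_support:
                new_pattern = pattern + (item,)
                results[new_pattern] = len(new_ends)
                if len(new_pattern) < max_len:
                    grow(new_pattern, new_ends)

    # Length-1 patterns: every occurrence of a word is a possible end position.
    start = defaultdict(lambda: defaultdict(set))
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            start[item][sid].add(pos)
    for item, ends in start.items():
        if len(ends) >= min_support:
            results[(item,)] = len(ends)
            if max_len > 1:
                grow((item,), ends)
    return results

# Hypothetical tokenized reviews.
reviews = [
    ["not", "very", "good"],
    ["not", "good"],
    ["very", "good", "price"],
    ["not", "at", "all", "really", "good"],
]
patterns = mine_gapped_patterns(reviews, min_support=2, max_gap=1)
```

The sketch tracks every end position of a pattern rather than only the first match, because with a gap limit a later occurrence of a word can allow an extension that the earliest occurrence does not.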