Journal of Shandong University (Natural Science) ›› 2018, Vol. 53 ›› Issue (3): 36-45. doi: 10.6040/j.issn.1671-9352.1.2017.093
CHEN Xin1,2, XUE Yun1,3*, LU Xin1, LI Wan-li1, ZHAO Hong-ya2, HU Xiao-hui1
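Abstract: Feature extraction is a key step in text sentiment analysis and a major factor in the quality of its results. Because online review sentences express the same meaning in many different forms, synonym TF-IDF (term frequency-inverse document frequency) weight vectors are computed with the help of semantic similarity. Because review sentences vary in length, the OPSM (order-preserving submatrix) biclustering algorithm is used to mine local patterns from the weight vectors. An improved PrefixSpan algorithm is then used to mine frequent phrase features for classification; these features make effective use of word order, and constraints such as the gap between words improve the ability of frequent phrases to discriminate sentiment polarity. Finally, the method is applied to a product review corpus in sentiment analysis experiments, and the results show that the extracted text features improve performance considerably.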
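The abstract does not spell out how PrefixSpan is modified, so the following is only a minimal sketch of the general idea of mining ordered frequent phrases under a word-gap constraint. It is not the authors' algorithm. The function name `mine_phrases`, the parameters `min_support`, `max_gap`, and `max_len`, and the toy reviews are all assumptions made for illustration.

```python
from collections import defaultdict


def mine_phrases(sequences, min_support=2, max_gap=1, max_len=3):
    """Return {phrase tuple: support} for ordered word patterns.

    Hypothetical helper: consecutive words of a pattern may be separated
    by at most `max_gap` other words in the review they come from.
    Support is the number of reviews containing the pattern.
    """
    results = {}

    def grow(prefix, occurrences):
        # occurrences maps review index -> set of positions where the
        # last word of `prefix` can end in that review.
        if len(prefix) >= max_len:
            return
        extensions = defaultdict(lambda: defaultdict(set))
        for sid, ends in occurrences.items():
            seq = sequences[sid]
            for end in ends:
                # Only look at the next (max_gap + 1) positions.
                stop = min(end + 2 + max_gap, len(seq))
                for pos in range(end + 1, stop):
                    extensions[seq[pos]][sid].add(pos)
        for word, occ in extensions.items():
            if len(occ) >= min_support:
                pattern = prefix + (word,)
                results[pattern] = len(occ)
                grow(pattern, occ)

    # Seed the search with frequent single words.
    seeds = defaultdict(lambda: defaultdict(set))
    for sid, seq in enumerate(sequences):
        for pos, word in enumerate(seq):
            seeds[word][sid].add(pos)
    for word, occ in seeds.items():
        if len(occ) >= min_support:
            results[(word,)] = len(occ)
            grow((word,), occ)
    return results


# Toy, already-segmented reviews (made up for this example).
reviews = [
    ["screen", "is", "very", "good"],
    ["screen", "very", "good"],
    ["battery", "not", "good"],
]
phrases = mine_phrases(reviews, min_support=2, max_gap=1)
loose_phrases = mine_phrases(reviews, min_support=2, max_gap=2)
```

With `max_gap=1`, the pattern `("screen", "good")` is rejected because the two words are too far apart in the first review. Relaxing the limit to `max_gap=2` admits it, which shows how a gap constraint filters out loosely connected word pairs.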