
Journal of Shandong University (Natural Science) ›› 2014, Vol. 49 ›› Issue (12): 18-22. doi: 10.6040/j.issn.1671-9352.3.2014.295

• Articles •

Short text classification based on word embedding similarity

MA Cheng-long, JIANG Ya-song, LI Yan-ling, ZHANG Yan, YAN Yong-hong

  1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2014-08-28 Revised: 2014-10-17 Online: 2014-12-20 Published: 2014-12-20
  • About the author: MA Cheng-long (born 1989), male, Ph.D. candidate; research interest: natural language processing. E-mail: machenglong@hccl.ioa.ac.cn
  • Supported by:
    National Natural Science Foundation of China (11161140319, 91120001, 61271426); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030100, XDA06030500); National High Technology Research and Development Program of China (863 Program) (2012AA012503); Key Deployment Project of the Chinese Academy of Sciences (KGZD-EW-103-2)

Abstract: Short texts on the Web are brief and share few words with one another, so many out-of-vocabulary (OOV) words appear during classification and degrade performance. To address this, a classification method based on word embedding similarity is proposed. First, word embeddings are trained on unlabeled data with an unsupervised method. Second, each OOV word in the test set is expanded with similar words from the training data, found by computing the similarity between word embeddings. Comparison with the baseline system shows that the proposed method improves classification accuracy by 1%-2% throughout, and by more than 10% (relative) when training data are scarce.

Key words: short text classification, out of vocabulary, word embedding similarity
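
The OOV expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, whether an OOV word is replaced or supplemented, or any cutoff, so cosine similarity, substitution by the single nearest training-set word, and the `threshold` parameter are all assumptions made here. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_oov(tokens, train_vocab, embeddings, threshold=0.5):
    """Map each OOV token to its most similar training-set word.

    Tokens already in train_vocab, or without an embedding, are kept.
    `threshold` is an assumed cutoff, not a value from the paper.
    """
    candidates = [w for w in train_vocab if w in embeddings]
    result = []
    for tok in tokens:
        if tok in train_vocab or tok not in embeddings or not candidates:
            result.append(tok)
            continue
        # Nearest in-vocabulary word by cosine similarity.
        best = max(candidates, key=lambda w: cosine(embeddings[tok], embeddings[w]))
        if cosine(embeddings[tok], embeddings[best]) >= threshold:
            result.append(best)
        else:
            result.append(tok)
    return result


# Toy embeddings invented for illustration; real ones would be trained
# on unlabeled data, as the abstract describes.
embeddings = {
    "movie": [0.9, 0.1],
    "film": [0.88, 0.15],
    "stock": [0.1, 0.95],
    "shares": [0.15, 0.9],
}
train_vocab = {"movie", "stock"}

print(expand_oov(["film", "shares", "today"], train_vocab, embeddings))
# → ['movie', 'stock', 'today']
```

In this sketch, "film" and "shares" are absent from the training vocabulary and are mapped to "movie" and "stock", which a classifier trained on the labeled data can use; "today" has no embedding and is left unchanged.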

CLC number:

  • TP391