Journal of Shandong University (Natural Science), 2014, Vol. 49, Issue (12): 18-22. doi: 10.6040/j.issn.1671-9352.3.2014.295
马成龙, 姜亚松, 李艳玲, 张艳, 颜永红
MA Cheng-long, JIANG Ya-song, LI Yan-ling, ZHANG Yan, YAN Yong-hong
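Abstract: Short texts on the Internet are brief and share few words with one another, so classification tends to encounter a large number of out-of-vocabulary (OOV) words, which degrades classification performance. To address this, a classification method based on word-vector similarity is proposed. Word vectors are first trained on unlabeled data with an unsupervised method, and the OOV words that appear in the test set are then expanded using the similarity between word vectors. Comparison with the baseline system shows that the classification accuracy of the proposed method is consistently 1% to 2% higher than the baseline; in particular, when little training data is available, the accuracy of the proposed method improves by more than 10% in relative terms.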
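The OOV expansion step described in the abstract can be illustrated with a minimal sketch. It assumes that "expansion" means mapping each OOV test word to its most similar training-vocabulary word by cosine similarity; the abstract does not specify the exact rule. The function names `cosine` and `expand_oov`, the `threshold` parameter, and the toy two-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_oov(tokens, train_vocab, vectors, threshold=0.5):
    """Map each OOV token to its nearest in-vocabulary word.

    Assumption: an OOV token is replaced by the single most similar
    training word if the similarity exceeds `threshold`; otherwise the
    token is dropped. The paper may use a different expansion rule.
    """
    expanded = []
    for tok in tokens:
        if tok in train_vocab:
            expanded.append(tok)  # known word: keep unchanged
            continue
        if tok not in vectors:
            continue  # no word vector available: cannot expand
        best_word, best_sim = None, threshold
        for cand in train_vocab:
            if cand not in vectors:
                continue
            sim = cosine(vectors[tok], vectors[cand])
            if sim > best_sim:
                best_word, best_sim = cand, sim
        if best_word is not None:
            expanded.append(best_word)
    return expanded


# Toy word vectors standing in for vectors trained on unlabeled data.
vectors = {
    "movie": [0.9, 0.1],
    "film": [0.85, 0.2],
    "stock": [0.1, 0.9],
    "shares": [0.15, 0.95],
}
train_vocab = {"movie", "stock"}

print(expand_oov(["film", "shares", "zzz"], train_vocab, vectors))
# ['movie', 'stock']
```

In this toy setting, "film" and "shares" never occur in the training vocabulary but are mapped to "movie" and "stock", so a classifier trained only on the latter can still use them as features. In practice the vectors would come from an unsupervised model trained on unlabeled text, as the abstract describes.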