Short text classification based on word embedding similarity

doi:10.6040/j.issn.1671-9352.3.2014.295

Abstract

Because Web short texts are brief and share few words, many out-of-vocabulary (OOV) words appear, and these words make text classification more difficult. To address this problem, a new general framework based on word embedding similarity is proposed. First, word embeddings are learned from unlabeled data with an unsupervised method. Second, each OOV word is extended with similar words from the training data, found by computing the similarity between word embeddings. Comparison with the baseline system shows that the proposed method improves performance by 1%-2%, and by more than 10% on a small training set.
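The second step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure is used or whether similar words are appended or substituted, so this sketch assumes cosine similarity and simple substitution. The function names, the `threshold` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_oov(tokens, train_vocab, embeddings, threshold=0.5):
    """Map each OOV token to its most similar training-vocabulary word.

    A token is left unchanged if it is already in the training vocabulary,
    has no embedding, or has no candidate above `threshold`
    (the threshold is an assumption of this sketch, not from the paper).
    """
    result = []
    for tok in tokens:
        if tok in train_vocab or tok not in embeddings:
            result.append(tok)
            continue
        best_word, best_sim = tok, threshold
        for cand in train_vocab:
            if cand not in embeddings:
                continue
            sim = cosine(embeddings[tok], embeddings[cand])
            if sim > best_sim:
                best_word, best_sim = cand, sim
        result.append(best_word)
    return result


# Toy embeddings standing in for vectors learned from unlabeled data.
embeddings = {
    "film": [0.9, 0.1, 0.0],
    "movie": [0.85, 0.15, 0.05],
    "stock": [0.0, 0.2, 0.9],
    "shares": [0.05, 0.25, 0.85],
}
train_vocab = {"movie", "stock", "review"}

print(replace_oov(["film", "review", "shares", "xyz"], train_vocab, embeddings))
# → ['movie', 'review', 'stock', 'xyz']
```

Here "film" and "shares" are unseen in training but have close neighbours there, so the classifier receives features it already knows; "xyz" has no embedding and passes through unchanged.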

Key words: short text classification, out of vocabulary, word embedding similarity

CLC Number: TP391

MA Cheng-long, JIANG Ya-song, LI Yan-ling, ZHANG Yan, YAN Yong-hong. Short text classification based on word embedding similarity [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2014, 49(12): 18-22.


