Chinese word representation learning based on Gaussian distribution and Chinese character component characteristics

doi:10.6040/j.issn.1671-9352.1.2020.032

Abstract

Abstract: This paper uses a density-based distributed embedding and gives a method for learning word representations in a space of Gaussian distributions. Such representations better capture the uncertainty about a word's representation and its relations, and they express the asymmetry between words more naturally than dot-product or cosine similarity. In addition, to exploit the characteristics of Chinese characters, the semantic information of the components (sub-characters) that make up each character is incorporated into word representation training. Compared with existing methods, the model performs better on tasks such as word similarity and downstream tasks, and it better expresses the uncertainty of words.

Key words: word representation learning, Gaussian distribution, Chinese character components, semantic uncertainty
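The asymmetry mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which score the model uses to compare two distributions, so the example below assumes the KL divergence between diagonal-covariance Gaussians, a common choice in Gaussian embedding work. The function names and the toy means and variances are hypothetical and are not taken from the paper.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two Gaussians with diagonal covariance.

    Assumption: KL divergence is used here only as an illustrative
    asymmetric score; the paper's abstract does not name its energy function.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def cosine(u, v):
    """Cosine similarity of two point vectors (symmetric by construction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings: (mean, per-dimension variance).
# A general word gets a large variance, a specific word a small one.
broad_mu, broad_var = [0.5, 0.5], [4.0, 4.0]
narrow_mu, narrow_var = [1.0, 0.5], [0.25, 0.25]

kl_narrow_to_broad = kl_diag_gaussians(narrow_mu, narrow_var, broad_mu, broad_var)
kl_broad_to_narrow = kl_diag_gaussians(broad_mu, broad_var, narrow_mu, narrow_var)

print("KL(narrow || broad):", kl_narrow_to_broad)
print("KL(broad || narrow):", kl_broad_to_narrow)
print("cosine of the means:", cosine(narrow_mu, broad_mu))
```

In this toy setup the two KL values differ depending on the direction of comparison, whereas the cosine similarity of the means is the same in both directions. The variances also give each word an explicit measure of spread, which a single point vector cannot carry.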

CLC number: TP391

YI Jie, ZHONG Mao-sheng, LIU Gen, WANG Ming-wen. Chinese word representation learning based on Gaussian distribution and Chinese character component characteristics[J]. Journal of Shandong University (Natural Science), 2021, 56(5): 85-91.


