New word identification based on large-scale corpus

SHI Shui-cai, YU Hong-kui, LV Xue-qiang, LI Yu-qin

Chinese Information Processing and Research Center, Beijing Information Science & Technology Univ., Beijing 100101, China

Received: 2006-03-29 Online: 2006-10-24 Published: 2006-10-24
Contact: SHI Shui-cai

Abstract

Abstract: Based on the distinguishing characteristics of new words, a complete method for detecting them automatically is proposed. Large-scale statistical analysis of a corpus is used to build character, word, and N-gram dictionaries, from which candidate new words are detected using string frequency statistics and substring reduction. The candidates are then filtered further according to word-formation rules, yielding the new words in the corpus. A system implemented with this approach can extract new words without restrictions on length or domain.

Key words: new word, catchword, corpus
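
The pipeline named in the abstract (string frequency statistics, substring reduction, dictionary and rule-based filtering) might be sketched as follows. This is only an illustrative sketch, not the authors' system: the function names, the `min_freq` threshold, the `max_n` limit, and the use of a stop-character set as a stand-in for word-formation rules are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4):
    """Count every character n-gram of length 2..max_n (string frequency statistics)."""
    counts = Counter()
    for s in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def reduce_substrings(counts):
    """Drop any string that occurs exactly as often as a string one character longer
    that contains it, since it then never appears on its own (substring reduction)."""
    absorbed = set()
    for gram, freq in counts.items():
        for sub in (gram[:-1], gram[1:]):
            if counts.get(sub) == freq:
                absorbed.add(sub)
    return {g: f for g, f in counts.items() if g not in absorbed}


def find_new_words(sentences, known_words, stop_chars, min_freq=2, max_n=4):
    """Return candidate new words with their frequencies.

    known_words: existing word dictionary (assumed input).
    stop_chars:  characters that may not start or end a word; a simplified,
                 hypothetical stand-in for the word-formation rules.
    """
    counts = count_ngrams(sentences, max_n)
    frequent = {g: f for g, f in counts.items() if f >= min_freq}
    reduced = reduce_substrings(frequent)
    return {
        g: f
        for g, f in reduced.items()
        if g not in known_words
        and g[0] not in stop_chars
        and g[-1] not in stop_chars
    }
```

For example, calling `find_new_words(["他写微博客了", "我看微博客", "微博客火了"], known_words={"博客"}, stop_chars={"了"}, max_n=3)` keeps only the repeated string 微博客, because its substrings 微博 and 博客 never occur outside it.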

SHI Shui-cai, YU Hong-kui, LV Xue-qiang, LI Yu-qin. New word identification based on large-scale corpus[J]. J4, 2006, 41(3): 42-45.

