The study on automatic classification of digital documents of scientific papers

LI Sen¹, MA Jun¹, ZHAO Yan¹, LEI Jing-sheng¹,²

1. School of Computer Science and Technology, Shandong University, Jinan 250061, Shandong, China

Received: 2006-03-29  Online: 2006-10-24  Published: 2006-10-24
Corresponding author: LI Sen

Abstract

Scientific papers are usually semi-structured documents, so a hierarchical classification model based on their metadata is proposed, where the metadata include the title, keyword set, abstract, and similar fields. Experiments show that classification using only the metadata reaches a precision close to that of traditional classification based on the full text. Furthermore, precision exceeds that of the best known classification algorithm when papers are classified in two stages against a taxonomy derived from domain knowledge: first, the metadata are used to classify papers coarsely at the higher levels of the taxonomy; then the full texts are used to classify them at the lower levels. Because the metadata are far smaller than the full text, and each coarse class contains far fewer papers than the whole collection, the model greatly shortens classification time when the number of classes is large and the documents are distributed fairly evenly across the taxonomy.

Key words: efficiency, accuracy, hierarchy, text categorization, technical literature
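
The two-stage scheme described in the abstract (coarse classification on metadata, then fine classification on full text within the chosen top-level category) might be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the underlying classifier, so a tiny multinomial Naive Bayes is used as a stand-in, and the names `NaiveBayes`, `TwoStageClassifier`, the dictionary fields, and the toy papers are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens (stand-in classifier)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.doc_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            prior = math.log(self.doc_counts[label] / total_docs)
            return prior + sum(math.log((counts[t] + 1) / denom) for t in tokens)

        return max(self.doc_counts, key=log_posterior)


class TwoStageClassifier:
    """Stage 1: metadata -> top-level category. Stage 2: full text -> leaf class."""

    def fit(self, papers):
        self.coarse = NaiveBayes().fit(
            [p["metadata"] for p in papers], [p["top"] for p in papers]
        )
        by_top = defaultdict(list)
        for p in papers:
            by_top[p["top"]].append(p)
        # One fine classifier per top-level category, trained only on its own papers.
        self.fine = {
            top: NaiveBayes().fit(
                [p["full_text"] for p in group], [p["leaf"] for p in group]
            )
            for top, group in by_top.items()
        }
        return self

    def predict(self, metadata, full_text):
        top = self.coarse.predict(metadata)
        return top, self.fine[top].predict(full_text)


# Made-up toy corpus; "metadata" stands for title + keywords + abstract.
TOY_PAPERS = [
    {"metadata": "query index database", "top": "cs", "leaf": "databases",
     "full_text": "sql query optimizer index join transaction"},
    {"metadata": "routing network protocol", "top": "cs", "leaf": "networks",
     "full_text": "packet routing tcp congestion network latency"},
    {"metadata": "group ring algebra", "top": "math", "leaf": "algebra",
     "full_text": "group ring homomorphism ideal field"},
    {"metadata": "integral limit analysis", "top": "math", "leaf": "analysis",
     "full_text": "integral limit convergence series continuity"},
]

model = TwoStageClassifier().fit(TOY_PAPERS)
print(model.predict("database index", "join transaction query"))
# -> ('cs', 'databases')
```

In this sketch the short metadata string is compared against every top-level category, while the longer full text is compared only against the leaf classes of the one category chosen in the first stage. That mirrors the efficiency argument in the abstract, which holds best when papers are spread fairly evenly over many classes.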

LI Sen, MA Jun, ZHAO Yan, LEI Jing-sheng. The study on automatic classification of digital documents of scientific papers[J]. J4, 2006, 41(3): 81-84.

