
Journal of Shandong University (Natural Science) ›› 2019, Vol. 54 ›› Issue (3): 93-101. doi: 10.6040/j.issn.1671-9352.1.2018.051


  • Author profile: ZENG Xue-qiang (1978– ), male, doctoral candidate, professor; research interest: data dimension reduction. E-mail: xqzeng@jxnu.edu.cn
  • Supported by: the National Natural Science Foundation of China (61463033, 61866017); the Jiangxi Province Outstanding Youth Talent Support Program (20171BCB23013); and the Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ150354)

A chunk increment partial least square algorithm

ZENG Xue-qiang1,2, YE Zhen-lin1, ZUO Jia-li2, WAN Zhong-ying2, WU Shui-xiu2   

  1. Information Engineering School, Nanchang University, Nanchang 330031, Jiangxi, China;
    2. School of Computer & Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
  • Published: 2019-03-19


Abstract: Incremental learning is an effective and efficient technique for mining large-scale data. Incremental partial least square (IPLS), an improved partial least square (PLS) method based on incremental learning, offers competitive dimension-reduction performance. However, IPLS must update the model once for every new training sample, which makes training time-consuming in online learning settings. To overcome this problem, we propose an extension of IPLS called chunk incremental partial least square (CIPLS), which processes a chunk of training samples at a time and thereby greatly reduces the update frequency. Comparative experiments on the K8 version of the p53 cancer rescue mutants data set and the Reuters-21578 text classification corpus show that CIPLS is much more efficient than IPLS without sacrificing dimension-reduction performance.
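The chunk-update idea described in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' CIPLS derivation: it assumes a single, pre-centered response (PLS1) and maintains cross-product sufficient statistics that are updated once per chunk instead of once per sample, which is where the reduction in update frequency comes from. The class name `ChunkIncrementalPLS1` and its methods are hypothetical.

```python
import numpy as np

class ChunkIncrementalPLS1:
    """Chunk-wise incremental PLS1 sketch (single response, centered data).

    Sufficient statistics (X^T y and X^T X) are accumulated once per chunk,
    so the number of model updates drops from n (one per sample, as in IPLS)
    to n / chunk_size.
    """

    def __init__(self, n_features):
        self.xty = np.zeros(n_features)                # running X^T y
        self.xtx = np.zeros((n_features, n_features))  # running X^T X
        self.n = 0                                     # samples seen so far

    def partial_fit(self, X_chunk, y_chunk):
        # One update per chunk of samples, not per individual sample.
        self.xty += X_chunk.T @ y_chunk
        self.xtx += X_chunk.T @ X_chunk
        self.n += X_chunk.shape[0]

    def first_component(self):
        # The first PLS1 weight vector is the normalized X^T y direction.
        return self.xty / np.linalg.norm(self.xty)
```

Because the accumulated cross-products are exactly the sums over all samples seen, the first weight vector recovered after chunk-wise updates matches the one computed from the full batch, regardless of chunk size.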

Key words: incremental learning, partial least square, data chunk, dimension reduction

CLC number: 

  • TP311
[1] WOLD S. Principal component analysis[J]. Chemometrics & Intelligent Laboratory Systems, 1987, 2(1): 37-52.
[2] LANDAUER T K, FOLTZ P W, LAHAM D. Introduction to latent semantic analysis[J]. Discourse Processes, 1998, 25(2/3): 259-284.
[3] BOULESTEIX A L. PLS dimension reduction for classification with microarray data[J]. Statistical Applications in Genetics and Molecular Biology, 2004, 3(1): 1-30.
[4] ZENG X Q, LI G Z, YANG J Y, et al. Dimension reduction with redundant gene elimination for tumor classification[J]. BMC Bioinformatics, 2008, 9(Suppl 6): S8.
[5] YAN J, ZHANG B, LIU N, et al. Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(3):320-333.
[6] LI Xue, JIANG Shuqiang. Incremental learning and object recognition system based on intelligent HCI: a survey[J]. CAAI Transactions on Intelligent Systems, 2017, 12(2): 140-149. (in Chinese)
[7] BU Fanyu, CHEN Zhikui, ZHANG Qingchen. Incremental updating method for big data feature learning[J]. Computer Engineering and Applications, 2015, 51(12): 21-26. (in Chinese)
[8] OZAWA S, PANG S, KASABOV N. Online feature extraction for evolving intelligent systems[M] //OZAWA S, PANG S, KASABOV N. eds. Evolving Intelligent Systems. Hoboken: John Wiley & Sons, Inc., 2010: 151-171.
[9] WENG J Y, ZHANG Y L, HWANG W S. Candid covariance-free incremental principal component analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003, 25(8): 1034-1040.
[10] ZENG X Q, LI G Z. Dimension reduction for p53 protein recognition by using incremental partial least squares[J]. IEEE Transactions on NanoBioscience, 2014, 13(2): 73-79.
[11] HIRAOKA K, HIDAI K, HAMAHIRA M, et al. Successive learning of linear discriminant analysis: Sanger-type algorithm[C] //International Conference on Pattern Recognition. Barcelona: IEEE, 2000: 664-667.
[12] PANG S, OZAWA S, KASABOV N. Incremental linear discriminant analysis for classification of data streams[J]. IEEE Transactions on Systems, Man and Cybernetics: Part B (Cybernetics), 2005, 35(5): 905-914.
[13] OZAWA S, PANG S, KASABOV N. Incremental learning of chunk data for online pattern classification systems[J]. IEEE Transactions on Neural Networks, 2008, 19(6): 1061-1074.
[14] ZENG Xueqiang, ZHAO Bingjuan, XIANG Run, et al. Partial least squares based facial age estimation[J]. Journal of Nanchang University (Engineering & Technology), 2017, 39(4): 380-385. (in Chinese)
[15] MARTÍNEZ J L, SAULO H, ESCOBAR H B, et al. A new model selection criterion for partial least squares regression[J]. Chemometrics and Intelligent Laboratory Systems, 2017, 169: 64-78.
[16] HELLAND I S. On the structure of partial least squares regression[J]. Communications in Statistics - Simulation and Computation, 1988, 17(2): 581-607.
[17] DE JONG S. SIMPLS: an alternative approach to partial least squares regression[J]. Chemometrics and Intelligent Laboratory Systems, 1993, 18(3): 251-263.
[18] DANZIGER S A, BARONIO R, HO L, et al. Predicting positive p53 cancer rescue regions using most informative positive(MIP)active learning[J]. PLOS Computational Biology, 2009, 5(9): e1000498.
[19] HTUN P T, KHAING K T. Important roles of data mining techniques for anomaly intrusion detection system[J]. International Journal of Advanced Research in Computer Engineering & Technology, 2013, 2(5): 1850-1854.
[20] WITTEN I, FRANK E. Data mining: practical machine learning tools and techniques[J]. ACM Sigmod Record, 2005, 31(1): 76-77.
[21] YANG Y, LIU X. A re-examination of text categorization methods [C] // Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley: ACM Press, 1999: 42-49.