
Journal of Shandong University (Natural Science) ›› 2025, Vol. 60 ›› Issue (1): 14-28. doi: 10.6040/j.issn.1671-9352.4.2023.0212


Fusing matrix factorization and space partition microbial data augmentation algorithm

WEN Liuying, WU Jun, MIN Fan

  1. School of Computer and Software, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Published: 2025-01-10
  • About the author: WEN Liuying (born 1983), female, associate professor, Ph.D.; research interests: machine learning, imbalanced learning, and microbial informatics. E-mail: wenliuying1983@163.com
  • Funding:
    Special Project of Central Government Funds for Guiding Local Science and Technology Development (2021ZYD0003)

Abstract: To address the intra-class imbalance, inter-class imbalance, and high sparsity of microbial data, a data augmentation algorithm that fuses matrix factorization and space partition is proposed. Matrix factorization decomposes the original data space into an object subspace and a feature subspace, from which a latent space representation is extracted. The object subspace is then divided into multiple data subspaces to alleviate intra-class imbalance. To address inter-class imbalance, synthetic samples are generated in each data subspace and filtered by Euclidean distance to retain high-quality samples. Experiments are conducted on 9 microbial data sets, and performance is compared with 9 sampling algorithms. The results show that the samples generated by the proposed algorithm have a clear advantage in diversity and that more positive samples are identified when multiple classifiers are used.

Key words: matrix factorization, space partition, intra-class imbalance, inter-class imbalance, object subspace, feature subspace
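
The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the factorization, the partitioning rule, the sample generator, or the filter criterion, so the sketch assumes a truncated SVD, a recursive median split along a random projection, SMOTE-style interpolation between minority points, and a nearest-neighbour Euclidean check. All names and parameters (`factorize`, `partition`, `augment`, `k`, `max_leaf`, `per_leaf`) are hypothetical.

```python
import numpy as np


def factorize(X, k):
    """Truncated SVD (assumed stand-in): X ~= P @ Q.

    P (n x k) plays the role of the object subspace,
    Q (k x d) the role of the feature subspace.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]


def partition(P, idx, max_leaf, rng):
    """Recursively split row indices at the median of a random projection
    (assumed partitioning rule) until each leaf holds at most max_leaf rows."""
    if len(idx) <= max_leaf:
        return [idx]
    direction = rng.standard_normal(P.shape[1])
    order = np.argsort(P[idx] @ direction)
    half = len(idx) // 2
    left, right = idx[order[:half]], idx[order[half:]]
    return partition(P, left, max_leaf, rng) + partition(P, right, max_leaf, rng)


def augment(X, y, k=2, max_leaf=8, per_leaf=4, seed=0):
    """Return synthetic minority samples (label 1) in the original feature space."""
    rng = np.random.default_rng(seed)
    P, Q = factorize(X, k)
    synthetic = []
    for leaf in partition(P, np.arange(len(X)), max_leaf, rng):
        minority = leaf[y[leaf] == 1]
        if len(minority) < 2:
            continue  # interpolation needs two minority points in the same leaf
        for _ in range(per_leaf):
            a, b = rng.choice(minority, size=2, replace=False)
            # SMOTE-style interpolation in the latent (object) subspace
            p_new = P[a] + rng.random() * (P[b] - P[a])
            # Euclidean filter (assumed rule): keep the candidate only if
            # its nearest real sample in latent space is a minority sample
            nearest = np.argmin(np.linalg.norm(P - p_new, axis=1))
            if y[nearest] == 1:
                synthetic.append(p_new @ Q)  # map back through the feature subspace
    return np.array(synthetic).reshape(-1, X.shape[1])
```

Calling `augment(X, y)` with a NumPy feature matrix `X` and a 0/1 label vector `y` returns an array of synthetic rows with the same number of columns as `X`. Partitioning before generation keeps each interpolation local to one data subspace, which is one plausible reading of how the method limits intra-class imbalance.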

CLC number:

  • TP391