
Journal of Shandong University (Natural Science) ›› 2025, Vol. 60 ›› Issue (1): 63-73. doi: 10.6040/j.issn.1671-9352.4.2023.0213


Monocular 3D object detection algorithm combining depth guidance and multi-scale channel attention mechanism

LIU Qing1, LI Wei1*, YU Shaoyong2, SONG Yuping3, ZHOU Qidi1, ZOU Weilin1

  1. School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, Fujian, China;
    2. School of Mathematics and Information Engineering, Longyan University, Longyan 364012, Fujian, China;
    3. School of Mathematical Sciences, Xiamen University, Xiamen 361005, Fujian, China
  • Published: 2025-01-10
  • Corresponding author: LI Wei (1979— ), male, associate professor, PhD; research interests: machine learning, granular computing, image processing, and information security. E-mail: drweili@hotmail.com
  • About the first author: LIU Qing (1998— ), female, master's student; research interests: deep learning and 3D object detection. E-mail: 2783687819@qq.com
  • Supported by:
    the Humanities and Social Sciences Research Planning Fund of the Ministry of Education (23YJAZH067), the China Scholarship Council Program (202308350042), the Industry-University-Research Project of the Xiamen Science and Technology Bureau (2023CXY0409), and the Graduate Education and Teaching Reform Research Project of Xiamen University of Technology (YJS20220617)

Abstract: Because monocular images lack essential spatial structure cues, it is highly challenging to estimate 3D bounding boxes accurately from a single image. To address this problem, a monocular 3D object detection algorithm based on depth guidance and a multi-scale channel attention mechanism is proposed. To introduce 3D information and effectively capture and exploit the spatial information of feature maps at different scales, the feature extraction module first pre-processes the monocular image features and the depth-map features separately with a multi-scale split attention algorithm, and then uses a channel-wise attention module to calibrate the corresponding feature vectors by their weights, producing refined feature maps that are richer in multi-scale information and have stronger representational capacity. A depth-guided dynamic local convolution network then applies the depth-map features, which carry the spatial structure cues, as location-specific convolution kernels to the monocular image features; using depth as guidance in this way reduces the error accumulation of direct feature fusion and alleviates the scale-sensitivity problem of monocular vision, in which nearby objects appear large and distant objects appear small. The model's performance is assessed and compared using various evaluation metrics. Experimental results demonstrate that, compared with other algorithms, the proposed method improves the average precision of 3D object detection for cars, pedestrians and cyclists on the autonomous driving dataset.
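The full implementation is not reproduced on this page. Purely as an illustration of the kind of multi-scale split and channel-wise attention recalibration described in the abstract, a minimal PyTorch-style sketch is given below; the class name, group count, kernel sizes and reduction ratio are assumptions for illustration, not the authors' code.

    import torch
    import torch.nn as nn

    class PyramidSplitChannelAttention(nn.Module):
        """Sketch: split channels into groups, convolve each group with a
        different kernel size (multi-scale), recalibrate each group with
        SE-style channel attention, and softmax-reweight across the groups."""
        def __init__(self, channels, kernel_sizes=(3, 5, 7, 9), reduction=4):
            super().__init__()
            assert channels % len(kernel_sizes) == 0
            self.splits = len(kernel_sizes)
            group_ch = channels // self.splits
            self.convs = nn.ModuleList(
                nn.Conv2d(group_ch, group_ch, k, padding=k // 2) for k in kernel_sizes
            )
            self.se = nn.ModuleList(
                nn.Sequential(
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(group_ch, group_ch // reduction, 1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(group_ch // reduction, group_ch, 1),
                )
                for _ in kernel_sizes
            )

        def forward(self, x):
            b, c, h, w = x.shape
            groups = torch.chunk(x, self.splits, dim=1)                  # channel split
            feats = [conv(g) for conv, g in zip(self.convs, groups)]     # multi-scale features
            attn = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)
            attn = torch.softmax(attn, dim=1)                            # compete across scales
            feats = torch.stack(feats, dim=1) * attn                     # channel-wise recalibration
            return feats.reshape(b, c, h, w)

    # usage (hypothetical shapes): PyramidSplitChannelAttention(64)(torch.randn(2, 64, 48, 160))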

Key words: monocular 3D object detection, depth guidance, multi-scale channel-wise attention mechanism, autonomous driving
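The depth-guided dynamic local convolution is likewise only described at a high level here. One common way to realise such a mechanism is to predict a small kernel for every spatial location from the depth branch and apply it to the corresponding neighbourhood of the image branch; the sketch below illustrates that idea under assumed tensor shapes (all names and hyper-parameters are hypothetical, not the paper's design).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DepthGuidedDynamicLocalConv(nn.Module):
        """Sketch: a kernel-generating branch predicts a k x k filter for every
        spatial location from the depth features; that filter is applied to the
        k x k neighbourhood of the image features at the same location."""
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.k = kernel_size
            # per-location kernel weights predicted from the depth branch
            self.kernel_gen = nn.Conv2d(channels, kernel_size * kernel_size, 3, padding=1)

        def forward(self, img_feat, depth_feat):
            b, c, h, w = img_feat.shape
            k = self.k
            # depth-derived kernels, normalised per location: (b, k*k, h, w)
            kernels = torch.softmax(self.kernel_gen(depth_feat), dim=1)
            # k x k neighbourhoods of the image features: (b, c, k*k, h, w)
            patches = F.unfold(img_feat, k, padding=k // 2).view(b, c, k * k, h, w)
            # weight every neighbourhood by its depth-derived kernel and sum
            return (patches * kernels.unsqueeze(1)).sum(dim=2)           # (b, c, h, w)

    # usage (hypothetical shapes): both branches share (batch, channels, H, W)
    # out = DepthGuidedDynamicLocalConv(64)(torch.randn(2, 64, 48, 160), torch.randn(2, 64, 48, 160))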

CLC number: 

  • TP391