Monocular 3D object detection algorithm combining depth guidance and multi-scale channel attention mechanism

doi:10.6040/j.issn.1671-9352.4.2023.0213

Abstract

To address the difficulty of accurately estimating 3D bounding boxes from monocular images, which lack spatial cues, this paper proposes a monocular 3D object detection algorithm based on depth guidance and a multi-scale channel attention mechanism. To introduce 3D information and to effectively capture and exploit the spatial information of feature maps at different scales, the feature extraction module applies a multi-scale (pyramid) split attention algorithm to extract multi-scale pre-processed feature maps from the monocular image and from the depth map separately, and then calibrates their weights with a channel attention algorithm, which improves the representational power of the feature maps. A depth-guided dynamic local convolution network then uses the depth-map features, which carry multi-scale information, as specific convolution kernels for the monocular image features. Introducing 3D information as guidance in this way reduces the error accumulation caused by direct fusion and addresses the scale sensitivity of monocular vision, in which nearby objects appear large and distant objects appear small. The model's performance is evaluated and compared using several evaluation metrics. Experimental results show that, compared with other algorithms, the proposed algorithm improves the average precision of 3D object detection for cars, pedestrians, and cyclists on an autonomous driving dataset.

Key words: monocular 3D object detection, depth guidance, multi-scale channel attention mechanism, autonomous driving
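The two mechanisms named in the abstract, channel-wise weight calibration and depth-guided dynamic local filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `channel_attention` and `depth_guided_local_filter`, the parameter-free sigmoid gate, the single-channel maps, the 3x3 window, and the zero padding are all assumptions chosen to keep the example short and self-contained.

```python
import math


def channel_attention(features):
    """Reweight channels in the squeeze-and-excitation style (assumed, simplified).

    features: list of C channels, each an H x W nested list of floats.
    """
    # Squeeze: global average pooling reduces each channel to one scalar.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: a plain sigmoid gate per channel (no learned layers here).
    gates = [1.0 / (1.0 + math.exp(-p)) for p in pooled]
    # Scale: every value in a channel is multiplied by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(features, gates)]


def depth_guided_local_filter(image_feat, depth_feat, k=3):
    """Filter image_feat with per-pixel kernels read from depth_feat.

    Both inputs are H x W nested lists; positions outside the map count as zero.
    """
    h, w = len(image_feat), len(image_feat[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        # The weight at this offset comes from the depth feature,
                        # so the kernel differs at every spatial location.
                        acc += depth_feat[yy][xx] * image_feat[yy][xx]
            out[y][x] = acc / (k * k)
    return out
```

In this sketch the depth features act as the filter rather than being concatenated with the image features, which is the distinction the abstract draws between guided filtering and direct fusion. A learned version would replace the fixed sigmoid gate and the raw depth values with trained layers.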

CLC number: TP391

LIU Qing, LI Wei, YU Shaoyong, SONG Yuping, ZHOU Qidi, ZOU Weilin. Monocular 3D object detection algorithm combining depth guidance and multi-scale channel attention mechanism[J]. Journal of Shandong University (Natural Science), 2025, 60(1): 63-73.

