DIF-DETR: Dynamic Interactive Fusion Transformer with Adaptive Feature Enhancement for Efficient Aerial Small Object Detection

Jing Wang; Hejiang Li; Caihong Huangfu

doi:10.54097/b3psbw85

Keywords:

Object Detection, Multi-scale Feature Enhancement, Transformer Architecture, UAV Images

Abstract

In recent years, Transformer-based object detection models have performed well in general scenarios thanks to their strong global feature modeling. When applied directly to aerial image detection, however, their performance often falls short of expectations. The root cause lies in the nature of aerial imagery, which typically contains many small objects. These objects occupy very few pixels, so their feature representation is weak, and they are easily affected by complex background noise and by interference among densely distributed targets. As a result, Transformer models struggle to capture and distinguish small-object features. To address these challenges, this paper proposes an enhanced Transformer architecture for aerial small object detection: Dynamic Interactive Fusion DETR (DIF-DETR). Its core innovations comprise two aspects. First, it introduces DIENet, a backbone feature extraction network built from DIEBlocks. These DIEBlocks serve as feature enhancement units within the backbone, using dynamic Inception-style multi-branch depthwise convolutions and an adaptive weight allocation mechanism to efficiently capture multi-scale, long-range contextual information. Second, it introduces Context-Aware Bidirectional Fusion (CABF), which adaptively fuses complementary high-level semantic features and low-level detail features within the FPN-PAN architecture of the neck network, mitigating the tendency of small-target features to be obscured by background interference. Experiments on the challenging VisDrone and HIT-UAV aerial datasets show that DIF-DETR outperforms existing mainstream models, reaching 30.5% mAP and 82.3% mAPtest, respectively. At the same time, it reduces computational cost to 43.6 GFLOPs with only 13.4M parameters, striking a favorable balance between detection accuracy and computational efficiency. These results indicate that the combined effect of its core innovations improves detection accuracy and robustness for small objects in aerial images, providing an effective solution for object detection in aerial scenarios.
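The general idea behind a dynamically weighted multi-branch depthwise block can be illustrated with a small sketch. This is not the authors' DIEBlock, whose layer details the abstract does not give. The 1D signals, the two kernel sizes, the linear gate over pooled channel means, and all function names below are assumptions chosen only to keep the example dependency-free.

```python
import math

def depthwise_conv1d(x, kernels):
    """Filter each channel with its own odd-length kernel (zero 'same' padding).

    As in most deep learning libraries, this is cross-correlation:
    the kernel is not flipped.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(channel))
        ])
    return out

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_multibranch(x, branch_kernels, gate_weights):
    """Fuse several depthwise branches with input-dependent weights.

    x: list of channels, each a list of floats.
    branch_kernels: for each branch, one kernel per channel.
    gate_weights: gate_weights[b][c] maps channel means to the logit
        of branch b (a hypothetical, minimal gating function).
    """
    # Global average pooling: one descriptor per channel.
    means = [sum(ch) / len(ch) for ch in x]
    # One logit per branch from a linear map of the pooled descriptor.
    logits = [sum(w * m for w, m in zip(ws, means)) for ws in gate_weights]
    weights = softmax(logits)
    # Run every branch, then take the weighted sum of their outputs.
    branches = [depthwise_conv1d(x, ks) for ks in branch_kernels]
    fused = [
        [sum(w * br[c][i] for w, br in zip(weights, branches))
         for i in range(len(x[c]))]
        for c in range(len(x))
    ]
    return fused, weights

# Two channels, two branches with different receptive fields (assumed sizes 3 and 5).
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
branch_kernels = [
    [[0.0, 1.0, 0.0]] * 2,  # 3-tap kernels that pass the input through
    [[0.2] * 5] * 2,        # 5-tap averaging kernels
]
gate_weights = [[0.5, 0.0], [0.0, 0.5]]
fused, weights = dynamic_multibranch(x, branch_kernels, gate_weights)
```

Because the branch weights come from a softmax over statistics of the input itself, the mix of small and large receptive fields changes from one input to the next, which is the sense in which such a block is "dynamic".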
