Towards Efficient Retail Text Localization: A Lightweight Network with Saliency-Guided Sparse Attention

Zili Li

doi:10.54097/j3zk3g61

Keywords:

Retail Packaging Text Detection, Lightweight Object Detection, YOLO Architecture, Context-Guided Downsampling, Sparse Attention, Multi-Scale Feature Fusion

Abstract

Text detection on retail product packaging is crucial for intelligent label auditing and smart retail applications. Deep learning models, particularly the YOLO series, deliver strong real-time inference, but detecting text in complex retail scenes remains difficult because of extreme scale variation, dense text distributions, and severe background interference. Repeated downsampling in standard YOLO architectures tends to degrade the fine-grained features of very small text, and conventional global attention mechanisms sharply increase computational cost. To address these issues, this paper proposes an enhanced lightweight, real-time object detector optimized for retail packaging text. First, to prevent the loss of small text details, we propose a Context Guide Downsample block that jointly aggregates local, surrounding, and global contextual descriptors while reducing spatial resolution. Second, to avoid the quadratic computational cost of traditional self-attention, we introduce a Dynamic Saliency-guided Sparse Attention (DSSA) module into the encoder. DSSA adaptively filters out redundant background regions and establishes efficient long-range dependencies across text regions. Finally, a Multi-scale Feature Mapping module replaces conventional feature concatenation in the neck network, using a scale-aware non-linear modulation strategy to align and fuse heterogeneous hierarchical semantics. Extensive experiments on the Food-Product-Image dataset demonstrate that the proposed model achieves a superior trade-off between detection accuracy and inference speed, outperforming existing state-of-the-art lightweight detectors on complex text localization tasks.
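To make the cost argument behind sparse attention concrete, the following is a minimal sketch of generic top-k sparse attention in plain Python. It is not the DSSA module itself, whose design is not specified in this abstract. The function name `sparse_attention`, the externally supplied `saliency` scores, and the use of the raw tokens as queries, keys, and values (no learned projections) are all assumptions made for illustration. Each of the N tokens attends only to the k most salient tokens, so the number of query-key dot products falls from N × N to N × k.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(tokens, saliency, k):
    """Hypothetical top-k sparse attention (illustrative, not the paper's DSSA).

    tokens   : list of N feature vectors (lists of floats, equal length)
    saliency : list of N scores; how they are computed is an assumption here
    k        : number of salient tokens kept as keys/values
    Returns (outputs, kept_indices).
    """
    n = len(tokens)
    k = min(k, n)
    # Keep the k highest-saliency tokens; ties go to the lower index.
    keep = sorted(range(n), key=lambda i: (-saliency[i], i))[:k]
    keys = [tokens[i] for i in keep]
    scale = 1.0 / math.sqrt(len(tokens[0]))

    outputs = []
    for q in tokens:  # N queries, each scored against only k keys
        weights = softmax([dot(q, key) * scale for key in keys])
        outputs.append(
            [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(len(q))]
        )
    return outputs, keep


# Toy input: tokens 0 and 1 stand in for "text" regions with high saliency.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [2.0, 2.0]]
saliency = [0.9, 0.8, 0.05, 0.7]
outputs, keep = sparse_attention(tokens, saliency, k=2)
print(keep)                    # indices of the tokens used as keys/values
print(len(tokens) * len(keep)) # dot products computed, versus N * N for dense attention
```

In this toy input, every output vector is a weighted average of the two retained tokens only, so low-saliency background tokens contribute nothing to the result. A practical module would learn the saliency scores and the query, key, and value projections rather than take them as given.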

