MS-VSSM: Multiscale Enhanced Visual State Space Model for Facial Expression Recognition
DOI: https://doi.org/10.54097/9ee3rb64

Keywords: Facial Expression Recognition, Visual State Space Model, Multi-scale Feature Fusion

Abstract
Facial Expression Recognition (FER) is a key technology in fields such as human-computer interaction and mental health assessment, but its performance is constrained by the subtle and dynamic nature of expressions, as well as by individual differences and environmental variations. To overcome the limited receptive fields of traditional Convolutional Neural Networks (CNNs) and the high computational complexity of Transformers, this paper proposes a Multiscale Enhanced Visual State Space Model (MS-VSSM) built on the Visual State Space Model (VSSM), aiming to improve the accuracy and robustness of FER. The model introduces three core improvements over VSSM: (1) a path-aware channel attention mechanism (SE-SS2D) integrated into the SS2D module to sharpen the capture of critical local facial features; (2) a Dense Spatial Pyramid Pooling (DSPP) module embedded at the beginning of each network stage to fuse multi-scale contextual information; and (3) a Layer Scale mechanism that finely calibrates the scaling of deep features, improving training stability and representational flexibility. On a four-category emotion recognition task constructed from the DEAP dataset, MS-VSSM achieves an accuracy of 97.31% and a weighted average F1-score of 97.56%, significantly outperforming the original VSSM and several mainstream visual backbones. These results validate the effectiveness and superiority of the proposed method, offering an efficient and accurate solution for fine-grained facial expression recognition.
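To make the three improvements concrete, minimal PyTorch sketches follow. The paper's code is not shown here, so all module names, hyperparameters, and wiring are illustrative assumptions. The first sketch shows a standard Squeeze-and-Excitation channel gate (Hu et al., 2018) and a hypothetical "path-aware" wrapper that assigns one gate to each of the four directional scan paths of SS2D, in the spirit of the SE-SS2D component described in the abstract:

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Squeeze-and-Excitation channel gate (Hu et al., 2018):
    global average pool -> bottleneck MLP -> sigmoid -> channel reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze to (B, C), then excite
        return x * w.view(b, c, 1, 1)             # reweight channels

class PathAwareSE(nn.Module):
    """Hypothetical wrapper (not from the paper): one SE gate per directional
    scan path of SS2D, so each traversal order learns its own channel
    weighting before the path outputs are merged."""
    def __init__(self, channels: int, num_paths: int = 4, reduction: int = 16):
        super().__init__()
        self.gates = nn.ModuleList(
            SEGate(channels, reduction) for _ in range(num_paths)
        )

    def forward(self, paths):
        # paths: list of (B, C, H, W) features, one per scan direction
        return torch.stack([g(p) for g, p in zip(self.gates, paths)]).sum(dim=0)
```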
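The second sketch illustrates one plausible reading of Dense Spatial Pyramid Pooling: pooling at several window sizes, with each later branch consuming the concatenation of the input and all earlier branch outputs ("dense" connectivity), fused by a 1x1 convolution. The pool sizes and the dense wiring are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSPP(nn.Module):
    """Dense Spatial Pyramid Pooling sketch (design assumed, not from the paper)."""
    def __init__(self, channels: int, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList()
        in_ch = channels
        for _ in pool_sizes:
            self.branches.append(nn.Conv2d(in_ch, channels, kernel_size=1))
            in_ch += channels                     # dense: branch inputs accumulate
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [x]
        for size, conv in zip(self.pool_sizes, self.branches):
            inp = torch.cat(feats, dim=1)         # dense concatenation of all prior features
            pooled = F.adaptive_avg_pool2d(inp, (max(h // size, 1), max(w // size, 1)))
            out = F.interpolate(conv(pooled), size=(h, w),
                                mode="bilinear", align_corners=False)
            feats.append(out)                     # expose this scale to later branches
        return self.fuse(torch.cat(feats, dim=1))
```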
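The third sketch is the standard Layer Scale mechanism from Touvron et al., "Going deeper with Image Transformers" (CaiT): a learnable per-channel scale on each residual branch, initialized to a small value so that deep blocks start near-identity and train stably. Its placement inside MS-VSSM blocks is assumed:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Layer Scale (Touvron et al., CaiT): learnable per-channel scaling of a
    residual branch, initialized small for stable deep-network training."""
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim), channel-last features as used in VMamba-style blocks
        return self.gamma * x

# Assumed usage inside a residual block (block internals are placeholders):
#   x = x + layer_scale(ss2d_block(norm(x)))
```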
License
Copyright (c) 2026 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.