MS-VSSM: Multiscale Enhanced Visual State Space Model for Facial Expression Recognition
DOI: https://doi.org/10.54097/9ee3rb64

Keywords: Facial Expression Recognition, Visual State Space Model, Multi-scale Feature Fusion

Abstract
Facial Expression Recognition (FER) is a key technology in fields such as human-computer interaction and mental health assessment, but its performance is constrained by the subtle and dynamic nature of expressions, as well as by individual differences and environmental variations. To overcome the limited receptive fields of traditional Convolutional Neural Networks (CNNs) and the high computational complexity of Transformers, this paper proposes a Multiscale Enhanced Visual State Space Model (MS-VSSM) built on the Visual State Space Model (VSSM), aiming to improve the accuracy and robustness of FER. The model introduces three core improvements over VSSM: (1) a path-aware channel attention mechanism (SE-SS2D) integrated into the SS2D module to sharpen the capture of critical local facial features; (2) a Dense Spatial Pyramid Pooling (DSPP) module embedded at the beginning of each network stage to fuse multi-scale contextual information; and (3) a Layer Scale mechanism that finely calibrates the scaling of deep features, improving training stability and representational flexibility. On a four-category emotion recognition task constructed from the DEAP dataset, MS-VSSM achieves an accuracy of 97.31% and a weighted average F1-score of 97.56%, significantly outperforming the original VSSM and several mainstream visual backbones. These results validate the effectiveness and superiority of the proposed method, offering an efficient and accurate solution for fine-grained facial expression recognition.
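To make the three improvements concrete, minimal PyTorch sketches follow. The paper's code is not shown here, so all module names, hyperparameters, and wiring are illustrative assumptions. The first sketch shows a standard Squeeze-and-Excitation channel gate (Hu et al., 2018) and a hypothetical "path-aware" wrapper that assigns one gate to each of the four directional scan paths of SS2D, in the spirit of the SE-SS2D component described in the abstract:

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Squeeze-and-Excitation channel gate (Hu et al., 2018):
    global average pool -> bottleneck MLP -> sigmoid -> channel reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze to (B, C), then excite
        return x * w.view(b, c, 1, 1)             # reweight channels

class PathAwareSE(nn.Module):
    """Hypothetical wrapper (not from the paper): one SE gate per directional
    scan path of SS2D, so each traversal order learns its own channel
    weighting before the path outputs are merged."""
    def __init__(self, channels: int, num_paths: int = 4, reduction: int = 16):
        super().__init__()
        self.gates = nn.ModuleList(
            SEGate(channels, reduction) for _ in range(num_paths)
        )

    def forward(self, paths):
        # paths: list of (B, C, H, W) features, one per scan direction
        return torch.stack([g(p) for g, p in zip(self.gates, paths)]).sum(dim=0)
```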
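The second sketch illustrates one plausible reading of Dense Spatial Pyramid Pooling: pooling at several window sizes, with each later branch consuming the concatenation of the input and all earlier branch outputs ("dense" connectivity), fused by a 1x1 convolution. The pool sizes and the dense wiring are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSPP(nn.Module):
    """Dense Spatial Pyramid Pooling sketch (design assumed, not from the paper)."""
    def __init__(self, channels: int, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList()
        in_ch = channels
        for _ in pool_sizes:
            self.branches.append(nn.Conv2d(in_ch, channels, kernel_size=1))
            in_ch += channels                     # dense: branch inputs accumulate
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [x]
        for size, conv in zip(self.pool_sizes, self.branches):
            inp = torch.cat(feats, dim=1)         # dense concatenation of all prior features
            pooled = F.adaptive_avg_pool2d(inp, (max(h // size, 1), max(w // size, 1)))
            out = F.interpolate(conv(pooled), size=(h, w),
                                mode="bilinear", align_corners=False)
            feats.append(out)                     # expose this scale to later branches
        return self.fuse(torch.cat(feats, dim=1))
```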
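The third sketch is the standard Layer Scale mechanism from Touvron et al., "Going deeper with Image Transformers" (CaiT): a learnable per-channel scale on each residual branch, initialized to a small value so that deep blocks start near-identity and train stably. Its placement inside MS-VSSM blocks is assumed:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Layer Scale (Touvron et al., CaiT): learnable per-channel scaling of a
    residual branch, initialized small for stable deep-network training."""
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim), channel-last features as used in VMamba-style blocks
        return self.gamma * x

# Assumed usage inside a residual block (block internals are placeholders):
#   x = x + layer_scale(ss2d_block(norm(x)))
```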
License
Copyright (c) 2026 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.