A Lightweight Multimodal Feature Alignment Framework for Depression Detection
DOI: https://doi.org/10.54097/2g478s98
Keywords: Multimodal learning, Feature alignment, Lightweight model, Depression severity estimation, PHQ-8
Abstract
With the rapid advancement of artificial intelligence, computer vision, speech recognition, and natural language processing, automatic depression detection based on multimodal data has attracted increasing research attention. Compared with unimodal approaches, multimodal fusion leverages complementary information from speech, text, facial expressions, and other behavioral cues, thereby improving the accuracy and robustness of depression assessment. However, existing multimodal models are often computationally intensive and parameter-heavy, which limits their deployment on resource-constrained devices. In addition, semantic and distributional discrepancies across modalities pose significant challenges for effective feature alignment, adversely affecting fusion performance. To address these issues, this paper proposes a lightweight multimodal feature alignment framework for depression severity estimation. The proposed method constructs lightweight feature extraction networks and introduces a cross-modal feature alignment mechanism to enable effective mapping and fusion across heterogeneous feature spaces. While significantly reducing model size and computational complexity, the framework maintains competitive predictive performance. Experimental results on multiple public depression datasets demonstrate that the proposed approach achieves a mean absolute error (MAE) of 4.44 and a root mean square error (RMSE) of 5.77 in PHQ-8 score estimation, indicating strong generalization capability and practical deployment potential.
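To make the described pipeline concrete, the sketch below shows one way lightweight per-modality encoders, a shared embedding space with an alignment term, and a PHQ-8 regression head could be wired together. This is a minimal illustration only, not the authors' implementation: the choice of PyTorch, the encoder sizes, the cosine-based alignment loss, concatenation fusion, the 0.1 loss weight, and the input feature dimensions are all assumptions for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightEncoder(nn.Module):
    # Small MLP standing in for a lightweight modality-specific feature network.
    def __init__(self, in_dim, hidden_dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class MultimodalPHQ8Regressor(nn.Module):
    # Encodes audio and text features, aligns them in a shared space,
    # and regresses a PHQ-8 score (0-24) from the fused representation.
    def __init__(self, audio_dim, text_dim, shared_dim=64):
        super().__init__()
        self.audio_enc = LightweightEncoder(audio_dim, out_dim=shared_dim)
        self.text_enc = LightweightEncoder(text_dim, out_dim=shared_dim)
        self.regressor = nn.Sequential(
            nn.Linear(2 * shared_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, audio_feat, text_feat):
        a = F.normalize(self.audio_enc(audio_feat), dim=-1)
        t = F.normalize(self.text_enc(text_feat), dim=-1)
        # Alignment term: pull paired audio/text embeddings together (cosine distance).
        align_loss = (1.0 - (a * t).sum(dim=-1)).mean()
        phq8 = self.regressor(torch.cat([a, t], dim=-1)).squeeze(-1)
        return phq8, align_loss

# Toy usage with random tensors in place of real acoustic/text descriptors (dimensions assumed).
model = MultimodalPHQ8Regressor(audio_dim=88, text_dim=768)
audio = torch.randn(8, 88)
text = torch.randn(8, 768)
target = torch.randint(0, 25, (8,)).float()

pred, align_loss = model(audio, text)
loss = F.l1_loss(pred, target) + 0.1 * align_loss  # regression objective plus alignment weight (assumed)
mae = (pred - target).abs().mean()
rmse = torch.sqrt(((pred - target) ** 2).mean())

The MAE and RMSE computed in the last two lines correspond to the evaluation metrics reported in the abstract (4.44 and 5.77, respectively, on the paper's benchmarks).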
License
Copyright (c) 2026 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.