Single-Person 3D Human Pose Estimation Based on Deep Learning: A Review

Authors

  • Aoyu Xia
  • Shouming Hou
  • Zixuan Lu
  • Weibo Yang

DOI:

https://doi.org/10.54097/28jqyv71

Keywords:

Deep Learning; Single-Person 3D Human Pose Estimation; Single-View; Multi-View.

Abstract

With the rapid development of deep learning, human pose estimation has become an active research topic in computer vision. This paper systematically reviews recent advances in deep-learning-based single-person 3D human pose estimation, focusing on the main challenges faced by single-view and multi-view approaches, such as estimation accuracy, occlusion, and the effective use of depth information. First, the paper examines single-view methods that operate on single-frame images and on video data. It then surveys the state of research on multi-view human pose estimation and discusses in detail how these methods handle occlusion and fuse data across views. The paper also reviews commonly used datasets and evaluation metrics, and compares the performance of various methods on standard datasets. Finally, it outlines future research directions for single-person 3D human pose estimation, particularly improving estimation accuracy and handling pose variation and occlusion in complex scenes.

Published

27-02-2025
