Implementation Paths of Generative AI in Multimodal Learning
DOI: https://doi.org/10.54097/rwr0vn28

Keywords: Generative AI, multimodal learning, modal alignment, cross-modal generation, pre-trained model, generation quality assessment

Abstract
Current AI applications in complex scenarios mostly process a single type of information, such as text or images, and cannot meet people's need to perceive the world the way they do. Multimodal learning, which integrates heterogeneous information such as text, images, audio, and video, has therefore become a key direction for the development of generative AI. However, multimodal applications of generative AI still face several hurdles. First, the modalities differ greatly in form: text is discrete and segmented, while images are continuous pixel arrays, which makes them difficult to combine. Second, the generated content often fails to match the intended meaning of the multimodal representation. Third, the quality of cross-modal generation is unstable, sometimes good and sometimes poor. This paper outlines an implementation path across five key aspects: data foundation, feature processing, model architecture, training strategy, and quality assessment. First, multimodal data are preprocessed and standardized to establish a solid data foundation; cross-modal feature alignment is then used to bridge modality differences; the generative model architecture is adapted to support cross-modal generation; multimodal pre-training and incremental learning extend the model to a wider range of scenarios; and finally, a scientific quality-assessment system is employed to optimize the generated results. This research aims to provide usable technical logic for deploying generative AI in multimodal scenarios such as intelligent interaction, content creation, and medical diagnosis, and to advance multimodal generation from "being able to generate" to "generating well".
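As a concrete illustration of the cross-modal feature alignment step described above, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive objective: text and image embeddings are normalized into a shared space, and matched pairs are pulled together while mismatched pairs are pushed apart. All names (`contrastive_alignment_loss`, the toy embeddings, the temperature value) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    is the average cross-entropy in both directions (text->image, image->text).
    """
    t = l2_normalize(text_feats)
    v = l2_normalize(image_feats)
    logits = t @ v.T / temperature                 # pairwise cosine similarities, scaled
    n = len(t)

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)    # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Toy demo: identical embeddings align almost perfectly; random pairings do not.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_alignment_loss(emb, emb)
mismatched = contrastive_alignment_loss(emb, rng.normal(size=(4, 8)))
```

In a real training loop the two encoders would be learned networks and this loss would be minimized by gradient descent; the sketch only shows why aligned pairs score lower loss than mismatched ones.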
Copyright (c) 2025 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.