Analysis of the Integration Strategies of LLM and VLM Models with the Transformer Architecture

Authors

  • Yao Zhang

DOI:

https://doi.org/10.54097/3fhs5d75

Keywords:

LLM, VLM, Transformer architecture, Integration strategies

Abstract

With the rapid development of artificial intelligence, the Transformer architecture has become the core framework of natural language processing (NLP) and the multimodal domain. This paper studies strategies for integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) within the Transformer architecture. It first introduces the basic principles and characteristics of the Transformer architecture and of LLM and VLM models, then analyzes the advantages and challenges of different fusion strategies, and demonstrates their practical effect on multimodal tasks through application cases such as visual question answering (VQA) and image caption generation. The results show that, by optimizing model structure, training strategy, and data processing, the integration of LLMs and VLMs with the Transformer architecture can significantly improve performance on both language and visual tasks, offering new ideas and methods for the development of multimodal artificial intelligence.
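One common fusion strategy alluded to above is cross-attention, in which text tokens (e.g. a VQA question) attend over image patch embeddings inside a Transformer block. The following is a minimal illustrative sketch, not taken from the paper: the projection matrices are random stand-ins for learned weights, and the single-head, residual-only block omits layer normalization and feed-forward sublayers for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_tokens, image_patches):
    """Fuse image features into text features via single-head cross-attention.

    text_tokens:   (T, d) text token embeddings (queries)
    image_patches: (P, d) image patch embeddings (keys/values)
    Returns fused (T, d) representations.
    """
    d = text_tokens.shape[-1]
    rng = np.random.default_rng(0)
    # Random projections stand in for the learned W_q, W_k, W_v weights.
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = text_tokens @ W_q                  # (T, d) queries from text
    K = image_patches @ W_k                # (P, d) keys from image patches
    V = image_patches @ W_v                # (P, d) values from image patches

    attn = softmax(Q @ K.T / np.sqrt(d))   # (T, P) text-to-image attention
    fused = attn @ V                       # (T, d) image-conditioned features
    # Residual connection, as in a standard Transformer block.
    return text_tokens + fused

# Example: 6 question tokens attend over 16 image patches.
text = np.random.default_rng(1).standard_normal((6, 32))
image = np.random.default_rng(2).standard_normal((16, 32))
out = cross_attention_fuse(text, image)
```

In a real VLM the same pattern appears with learned weights, multiple heads, and stacked layers; the key design point is that the text stream supplies queries while the visual stream supplies keys and values, so language representations become conditioned on the image.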




Published

27-03-2025

Section

Articles