Multimodal Large Model Guided Diffusion Model for Transformer Oil Leakage Image Generation
DOI:
https://doi.org/10.54097/wx2t2298
Keywords:
Transformer oil leakage, Diffusion model, Industrial defect image generation
Abstract
Transformer oil leakage is a critical defect in power equipment inspection, yet its automatic detection remains challenging because real leakage samples are scarce, highly irregular in shape, variable in color, and easily confused with water stains, rust, shadows, and complex structural backgrounds. To alleviate the data scarcity problem and improve downstream segmentation performance, this paper proposes a multimodal semantic-heuristic diffusion framework, termed MSH-Diff, for automatic transformer oil-leakage image synthesis. The proposed framework exploits the visual reasoning capability and prior industrial knowledge of a multimodal large language model to identify potential leakage-prone components, such as flanges, valves, and radiator fin roots, through expert-role prompt engineering. The recognized component semantics are further mapped to spatial Gaussian anchors by analyzing cross-attention responses in the latent diffusion model, enabling leakage-center localization without manually drawn masks or key-point annotations. In addition, MSH-Diff automatically constructs dense scene descriptions involving illumination, viewpoint, and surface material, which are encoded as semantic constraints to enhance structural consistency and physical realism during diffusion sampling. Experimental results demonstrate that MSH-Diff achieves competitive image quality and diversity, with an FID of 38.14 and an IC-L score of 0.29. When the generated samples are incorporated into downstream semantic segmentation training, the mIoU of DeepLabV3+ increases from 60.09% to 62.31%, confirming the effectiveness of the proposed framework for industrial defect data augmentation.
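The abstract does not say how cross-attention responses are converted into spatial Gaussian anchors. One plausible reading is plain moment matching: treat the normalized attention map of a component token as a probability mass over pixels, then take its mean as the leakage center and its covariance as the anchor spread. The sketch below illustrates that reading only. The function names `gaussian_anchor` and `render_anchor` are hypothetical, and the input is assumed to be a single 2D non-negative attention map already extracted from the diffusion model.

```python
import numpy as np


def gaussian_anchor(attn, eps=1e-8):
    """Fit a 2D Gaussian to an attention map by moment matching.

    Assumption: `attn` is an (H, W) array of attention responses for one
    component token. Returns the center (x, y) and a 2x2 covariance.
    """
    attn = np.clip(np.asarray(attn, dtype=np.float64), 0.0, None)
    weights = attn / (attn.sum() + eps)  # normalize to a probability mass

    height, width = attn.shape
    ys, xs = np.mgrid[0:height, 0:width]

    # First moment: attention-weighted centroid.
    mu = np.array([(weights * xs).sum(), (weights * ys).sum()])

    # Second central moments: attention-weighted covariance.
    dx, dy = xs - mu[0], ys - mu[1]
    cov = np.array([
        [(weights * dx * dx).sum(), (weights * dx * dy).sum()],
        [(weights * dx * dy).sum(), (weights * dy * dy).sum()],
    ])
    cov += eps * np.eye(2)  # keep the matrix invertible for flat maps
    return mu, cov


def render_anchor(mu, cov, shape):
    """Render an unnormalized Gaussian heatmap (peak value 1 at `mu`)."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    diff = np.stack([xs - mu[0], ys - mu[1]], axis=-1)  # (H, W, 2)
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every pixel from the center.
    maha = np.einsum("...i,ij,...j->...", diff, inv_cov, diff)
    return np.exp(-0.5 * maha)
```

Calling `gaussian_anchor` on an attention map and passing the result to `render_anchor` yields a soft spatial prior that could stand in for a manually drawn mask. Whether MSH-Diff uses this exact fitting procedure is not stated in the abstract.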
License
Copyright (c) 2026 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.








