Multimodal Large Model Guided Diffusion Model for Transformer Oil Leakage Image Generation

Wenqing Zhao; Cen Yang; Xi Chen; Jian Shi; An Bo; Zenghua Ji; Xi Chen; Congcong Ma; Jing Teng; Leipeng Zuo; Jieshi Qi; Shuang Liang; Dongyang Zhang

doi:10.54097/wx2t2298

Authors

Wenqing Zhao, School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
Cen Yang, School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
Xi Chen, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Jian Shi, Baoding Tianwei Baobian Electric Co., Ltd., Baoding, 071000, China
An Bo, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Zenghua Ji, Baoding Tianwei Xinyu Technology Development Co., Ltd., Baoding, 071000, China
Xi Chen, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Congcong Ma, School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
Jing Teng, School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
Leipeng Zuo, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Jieshi Qi, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Shuang Liang, Baoding Power Supply Branch, State Grid Hebei Electric Power Co., Ltd., Baoding, 071000, China
Dongyang Zhang, School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China

Keywords:

Transformer oil leakage, Diffusion model, Industrial defect image generation

Abstract

Transformer oil leakage is a critical defect in power equipment inspection, yet automatic detection remains difficult: real leakage samples are scarce, irregular in shape, variable in color, and easily confused with water stains, rust, shadows, and cluttered structural backgrounds. To alleviate this data scarcity and improve downstream segmentation, this paper proposes MSH-Diff, a multimodal semantic-heuristic diffusion framework for automatic synthesis of transformer oil-leakage images. Through expert-role prompt engineering, the framework uses the visual reasoning capability and prior industrial knowledge of a multimodal large language model to identify leakage-prone components such as flanges, valves, and radiator fin roots. The recognized component semantics are then mapped to spatial Gaussian anchors by analyzing cross-attention responses in the latent diffusion model, which localizes leakage centers without manually drawn masks or key-point annotations. In addition, MSH-Diff automatically constructs dense scene descriptions covering illumination, viewpoint, and surface material; these are encoded as semantic constraints that improve structural consistency and physical realism during diffusion sampling. In experiments, MSH-Diff achieves competitive image quality and diversity, with an FID of 38.14 and an IC-L score of 0.29. When the generated samples are added to the training data for downstream semantic segmentation, the mIoU of DeepLabV3+ rises from 60.09% to 62.31% (a gain of 2.22 percentage points), supporting the effectiveness of the framework for industrial defect data augmentation.
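
The abstract does not say how a cross-attention response becomes a spatial Gaussian anchor. One plausible reading is moment matching: treat the attention map for a component token as a weight distribution, take its weighted centroid as the leakage center, and take its weighted spread as the Gaussian width. The sketch below illustrates that reading only and is not the authors' implementation. The function names, the `min_sigma` floor, and the 8x8 toy map are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def attention_to_gaussian_anchor(
    attn: Grid, min_sigma: float = 1.0
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Fit an axis-aligned Gaussian to a non-negative attention map.

    Returns ((cy, cx), (sy, sx)): the attention-weighted centroid and the
    per-axis standard deviation, floored at min_sigma (a hypothetical
    safeguard against degenerate, single-pixel responses).
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must contain positive mass")

    # First moments: weighted mean row and column index.
    cy = sum(y * v for y, row in enumerate(attn) for v in row) / total
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total

    # Second central moments: weighted variance along each axis.
    var_y = sum((y - cy) ** 2 * v for y, row in enumerate(attn) for v in row) / total
    var_x = sum((x - cx) ** 2 * v for row in attn for x, v in enumerate(row)) / total

    sy = max(math.sqrt(var_y), min_sigma)
    sx = max(math.sqrt(var_x), min_sigma)
    return (cy, cx), (sy, sx)


def gaussian_heatmap(
    height: int, width: int,
    center: Tuple[float, float], sigma: Tuple[float, float],
) -> Grid:
    """Render the anchor as an unnormalized Gaussian heatmap with peak value 1."""
    cy, cx = center
    sy, sx = sigma
    return [
        [
            math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy attention map: all mass on two neighboring cells of row 5,
# standing in for the response to a token such as "flange".
attn = [[0.0] * 8 for _ in range(8)]
attn[5][2] = 1.0
attn[5][3] = 1.0

center, sigma = attention_to_gaussian_anchor(attn)
heatmap = gaussian_heatmap(8, 8, center, sigma)
```

In a real pipeline the attention map would have to be extracted from the diffusion model's cross-attention layers and probably averaged over heads and timesteps. The abstract does not specify those details, so they are left out here.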
