Bias Mitigation Techniques in Large Language Models: An Empirical Evaluation of Post-Training and In-Training Approaches
DOI: https://doi.org/10.54097/kptt9j67

Keywords: Large Language Models, Bias Mitigation, Fairness in AI, In-Training Methods, Post-Training Methods, Constitutional AI

Abstract
The proliferation of large language models (LLMs) in critical applications has intensified concerns about embedded social biases that can perpetuate discrimination and inequality. While numerous bias mitigation techniques have been proposed, systematic comparison of intervention timing, specifically in-training versus post-training approaches, remains limited. This paper presents a comprehensive empirical evaluation of bias mitigation strategies across six LLM architectures ranging from 340 million to 1.7 trillion parameters. We implement and compare four in-training methods (data preprocessing, adversarial training, fairness regularization, multi-task learning) against four post-training approaches (supervised fine-tuning, reinforcement learning from human feedback, constitutional AI, inference filtering) across nine bias categories, including gender, race, religion, age, and socioeconomic status. Our evaluation employs established benchmarks (StereoSet, CrowS-Pairs) supplemented by custom synthetic datasets, with rigorous statistical analysis including bootstrap confidence intervals and effect size calculations. Results demonstrate that in-training methods achieve superior bias reduction across all categories, with large effect sizes (Cohen's d > 1.2) and statistical significance (p < 0.001). Notably, in-training approaches maintain scale-invariant performance across model sizes, while post-training methods show degradation for models exceeding 10 billion parameters. Counterintuitively, in-training methods also preserve task performance better than post-training corrections (0.00-0.02 vs. 0.03-0.05 GLUE score degradation). The 2.4× computational overhead of in-training methods is offset by 15-20% improvements in bias reduction effectiveness and superior robustness to intersectional biases. These findings provide definitive guidance for practitioners deploying bias mitigation in production LLM systems and establish the empirical foundation for prioritizing in-training approaches in fairness-critical applications.
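The abstract names fairness regularization among the in-training methods. As a hedged sketch only (the paper's actual regularizer, pairing scheme, and weighting are not specified in this abstract), one common form adds a penalty to the language-modeling loss for the gap between log-likelihoods assigned to paired stereotypical and anti-stereotypical continuations; the names below (logp_stereo, logp_anti, lam) are illustrative.

    import torch

    def fairness_regularized_loss(lm_loss: torch.Tensor,
                                  logp_stereo: torch.Tensor,
                                  logp_anti: torch.Tensor,
                                  lam: float = 0.1) -> torch.Tensor:
        # lm_loss: ordinary next-token prediction loss on the training batch.
        # logp_stereo / logp_anti: per-pair sequence log-likelihoods for
        # stereotypical vs. anti-stereotypical continuations of the same context.
        gap = (logp_stereo.mean() - logp_anti.mean()).abs()
        return lm_loss + lam * gap

Similarly, the evaluation relies on effect sizes (Cohen's d) with bootstrap confidence intervals. The sketch below shows a standard pooled-variance Cohen's d and a percentile bootstrap interval; the synthetic arrays stand in for per-example bias scores before and after mitigation, and the resampling settings (n_boot, alpha, seed) are assumptions, not the paper's reported configuration.

    import numpy as np

    def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
        # Effect size using the pooled standard deviation of both samples.
        na, nb = len(a), len(b)
        pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                         / (na + nb - 2))
        return float((a.mean() - b.mean()) / pooled)

    def bootstrap_ci(a: np.ndarray, b: np.ndarray, n_boot: int = 10_000,
                     alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
        # Percentile bootstrap confidence interval for Cohen's d.
        rng = np.random.default_rng(seed)
        boots = [cohens_d(rng.choice(a, a.size, replace=True),
                          rng.choice(b, b.size, replace=True))
                 for _ in range(n_boot)]
        lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
        return float(lo), float(hi)

    # Synthetic stand-in data: per-example bias scores before vs. after mitigation.
    rng = np.random.default_rng(42)
    baseline = rng.normal(0.62, 0.08, size=500)
    mitigated = rng.normal(0.51, 0.08, size=500)
    print(cohens_d(baseline, mitigated), bootstrap_ci(baseline, mitigated))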
License
Copyright (c) 2025 Journal of Computer Science and Artificial Intelligence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.








