Modifiable fairness: fine-grained bias mitigation in language models

Abstract: Generating fair and accurate predictions is critical for deploying large language models (LLMs) in the real world. However, existing debiasing methods inevitably produce unfair or incorrect predictions: they are designed and evaluated to achieve parity across social groups while overlooking individual common-sense facts, so they alter the model's knowledge and elicit unreasonable or unwanted outputs. In this paper, we first establish a novel bias mitigation benchmark, BiaScope, which systematically evaluates performance using newly constructed datasets and metrics for knowledge retention and generalization. We then propose a novel debiasing approach, Fairness Stamp (FAST), which enables accurate calibration of individual social biases. FAST identifies the decisive layer responsible for storing social biases and calibrates its outputs by integrating a small modular network, balancing bias-mitigation and knowledge-preservation requirements. Extensive experiments demonstrate that FAST surpasses state-of-the-art baselines in debiasing performance without compromising the model's ability to retain knowledge or predict downstream outcomes. This highlights the potential of fine-grained debiasing strategies for achieving fairness in LLMs. The code will be publicly available.
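
The abstract describes FAST as attaching a small modular network to the output of the layer identified as storing biased associations, training only that module while keeping the base model frozen. The sketch below is a minimal, hypothetical illustration of this idea in PyTorch; the class name FairnessStamp, the bottleneck structure, and the hook-based attachment are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch (assumption: a PyTorch transformer whose "decisive" layer
# has already been located). A small bottleneck module adds a learned
# correction to that layer's hidden states; only the module is trained.
import torch
import torch.nn as nn


class FairnessStamp(nn.Module):
    """Hypothetical lightweight calibration module (not the authors' code)."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        # Zero-initialize the up-projection so the stamp starts as an identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual correction: original hidden states plus a learned offset.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))


def attach_stamp(model: nn.Module, decisive_layer: nn.Module,
                 hidden_size: int) -> FairnessStamp:
    """Register the stamp as a forward hook on the chosen layer."""
    stamp = FairnessStamp(hidden_size)

    def hook(_module, _inputs, output):
        # Many transformer blocks return a tuple; calibrate the hidden states only.
        if isinstance(output, tuple):
            return (stamp(output[0]),) + output[1:]
        return stamp(output)

    decisive_layer.register_forward_hook(hook)
    # Freeze the base model so only the stamp's parameters receive gradients.
    for p in model.parameters():
        p.requires_grad_(False)
    return stamp
```

In this reading, the training objective would combine a debiasing term (encouraging parity on counterfactual prompts) with a knowledge-preservation term (keeping outputs on unrelated facts close to the frozen model), though the exact losses are not specified in the abstract.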