Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Lang Gao, Kaiyang Wan, Wei Liu, Chenxi Wang, Zirui Song, Zixiang Xu, Yanbo Wang, Veselin Stoyanov, Xiuying Chen

Published 2025 in arXiv.org

ABSTRACT

Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
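
The similarity-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the absolute difference of cosine similarities used here is an assumption, not the BiasLens definition. The function names and the three-dimensional toy vectors are hypothetical; in practice the concept vectors would come from the CAV and SAE pipeline, which is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_gap(target, ref_a, ref_b):
    """Assumed bias proxy: how unevenly the target concept aligns with
    the two reference concepts (0.0 means perfectly balanced)."""
    return abs(cosine(target, ref_a) - cosine(target, ref_b))


# Toy, hand-written concept vectors (hypothetical, for illustration only).
positive = [1.0, 0.0, 0.0]
negative = [0.0, 0.0, 1.0]
skewed_food = [1.0, 1.0, 0.0]    # leans toward "positive"
balanced_food = [1.0, 0.0, 1.0]  # equally close to both references

print(similarity_gap(skewed_food, positive, negative))
print(similarity_gap(balanced_food, positive, negative))
```

With these toy inputs the skewed target yields a gap of about 0.707 and the balanced target yields 0, matching the intuition that the representation of "food" should not sit closer to one sentiment than the other.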

PUBLICATION RECORD

Publication year
2025
Venue
arXiv.org
Publication date
2025-05-21
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.48550/arXiv.2505.15524; arXiv 2505.15524
Source metadata
Semantic Scholar

REFERENCES

Controlling Large Language Models Through Concept Activation Vectors
2025, influential reference
Measuring and Mitigating Racial Disparities in Large Language Model Mortgage Underwriting
2025, cited by this paper
Bias Detection and Fairness in Large Language Models for Financial Services
2025, cited by this paper
Socio-Demographic Biases in Medical Decision-Making by Large Language Models: A Large-Scale Multi-Model Analysis
2024, cited by this paper
Search-based Automatic Repair for Fairness and Accuracy in Decision-making Software
2024, cited by this paper
TrustLLM: Trustworthiness in Large Language Models
2024, cited by this paper
Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation
2024, cited by this paper
Measuring Gender and Racial Biases in Large Language Models
2024, cited by this paper
Towards detecting unanticipated bias in Large Language Models
2024, cited by this paper
Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors
2024, cited by this paper
Uncovering Safety Risks of Large Language Models through Concept Activation Vector
2024, cited by this paper
LLM Evaluators Recognize and Favor Their Own Generations
2024, cited by this paper
Scaling and evaluating sparse autoencoders
2024, cited by this paper
“You Gotta be a Doctor, Lin”: An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations
2024, cited by this paper
Gemma 2: Improving Open Language Models at a Practical Size
2024, influential reference
Are Large Language Models Consistent over Value-laden Questions?
2024, cited by this paper
CLIMB: A Benchmark of Clinical Bias in Large Language Models
2024, cited by this paper
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
2024, cited by this paper
The Llama 3 Herd of Models
2024, influential reference
Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models
2024, cited by this paper
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
2024, cited by this paper
Racial Differences in Pain Assessment and False Beliefs About Race in AI Models
2024, cited by this paper
Efficient Dictionary Learning with Switch Sparse Autoencoders
2024, cited by this paper
LG-CAV: Train Any Concept Activation Vector with Language Guidance
2024, cited by this paper
LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education
2024, cited by this paper
Evaluation and mitigation of cognitive biases in medical language models
2024, cited by this paper
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
2024, cited by this paper
Extracting Unlearned Information from LLMs with Activation Steering
2024, cited by this paper
Can sparse autoencoders be used to decompose and interpret steering vectors?
2024, cited by this paper
Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems
2024, cited by this paper
Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency
2024, cited by this paper
People's Perceptions Toward Bias and Related Concepts in Large Language Models: A Systematic Review
2023, cited by this paper
A Survey on Fairness in Large Language Models
2023, cited by this paper
Large language models propagate race-based medicine
2023, cited by this paper
Explore Spurious Correlations at the Concept Level in Language Models for Text Classification
2023, influential reference
Sparse Autoencoders Find Highly Interpretable Features in Language Models
2023, influential reference
Gender bias and stereotypes in Large Language Models
2023, cited by this paper
Evaluating Large Language Models
2022, cited by this paper
Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
2021, cited by this paper
On Measures of Biases and Harms in NLP
2021, cited by this paper
Unmasking the Mask - Evaluating Social Biases in Masked Language Models
2021, cited by this paper
Measuring Biases of Word Embeddings: What Similarity Measures and Descriptive Statistics to Use?
2021, cited by this paper
RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models
2021, cited by this paper
Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics
2021, cited by this paper
StereoSet: Measuring stereotypical bias in pretrained language models
2020, cited by this paper
Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases
2020, cited by this paper
On Measuring Social Biases in Sentence Encoders
2019, influential reference
Language Models are Unsupervised Multitask Learners
2019, cited by this paper
On Measuring and Mitigating Biased Inferences of Word Embeddings
2019, cited by this paper
The Woman Worked as a Babysitter: On Biases in Language Generation
2019, cited by this paper
Reducing Sentiment Bias in Language Models via Counterfactual Evaluation
2019, cited by this paper
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
2018, influential reference
Understanding the Origins of Bias in Word Embeddings
2018, cited by this paper
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
2017, influential reference
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
2016, cited by this paper
Semantics derived automatically from language corpora contain human-like biases
2016, cited by this paper
The socioeconomic gradient and chronic illness and associated risk factors in Australia: how far have we travelled?
2015, cited by this paper
Character-level Convolutional Networks for Text Classification
2015, cited by this paper
k-Sparse Autoencoders
2013, cited by this paper
Learning Word Vectors for Sentiment Analysis
2011, cited by this paper
Scikit-learn: Machine Learning in Python
2011, cited by this paper

CITED BY

Pro-AI Bias in Large Language Models
2026, cites this paper
Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5
2025, cites this paper
BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models
2025, cites this paper
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
2025, cites this paper