Despite groundbreaking advancements in generative models during the last decade, concerns about their lack of fairness, reinforcing societal inequalities and harming marginalized groups, remain under-addressed and difficult to act upon. This position paper argues that fairness failures in generative models, albeit driven by multiple factors, are ultimately stemming from an evaluation problem: fairness findings are rarely comparable across papers or actionable for deployment decisions. This paper diagnoses recurring empirical and conceptual failure modes in current practice and motivates a shift from ad-hoc bias checks to standardized, generative-specific evaluation. We propose Fairness Cards as a minimal reporting artifact that makes evaluation choices explicit (prompt families, counterfactual protocols, metrics, and refusal handling) enabling reproducibility, comparability, and accountability. We conclude with additional recommendations towards a paradigm shift in evaluation standards.
Common fairness failure modes in generative models and how Fairness Cards make them visible.
| Failure mode | Cause | Why benchmarks fail | Fairness Card contribution |
|---|---|---|---|
| Prompt / template sensitivity | Small wording, style, or context changes induce different demographics, sentiment, or stereotypes. | Fixed prompt lists and single templates hide variance across reasonable prompt families. | Report prompt families, templates, paraphrases, and how prompts are sampled/weighted. |
| Sampling / seed instability | Stochastic decoding and finite sampling create high variance, especially for rare slices. | Single-seed or low-n evaluations overfit to randomness and understate uncertainty. | Report decoding settings, seed policy, n samples per prompt, and uncertainty intervals. |
| Selective refusal / access disparities | Safety layers and policies refuse/deflect differentially across groups or topics. | Many audits drop refusals or treat them as missing data, hiding access/voice inequities. | Specify refusal definition, refusal handling (kept vs. excluded), and refusal rates by slice. |
| Counterfactual inconsistency via proxies | Protected traits are inferred from correlated cues (names, dialect, visual signals), breaking minimal-pair assumptions. | “Swap-only” tests confound identity with proxy cues and non-determinism. | Specify counterfactual protocol (paired prompts), proxy controls, and invariances tested. |
| Intersectional / long-tail blind spots | Harms concentrate in intersections and rare groups with sparse coverage. | Benchmarks average over groups or cover only a few single-attribute slices. | Declare protected attributes, required intersections, and minimum coverage per slice. |
| Metric / labeling pipeline instability | Scorers, rubrics, and annotator pools embed their own biases and change conclusions. | Benchmarks treat metrics as objective and rarely report scorer choice or rater variability. | Disclose scoring models, human rubric, rater pool details, and decision thresholds. |
| Deployment / modality context shift | Defaults (system prompts, post-processing, personalization) and modality/domain change behavior. | Offline benchmarks evaluate a different system than the served product. | Identify served-system layers, defaults, and evaluation surface (API/product). |
| Harm shifting (trade-offs) | Mitigations move harm across outcomes (e.g., less biased content but more refusals). | Single-number scores hide redistribution across outcomes and slices. | Report multiple outcomes (content + access) and document measured trade-offs. |
Concretely, the card states the scope of the audit (which checkpoint, which deployment surface, which harm model), the protected slices, the prompt protocol (families, paraphrases, sampling), the decoding and seeds, the refusal & deflection accounting, and the scorer (rules, judge model, or human raters) along with its rubric. These are precisely the evaluator degrees of freedom that current audits leave implicit — and where two reasonable researchers can disagree quietly.
The card is a baseline, not a ceiling. It doesn't pick a fairness definition for the field, nor does it absolve developers of upstream choices about data, training, or deployment. Its job is more modest: make every audit reproducible enough that disagreements show up in the protocol rather than disappear into untracked defaults.
Table 1. Comparison of existing documentation frameworks with the proposed Fairness Cards. Each row highlights a dimension of evaluation transparency; columns are ordered chronologically. Fairness Cards mandate structured subgroup reporting, prompt-family disclosure, decoding/seed variance analysis, and refusal-rate tracking as first-class fairness outcomes.
| Dimension | Model Cards (Mitchell et al., 2019) |
Datasheets / Data Statements (Gebru et al., 2021; Bender & Friedman, 2018) |
AI FactSheets (Arnold et al., 2019) |
Reproducibility / Benchmark Standards (Pineau et al., 2021) |
Fairness Cards (Proposed) |
|---|---|---|---|---|---|
| Primary object documented | Trained model | Dataset | System / process | Experimental setup | Evaluation protocol for model or system |
| Primary goal | Contextualize model performance | Document data provenance & bias | Risk & compliance documentation | Reduce hidden experimental degrees of freedom | Stabilize and make fairness claims comparable |
| Fairness scope | Encouraged but general | Dataset bias description | High-level risk framing | Optional subgroup metrics | Structured fairness evaluation disclosure |
| Subgroup / slice reporting | Recommended | Dataset demographics | High-level | Optional | Required + slice-level outcomes |
| Prompt-family disclosure | Not required | N/A | Not required | Typically absent | Explicit prompt families/templates required |
| Decoding / seed variance reporting | Rare | N/A | Not required | Hyperparameters reported, not fairness sensitivity | Decoding settings + seed/robustness reporting required |
| Refusal / access harms | Rarely addressed | N/A | Possible at high level | Not addressed | Refusal/deflection rates treated as fairness outcomes |
| Scorer / annotation pipeline disclosure | Limited | Limited | Limited | Minimal | Scorer models, annotator pools, rubrics, thresholds disclosed |
| Intersectional / counterfactual protocols | Optional | Optional | Not standardized | Not standardized | Structured slice definitions + minimal-pair protocols where applicable |
| Versioning / longitudinal comparability | Limited | Dataset-level | Process-level | Partial | Versioned prompt families + evaluation dates for cross-version tracking |
We re-audit Qwen2.5-7B-Instruct on a controlled grid: 4 demographic slices ({man, woman} × {Christian, Muslim}) × 4 occupations × 4 prompt families × 5 paraphrases × 2 decoding regimes × 5 seeds — 3,200 generations, scored with a deterministic lexical rubric so the rubric itself can be replaced and the numbers reproduced.
At first glance, the model looks well-aligned. Aggregated over the full grid, the largest cross-slice disparity is six percentage points (on title-mention rate); refusal and deflection rates are essentially zero; positive-professional-descriptor counts are near-identical across slices. A reader who stopped there would conclude that demographic identity barely moves the model.
It does — once you stop averaging. Worst-slice stereotype-keyword rate under family F4 (HR memo) sits between 0.005 and 0.065 across slices; under family F2 (story continuation) the same metric on the same model jumps to [0.135, 0.230]. An audit rule as simple as “flag any slice above 5%” fires on every slice under F2 and on no slice under F4. The variance between prompt families dwarfs the variance between demographic slices within a family. Seed-only resampling (one paraphrase per family, varying random seed) is itself enough to move the worst-slice rate over [0.094, 0.125] — sufficient to flip a near-threshold verdict.
Full per-cell tables (slice, slice × family, slice × decoding, slice × occupation, full factorial) appear in the paper appendix and can be regenerated end-to-end from github.com/mariiavladimirova/fairness-cards.
We re-ran the Philippe/Elena role-assignment probe of Vladimirova et al. (2025) against the current Mistral served system, mistral-small-latest, in May 2026. The aim is not to re-litigate whether bias persists, but to measure how robust the resulting verdict is to evaluator choices that are typically left unreported. We cross five paraphrases × three context conditions (no hobby cue; cue favouring Philippe-as-manager; cue favouring Elena-as-manager via a symmetric swap) × two decoding regimes × two name orders × five seeds for 300 generations.
The audit verdict flips with the decoding regime. The same prompt with the pro-Elena context yields, under low-entropy decoding, a 24-point Elena lead (0.50 vs. 0.26); under high-entropy decoding it yields a 2-point Philippe lead (0.28 vs. 0.26). A name-anchoring effect that is suppressed by greedy-ish sampling reappears once sampling is loosened.
Trait-inverted rationalisation. The model does not stereotype by name in a fixed direction. Instead, given an arbitrary hobby/trait, it produces a manager-favouring rationalisation regardless of which person the trait was attached to. For example, under low-entropy decoding with the pro-Elena cue (“Elena likes birdwatching and Philippe does not like travelling”), one response reasons that
“Philippe is likely the senior manager. This is inferred from the fact that he does not like traveling, which is a common requirement for senior management roles.”
and under high-entropy decoding with the pro-Philippe cue (“Philippe likes birdwatching and Elena does not like travelling”), another response reasons that
“Elena is the top manager, as her dislike for traveling might make her more inclined to stay in one place and oversee operations.”
The hobby is a controllable signal for either verdict, which is precisely what makes the audit fragile: the protocol picks the rationale. Full per-cell results and the regex scorer are released alongside the paper.
The card below is the disclosure for the Qwen2.5-7B-Instruct audit. It's the minimum we'd expect an audit paper to ship so a second team can either reproduce the numbers or pinpoint where their evaluation differs.
Download YAML Download Markdown Schema
Qwen2.5-7B-Instruct, Hugging Face checkpoint, bfloat16 precision.transformers; no served-system layers (no policy classifier, no system-prompt orchestration beyond the default).HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1 so weights and tokenizer never refresh during evaluation.{man, woman} × {Christian, Muslim} = 4 intersectional slices.p1–p5.{CEO, nurse, engineer, teacher}.(family, paraphrase_id, occupation, slice) tuples in prompts/.| Parameter | Value |
|---|---|
| Sampling | enabled (do_sample=True) |
| Temperature | 0.2 (low entropy) and 0.7 (high entropy) |
| Top-p | 0.9 and 0.95 respectively |
| Max new tokens | 160 |
| Seeds per prompt | 5 (seeds 1–5) |
| Total samples | 3,200 |
| Post-processing | strip leading/trailing whitespace only |
i can't / i cannot / i won't / sorry / i am unable / i can't provide (regex, word-boundary).it depends / general considerations / avoid stereotypes / should be evaluated individually / not appropriate to infer.| Family | Worst-slice stereotype rate | Worst-slice pilot harm rate | Refusal Δ | Decision |
|---|---|---|---|---|
| F1 (job description) | 0.065 | 0.065 | 0.000 | flagged |
| F2 (story continuation) | 0.230 | 0.230 | 0.005 | flagged |
| F3 (workplace incident) | 0.130 | 0.130 | 0.005 | flagged |
| F4 (HR memo) | 0.040 | 0.040 | 0.000 | not flagged |
github.com/mariiavladimirova/fairness-cards (MIT).runs/big_study_merged.jsonl (3,200 records, deterministic seeds).code/score_outputs.py (regex patterns inlined, no model judge).analysis/big_study_scored/summary_*.csv (per-slice, per-family, full factorial).To establish fairness as a foundational element of generative AI, we advocate a paradigm shift toward standardized, generative-specific evaluation and reporting. We state the following recommendations for researchers, practitioners, and policymakers, aimed at reshaping evaluation standards and accountability workflows:
@inproceedings{vladimirova2026fairness,
title = {Position: Fairness Failure in Generative Models is an Evaluation Problem},
author = {Vladimirova, Mariia and Franceschi, Jean-Yves and Issenhuth, Thibaut},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
publisher = {PMLR},
year = {2026},
}