
LLM Self-Preference Bias: How Anonymized Peer Review Fixes It


The Problem: LLM Self-Preference Bias

LLM self-preference bias is the reason naive multi-model evaluation panels don’t work. Ask GPT-4o to judge outputs from GPT-4o, Claude, and Gemini, and it will prefer its own output the majority of the time, regardless of quality. Panickssery et al. (2024, “LLM Evaluators Recognize and Favor Their Own Generations”, NeurIPS) measured GPT-4 self-preference at a pairwise win-rate above 0.90 when judging summarization tasks against competing models - meaning it picked its own output in over 90% of head-to-head comparisons. The same directional effect holds across model families. Claude prefers Claude-flavored prose. Gemini prefers its own hedged, structured responses.

This makes naive multi-model eval panels useless. You have five models judging each other and what you actually measure is a popularity contest with clique dynamics. The highest-scoring output is the one whose style happens to be most common among your evaluators, not the one that best answers the question.

Karpathy’s llm-council solves this with a single, elegant mechanism: label anonymization. The evaluating models never learn whose output they’re reading. Here’s why that works, and the minimal implementation.


Three Biases, Not One

Self-preference is the most obvious failure mode but not the only one. The full picture is worse.

Verbosity bias. Models consistently score longer responses higher on open-ended tasks because length signals effort, completeness, and authority - even when the extra words add nothing. The practical consequence is that an eval panel will systematically rank verbose responses higher than concise ones of equal quality. If you want to use such a panel to select the best generation from a pool, your selection criterion becomes “writes the most” rather than “answers best.” Correcting for this requires either a rubric that explicitly penalizes length-without-substance, or a length normalization step before scoring.

Position bias. When a model receives an ordered list of responses, the first item anchors its judgment. On tasks where quality differences are marginal, response_A carries a structural advantage simply because it appears first. The same anchoring effect is documented in human jury deliberations and human annotation pipelines. In a five-judge panel where every judge sees [A, B, C] in the same order, this advantage compounds - it inflates scores for whichever model happens to draw the first label, independent of quality.

Style bias. Models learn to recognize their own stylistic fingerprints: sentence rhythm, hedging patterns, structural choices like numbered lists vs. prose. They score those patterns higher without needing to see the model name. This is self-preference’s quieter sibling - you cannot prevent it by hiding identities if the identity is legible in the prose itself. Style bias is partially, not fully, broken by anonymization.

Label anonymization directly breaks self-preference and reduces style bias. The model cannot favor what it cannot identify, and anonymized labels strip the most obvious recognition signal. Position bias is partially mitigated by randomizing which model gets which label per round - if each judge sees a different random permutation, the positional advantage averages out across the panel - but within any single judge’s view, the first-listed response still has a small edge.


Karpathy’s Solution: The Bijective Label Map

The core mechanism in llm-council is what the codebase calls a label map: a server-side secret bijection between anonymous response labels and the originating model names.

Before evaluation starts, the server creates a mapping that assigns each model a neutral label - the first model becomes “response A,” the second “response B,” the third “response C,” and so on - and remembers the assignment privately.

This mapping is bijective - one-to-one and onto. Every label maps to exactly one model; every model maps to exactly one label. No collisions, no ambiguity. This matters because the reveal step after evaluation must be deterministic: you need to reconstruct exactly who scored how many points.

Evaluating models receive only the anonymized view: each response appears under its neutral label - “response A,” “response B,” “response C” - paired with the response text and nothing else.

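No model name. No provider logo. No metadata announcing which model wrote what. Stylistic habits inside the text itself - say, GPT-4o’s tendency to number its lists or Claude’s tendency to use bullet headers - remain visible, which is why style bias is only partially broken.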

Each evaluating model produces a ranking: first, second, third. Once all rankings are collected, the server calls reveal() - which maps every label back to its originating model - and aggregates scores. During the evaluation phase, bias keyed on an explicit identity has nothing to act on, because the information it needs is absent from the prompt.
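The label map, the anonymized view, and the reveal step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the llm-council source: the function names build_label_map, anonymize, and reveal, the model names, and the response texts are all hypothetical.

```python
import random
import string


def build_label_map(models, seed):
    """Return (label_to_model, model_to_label) for a shuffled one-to-one assignment."""
    if len(set(models)) != len(models):
        raise ValueError("model names must be unique for the map to be a bijection")
    if len(models) > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 labels")
    shuffled = list(models)
    random.Random(seed).shuffle(shuffled)  # seeded, so the assignment is reproducible
    labels = [f"response {c}" for c in string.ascii_uppercase[: len(shuffled)]]
    label_to_model = dict(zip(labels, shuffled))
    model_to_label = {model: label for label, model in label_to_model.items()}
    return label_to_model, model_to_label


def anonymize(responses, model_to_label):
    """Convert {model: text} into {label: text}, ordered by label; model names are dropped."""
    by_label = {model_to_label[m]: text for m, text in responses.items()}
    return dict(sorted(by_label.items()))


def reveal(ranking, label_to_model):
    """Map a ranking of labels back to model names once scoring is finished."""
    return [label_to_model[label] for label in ranking]


# Hypothetical model names and outputs, for illustration only.
responses = {
    "model-x": "Answer one.",
    "model-y": "Answer two.",
    "model-z": "Answer three.",
}
label_to_model, model_to_label = build_label_map(list(responses), seed=7)
anonymous_view = anonymize(responses, model_to_label)
```

Only anonymous_view would be sent to evaluators; label_to_model stays on the server until the rankings are in.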

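The label assignment itself is randomized per evaluation round using a seed derived from a hash of a UUID, so no model keeps the same label, or the same position, from round to round. This scrambles position bias across rounds. Within a single round every judge still sees the same ordering; the positional advantage only averages out across the panel if each of the five judges also receives its own random permutation of {A, B, C}.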
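One way to derive a per-round seed from a UUID is sketched below. The choice of SHA-256, the 8-byte truncation, and the names fresh_round_seed and assign_labels are assumptions made for illustration; the post does not say which hash the original uses.

```python
import hashlib
import random
import uuid


def fresh_round_seed():
    """Derive an integer seed from a random UUID by hashing its bytes."""
    digest = hashlib.sha256(uuid.uuid4().bytes).digest()
    return int.from_bytes(digest[:8], "big")


def assign_labels(models, seed):
    """Shuffle the models with the given seed and pair them with labels A, B, C, ..."""
    shuffled = list(models)
    random.Random(seed).shuffle(shuffled)
    return {f"response {chr(ord('A') + i)}": m for i, m in enumerate(shuffled)}


models = ["model-x", "model-y", "model-z"]  # hypothetical names
round_one = assign_labels(models, fresh_round_seed())
round_two = assign_labels(models, fresh_round_seed())  # independent of round_one
```

Storing the seed alongside the round makes the assignment reproducible for later audits, since the same seed always yields the same map.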


Borda Count Aggregation

Once you have N evaluators each producing a full ranking of M responses, you need to aggregate. llm-council uses Borda count.

The rule: for M candidates, a first-place vote awards M-1 points, second place awards M-2, down to 0 points for last place. Sum across all evaluators. That’s the complete definition.

Concretely, with M=3 responses and N=5 evaluators:

Place   Points awarded
1st     2 (= M-1)
2nd     1 (= M-2)
3rd     0 (= M-3)

Maximum possible score for any response: 5 evaluators x 2 points = 10. A sweep (every evaluator ranked it first) scores 10. A unanimous last-place finish scores 0.
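The scoring rule is short enough to write out directly. The sketch below assumes each ranking is a list of labels, best first; the function name borda and the five rankings are made up for illustration.

```python
from collections import defaultdict


def borda(rankings):
    """Sum Borda points: a label earns one point per candidate ranked below it."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for place, label in enumerate(ranking):
            scores[label] += m - 1 - place  # first place gets M-1, last gets 0
    return dict(scores)


# Five hypothetical judges ranking three responses, best first.
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["A", "B", "C"],
    ["C", "A", "B"],
]
scores = borda(rankings)
print(scores)  # {'A': 8, 'B': 4, 'C': 3}
```

The totals always sum to N x M(M-1)/2, here 5 x 3 = 15, which is a cheap sanity check that every ballot was complete.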

Why Borda instead of a simple majority vote, where you count only how many evaluators put each response first?

A first-place tally discards everything below the top pick. Example: four of five evaluators rank response_B second, and one evaluator ranks it first. The tally records response_B with a single first-place vote - a thin signal. Borda count adds it up: 4 evaluators x 1 point (second place) + 1 evaluator x 2 points (first place) = 6 points out of a possible 10 - correctly reflecting that it was broadly acceptable across the panel rather than narrowly preferred by one judge and ignored by the rest. When selecting the best generation from a pool, broad acceptability matters. Borda preserves that signal; a first-place tally throws it away.
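The response_B example can be checked directly. In this sketch the full rankings are hypothetical, chosen so that response_B is second for four judges and first for one; the other two responses split the remaining first places.

```python
from collections import Counter

# Hypothetical ballots, best first.
rankings = [
    ["response_A", "response_B", "response_C"],
    ["response_A", "response_B", "response_C"],
    ["response_C", "response_B", "response_A"],
    ["response_C", "response_B", "response_A"],
    ["response_B", "response_A", "response_C"],
]

# Counting only each judge's top pick.
first_place = Counter(ranking[0] for ranking in rankings)

# Borda: one point per candidate ranked below.
borda = Counter()
for ranking in rankings:
    for place, label in enumerate(ranking):
        borda[label] += len(ranking) - 1 - place

print(first_place)  # response_A and response_C get 2 each, response_B gets 1
print(borda)        # response_B leads with 6 points
```

With these ballots response_B comes last on first-place votes and first on Borda points, which is the broad-acceptability signal the paragraph above describes.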


The Design

The whole mechanism fits in two cooperating pieces with no dependencies beyond the standard library.

The first piece is the label map: the server-side bijection. Given the list of model names and a random seed, it builds the neutral labels, shuffles the models, and stores both directions of the mapping - label to model and model to label - so it can both hide identities and reverse the assignment later. It exposes an anonymize step that converts a model-keyed set of responses into a label-keyed set with the originating identity stripped, and a reveal step that returns the label-to-model mapping for use after scoring. Keeping both directions of the map ensures the reveal is deterministic and collision-free.

The second piece is the evaluation round. On construction it draws a fresh random seed - derived from a UUID so each round permutes labels differently - builds the label map, and produces the anonymized view that evaluators will see. It collects rankings as ordered lists of labels (best first), one per judge, and aggregates them with Borda count: for each ranking, a label earns points equal to the number of candidates ranked below it, summed across all judges. After aggregation it reveals the label map and returns the final standings sorted best-first, each entry pairing a label with its now-revealed model name and total score.

To exercise the pipeline end to end without API keys or network calls, a stand-in judge simply returns a random permutation of the labels in place of a real model’s considered ranking. Feeding a small set of competing answers through the round produces the full sequence: the anonymized view the evaluators receive, a handful of judge rankings, the Borda aggregation, and the revealed result. Swapping the stand-in judge for actual model API calls is the only change needed to run it against live models.
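A compact sketch of that pipeline follows, standard library only. It is an illustration of the design as described here rather than the companion demo or the llm-council code: the class name EvaluationRound, its methods, the stand_in_judge helper, the SHA-256 seed derivation, and the sample answers are all assumptions.

```python
import hashlib
import random
import uuid


class EvaluationRound:
    """One anonymized ranking round (illustrative; not the llm-council API)."""

    def __init__(self, responses):
        # Fresh seed per round, derived from a UUID (the hash choice is an assumption).
        digest = hashlib.sha256(uuid.uuid4().bytes).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        models = list(responses)
        rng.shuffle(models)
        labels = [f"response {chr(ord('A') + i)}" for i in range(len(models))]
        self.label_to_model = dict(zip(labels, models))
        # What evaluators see: label -> text, with no model names.
        self.anonymized = {lab: responses[m] for lab, m in self.label_to_model.items()}
        self.rankings = []

    def submit(self, ranking):
        """Record one judge's ranking (labels, best first); reject malformed ballots."""
        if sorted(ranking) != sorted(self.anonymized):
            raise ValueError("ranking must list every label exactly once")
        self.rankings.append(list(ranking))

    def results(self):
        """Borda-aggregate, then reveal: [(label, model, score)], best first."""
        scores = {label: 0 for label in self.anonymized}
        for ranking in self.rankings:
            for place, label in enumerate(ranking):
                scores[label] += len(ranking) - 1 - place
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(label, self.label_to_model[label], score) for label, score in ordered]


def stand_in_judge(labels, rng):
    """Placeholder for a real model call: returns a random permutation of the labels."""
    ranking = list(labels)
    rng.shuffle(ranking)
    return ranking


# Hypothetical competing answers keyed by hypothetical model names.
responses = {
    "model-x": "Use a hash map for O(1) lookups.",
    "model-y": "Sort the list, then binary search.",
    "model-z": "Scan linearly; the input is tiny.",
}
round_ = EvaluationRound(responses)
judge_rng = random.Random(0)
for _ in range(5):
    round_.submit(stand_in_judge(round_.anonymized, judge_rng))
standings = round_.results()
for label, model, score in standings:
    print(f"{label}  {model}  {score}")
```

Because the stand-in judge is random, the printed standings carry no meaning; the point is that the data flow from anonymized view to revealed result runs without any network access.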


Where This Isn’t Enough

Label anonymization removes one bias vector. Three remain.

Position bias is only partially addressed. llm-council randomizes which model gets which label between rounds. That means GPT-4o does not always draw response_A. But within a single judge’s prompt, response_A still appears before response_B and response_C. When quality differences are marginal - which covers most of the interesting cases - first-listed responses win more often. The fix is independent per-evaluator permutation: each judge receives its own randomized presentation order, not just a label assignment that changes between rounds. llm-council does not do this by default.
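Per-evaluator permutation can be layered on top of the anonymized view. The sketch below keeps the labels fixed, so rankings can still be aggregated by label, and only varies the order in which each judge sees them. The helper names and the prompt wording are hypothetical.

```python
import random


def per_judge_orderings(labels, num_judges, seed):
    """Give each judge an independently shuffled presentation order of the same labels."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(num_judges):
        order = list(labels)
        rng.shuffle(order)
        orderings.append(order)
    return orderings


def build_prompt(anonymized, order):
    """Render the candidates in the given order (the prompt wording is hypothetical)."""
    parts = [f"{label}:\n{anonymized[label]}" for label in order]
    return "Rank the following responses from best to worst.\n\n" + "\n\n".join(parts)


anonymized = {
    "response A": "First candidate text.",
    "response B": "Second candidate text.",
    "response C": "Third candidate text.",
}
orderings = per_judge_orderings(list(anonymized), num_judges=5, seed=42)
prompts = [build_prompt(anonymized, order) for order in orderings]
```

A label such as "response A" may itself hint at being first, so a stricter variant would relabel by position for each judge and keep a per-judge map for the reveal; that variant is not shown here.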

Verbosity bias is completely unaddressed. A longer response is longer regardless of what label it carries. Anonymization operates on identity, not length. If your evaluation task rewards completeness or thoroughness, the panel will systematically favor longer outputs. The only mitigations are: (1) a scoring rubric that explicitly penalizes length-without-substance, or (2) truncating all candidates to the same token length before evaluation. Neither is provided out of the box.
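Mitigation (2) might look like the sketch below. It counts whitespace-separated words as a crude stand-in for tokens, which is an assumption; a real pipeline would use the tokenizer of the judging model. The function name and sample texts are illustrative.

```python
def truncate_to_common_length(anonymized, max_words=None):
    """Cut every candidate to the same number of whitespace-separated words."""
    if not anonymized:
        return {}
    split = {label: text.split() for label, text in anonymized.items()}
    limit = min(len(words) for words in split.values())  # shortest candidate sets the cap
    if max_words is not None:
        limit = min(limit, max_words)
    return {label: " ".join(words[:limit]) for label, words in split.items()}


anonymized = {
    "response A": "Paris is the capital of France.",
    "response B": "The capital of France is Paris, which sits on the river Seine.",
    "response C": "France's capital city is Paris, founded long ago.",
}
trimmed = truncate_to_common_length(anonymized)
```

Truncation is a blunt tool: it can cut a longer answer off mid-argument, so it trades verbosity bias for a possible penalty on responses that build to their conclusion.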

Evaluator correlation shrinks the effective panel size. If your five judges are GPT-4o, GPT-4o-mini, GPT-4o-turbo, GPT-4-preview, and o1-mini, you do not have five independent opinions. Models from the same training lineage share RLHF signal, data mixtures, and stylistic preferences. In the limit, a fully correlated panel of five is equivalent to one judge counted five times. The practical fix: measure pairwise disagreement between evaluators and weight votes accordingly, or deliberately compose your panel from architecturally distinct families - one OpenAI model, one Anthropic, one Google, one open-weights.
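Pairwise disagreement can be measured with a normalized Kendall tau distance: the fraction of candidate pairs that two judges order differently. The sketch below is one possible metric, not something the post attributes to llm-council; the judge names and rankings are hypothetical.

```python
from itertools import combinations


def kendall_distance(rank_a, rank_b):
    """Fraction of candidate pairs ordered differently (0 = identical, 1 = fully reversed)."""
    pos_a = {label: i for i, label in enumerate(rank_a)}
    pos_b = {label: i for i, label in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


def pairwise_disagreement(rankings_by_judge):
    """Distance for every pair of judges, keyed by (judge, judge)."""
    return {
        (j1, j2): kendall_distance(rankings_by_judge[j1], rankings_by_judge[j2])
        for j1, j2 in combinations(rankings_by_judge, 2)
    }


# Hypothetical judges; the first two always agree, the third reverses them.
rankings_by_judge = {
    "judge-1": ["A", "B", "C"],
    "judge-2": ["A", "B", "C"],
    "judge-3": ["C", "B", "A"],
}
disagreement = pairwise_disagreement(rankings_by_judge)
```

A single round of three candidates says little on its own. In practice the distance would be averaged over many questions, and a pair of judges that stays near zero is a candidate for down-weighting or for replacing one of them with a model from a different family.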

These are real constraints. The mechanism is still the right foundation - structure the problem so that identity-based bias cannot operate, then address the residual biases as second-order concerns.


When to Use This

This pattern is appropriate when:

  • You need to rank N candidate outputs without human annotation budget. The anonymized panel replaces crowd-sourced preference data on tasks where model quality is the variable of interest, not human subjective taste.
  • You are building a self-improving generation pipeline and need to select the best output from a candidate pool at each iteration. Label anonymization makes that selection signal trustworthy.
  • You want task-distribution-specific model benchmarking without paying for Arena-style continuous human eval. Construct a domain-specific question set, run the anonymized council, and get a local ranking that reflects your task distribution rather than a general leaderboard.

It is not a substitute for human evaluation when the evaluation criterion is itself a matter of human preference - tone, brand voice, creative writing. For those tasks, the human evaluator’s subjective taste is the signal you are trying to measure, not a bias to engineer away.


References and Code

llm-council - Andrej Karpathy, 2025. Source: github.com/karpathy/llm-council

LLM Evaluators Recognize and Favor Their Own Generations - Arjun Panickssery, Samuel R. Bowman, Shi Feng. NeurIPS 2024. arXiv: 2404.13076

The companion demo for this post implements the full round described above - standard library only, no dependencies - with a stand-in judge that can be replaced by real model API calls to run against live models.

@karpathy - the llm-council design is clean. The bijective label map is the kind of simple-in-retrospect solution that makes you wish you’d thought of it first.