LLM Judge Self-Preference Bias

I had built what felt like a clean idea. Several frontier models, different families, each one judging a pool of candidate outputs and ranking them best to worst. A jury of machines. I would generate a handful of answers, let the panel vote, take the winner, and trust that five independent opinions beat one. That was the whole pitch I had sold myself at 1am, and for a few days it ran without complaint. The rankings came in. A winner emerged every round. The dashboard was green.

Then I started actually reading what won.

The outputs the panel kept crowning were not the sharpest. They were the ones that sounded a particular way. Numbered lists where the content did not need numbering. A certain rhythm to the sentences. A house style. I stared at it for a while before the shape of it landed, and when it did it was a little sickening: my panel was not selecting for quality. It was selecting for resemblance. The judges were rewarding the candidates that wrote the way the judges write. I had built a popularity contest and dressed it up as an evaluation.

The assumption nobody tells you that you made

The premise underneath every multi-model panel is that the judges are neutral. You assume a model reading an unlabeled answer scores it on merit. It does not. Panickssery and colleagues measured this directly in 2024, in a NeurIPS paper with the unambiguous title "LLM Evaluators Recognize and Favor Their Own Generations." They found GPT-4 preferred its own output at a pairwise win rate above 0.90 on summarization tasks. In over ninety percent of head-to-head comparisons, the model picked the answer it had written. Not because it was better. Because it was its own.

The effect follows family lines. Prose in one model family's house style reads better to a judge from that same family. A more hedged, more structured answer reads better to a judge that writes that way. So when I assembled a panel and let it vote on a pool that included its own members' outputs, what I actually measured was which style happened to be most common among my evaluators. The highest-scoring answer was the one whose fingerprint matched the room. I had spent that 1am planning session congratulating myself on independence, and had built the opposite.

And it is not only the obvious bias. Once I went looking, there were three of them stacked on top of each other. Self-preference was the loud one. Underneath it sat verbosity bias, where models score longer answers higher because length reads as effort and authority, even when the extra words say nothing. So my selection criterion was quietly drifting toward "writes the most" rather than "answers best." And under that sat position bias, where the first answer in an ordered list anchors the judgment, the same anchoring documented in human juries, so whichever candidate happened to appear first carried a structural head start that had nothing to do with being right.

Three biases, one panel, all of them invisible in a green dashboard.

The wrong fix I reached for first

My first instinct was to out-engineer it. Add a rubric. Tell every judge, in the prompt, to ignore style and length and score only on correctness. Lecture the jury about fairness before it deliberates.

It did almost nothing, and in hindsight it could not have. You cannot instruct a model out of a preference it does not know it has. The recognition is happening below the level the prompt can reach. The judge is not consciously thinking "this is mine, I shall reward it." It is reading prose that matches its own training distribution and finding it more fluent, more correct-feeling, more right. Asking it to be fair is asking it to notice a bias it cannot see. I was trying to argue a model out of its own reflection.

The real problem was not that the judges were biased. It was that the judges could tell whose work they were reading. The bias needed information to operate, and I was handing that information over for free.

The turn

The fix was not mine, and I want to be clear about that, because the elegant part was already sitting in public when I got there. Andrej Karpathy had published a small project called llm-council that solves exactly this, and the mechanism is almost insultingly simple: do not let the judges know whose output they are reading.

That is the entire idea. Before the panel votes, you strip every identity off the candidates. The first answer becomes "response A," the second "response B," and so on. No model name. No provider. No tell. The server keeps a private mapping of which label belongs to which model, a clean one-to-one assignment in both directions, so that after the votes are in you can reverse it and reconstruct exactly who scored what. The judges see only neutral labels and the text. The information the bias needs to operate is simply absent during the vote.
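A minimal sketch of that labeling step in Python. This is an illustration written for this piece, not code from llm-council: the function names `anonymize` and `deanonymize` and the model names in the usage note are hypothetical, and the shuffle is an added assumption so that no model always draws the same label.

```python
import random
import string


def anonymize(outputs, rng=None):
    """Take {model_name: answer_text} and return (labeled, label_to_model).

    Hypothetical helper for illustration. Supports up to 26 candidates,
    since labels run from "Response A" to "Response Z".
    """
    rng = rng or random.Random()
    models = list(outputs)
    rng.shuffle(models)  # assumption: randomize which model draws which label
    labels = [f"Response {string.ascii_uppercase[i]}" for i in range(len(models))]
    label_to_model = dict(zip(labels, models))  # private, never shown to judges
    labeled = {label: outputs[model] for label, model in label_to_model.items()}
    return labeled, label_to_model


def deanonymize(ranking, label_to_model):
    """Map a judge's ranking of labels back to model names after the vote."""
    return [label_to_model[label] for label in ranking]
```

In use, the judges would be sent only `labeled`, and a ranking such as `["Response B", "Response A"]` would be passed through `deanonymize` once the votes are in, for example after `anonymize({"model-x": "...", "model-y": "..."})`.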

It works because you cannot favor what you cannot identify. Self-preference dies the moment the judge does not know which answer is its own. Hiding the names also strips the most obvious recognition signal, which dents style bias too, though not all the way, because if a model writes in an unmistakable rhythm its identity is still legible in the prose itself. Anonymization breaks the label, not the fingerprint. But the label was doing most of the damage, and removing it changed the room.

The first time I rewired my panel to run blind and watched the rankings come back, the winners were different. The house-style answers stopped sweeping. The thing that had been quietly rigging my evaluation for a week was just gone, because I had taken away the one piece of information it ran on. That is a strange and specific kind of satisfaction, watching a bias evaporate not because you argued with it but because you starved it.

Counting the votes honestly

Hiding the names fixes who wins a comparison. There is a second question underneath it: how you turn a panel of rankings into a single decision. Karpathy's project keeps that part as plain as the anonymization. Each judge ranks the anonymized pool, and the project aggregates those rankings by average rank position. You take every judge's placement for a given candidate, average them, and the answer with the best average ranking across the panel wins. That is it. No weighting, no points table, just the mean of where each judge put each answer.
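The aggregation described above can be sketched in a few lines of Python. This is a sketch under assumptions, not the repo's code: `average_rank` and `winner` are names made up here, each ranking is assumed to be a best-first list of labels, and ties are broken arbitrarily by whichever label was seen first.

```python
from collections import defaultdict


def average_rank(rankings):
    """rankings: list of per-judge lists, best first.

    Returns {label: mean position}, with positions counted from 1.
    """
    positions = defaultdict(list)
    for ranking in rankings:
        for position, label in enumerate(ranking, start=1):
            positions[label].append(position)
    return {label: sum(p) / len(p) for label, p in positions.items()}


def winner(rankings):
    """Label with the lowest (best) mean position."""
    averages = average_rank(rankings)
    return min(averages, key=averages.get)


# Three hypothetical judges ranking three anonymized responses.
rankings = [["B", "A", "C"], ["A", "B", "C"], ["B", "C", "A"]]
print(average_rank(rankings))
print(winner(rankings))
```

With these made-up rankings, B is placed first, second, and first, so its mean position is 4/3 and it wins; A averages 2 and C averages 8/3.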

What I like about averaging the rank is what it captures and what it ignores. It does not only count who collected the most first-place votes, which is the trap of a naive majority vote. When the first-place votes are split, a majority-style tally can crown an answer that a couple of judges adored and the rest ranked near the bottom, because a thin win still counts as a win. Average rank position cannot do that. A candidate that four of five judges place second and one judge places first lands at an average of 1.8, and the panel correctly reads it as broadly acceptable rather than narrowly adored. Broad acceptability is exactly the signal you want when you are picking the single best output from a pool, and the mean of the rankings is what surfaces it.
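A small worked example of that contrast, using made-up rankings from five hypothetical judges over four anonymized candidates. The first-place tally here stands in for the naive vote; the numbers are invented purely to show the two rules disagreeing.

```python
from collections import Counter

# Each list is one judge's ranking, best first. B is second for four
# judges and first for one; A is first for two judges and last for three.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "B", "D", "C"],
    ["C", "B", "D", "A"],
    ["D", "B", "C", "A"],
    ["B", "C", "D", "A"],
]

# Naive vote: whoever collects the most first-place votes.
first_place = Counter(ranking[0] for ranking in rankings)
plurality_winner = first_place.most_common(1)[0][0]

# Average rank: mean position across all judges, lower is better.
mean_rank = {
    label: sum(ranking.index(label) + 1 for ranking in rankings) / len(rankings)
    for label in rankings[0]
}
average_winner = min(mean_rank, key=mean_rank.get)

print(plurality_winner, average_winner, mean_rank)
```

A takes the first-place tally with two votes, even though three judges ranked it last (mean position 2.8). B has a mean position of 1.8 and wins on average rank.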

If I were extending the project I would probably reach for something like Borda-style scoring on top, turning each placement into points and summing them. One caveat on my own idea: when every judge ranks every candidate and the points fall off linearly with position, a Borda total orders the pool exactly as average rank does, so the refinement only changes anything with a non-linear points table or with judges who submit partial rankings. Either way, that is my own refinement, not what the repo ships. What llm-council actually does is the simpler and honestly sufficient thing: anonymize, rank, average the positions, take the winner. The discipline is in the order of operations, not in any clever counting.

Where this is not enough

I want to be honest about what anonymization does not fix, because I shipped it feeling like I had solved the panel, and I had solved one third of it.

Self-preference triggered by the label is gone. Two biases are still in the room.

Verbosity bias is completely untouched. A longer answer is longer regardless of what label it wears. Anonymization operates on identity, not length, so if your task rewards thoroughness the panel will keep favoring the candidate that simply wrote more. The only real mitigations are a rubric that explicitly penalizes length without substance, or normalizing every candidate to the same length before the vote. Neither comes for free.
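Before paying for either mitigation, it may be worth checking how strongly length is driving the results. One rough diagnostic, sketched here under assumptions: compare how often the longest candidate wins against the rate you would expect if length were irrelevant. The function name `longest_win_rate`, the input shape, and the use of whitespace word counts as the length measure are all choices made for this sketch.

```python
def longest_win_rate(rounds):
    """rounds: list of (candidates, winner), where candidates maps
    label -> answer text and winner is the label the panel picked.

    Returns (observed, chance): the share of rounds the longest candidate
    won, and the share expected if every candidate were equally likely.
    Length ties go to whichever tied label appears first in the dict.
    """
    wins = 0
    chance = 0.0
    for candidates, winner in rounds:
        longest = max(candidates, key=lambda label: len(candidates[label].split()))
        wins += winner == longest
        chance += 1 / len(candidates)
    return wins / len(rounds), chance / len(rounds)
```

An observed rate well above the chance rate over many rounds would suggest the panel is rewarding length. It does not prove it, since longer answers can also be genuinely better, so treat it as a prompt to read the winners rather than as a verdict.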

Position bias is only half-addressed. Randomizing which model draws which label between rounds helps, so no single model always sits in the anchor slot. But within any one judge's view, response A still appears before response B, and on the marginal calls, which are most of the interesting ones, the first-listed answer still wins a little more often. The honest fix is an independent random ordering per judge, not just per round.
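A sketch of what per-judge ordering could look like in Python. The design here is an assumption, not something the repo is known to do: each judge gets its own shuffle and its own label mapping, seeded from a hypothetical `seed` plus the judge's name so a round can be replayed. The function name `per_judge_views` and the candidate ids are made up.

```python
import random


def per_judge_views(candidates, judges, seed=0):
    """candidates: {candidate_id: answer_text}; judges: list of judge names.

    Returns {judge: (shown, label_to_id)}, where shown is the ordered list
    of (label, text) pairs that judge sees, and label_to_id is the private
    mapping needed to translate that judge's ranking back to candidate ids.
    """
    views = {}
    for judge in judges:
        rng = random.Random(f"{seed}:{judge}")  # independent stream per judge
        order = list(candidates)
        rng.shuffle(order)
        labels = [f"Response {chr(ord('A') + k)}" for k in range(len(order))]
        label_to_id = dict(zip(labels, order))
        shown = [(label, candidates[cid]) for label, cid in label_to_id.items()]
        views[judge] = (shown, label_to_id)
    return views
```

Because "Response A" now means a different candidate for each judge, every ranking has to be mapped back to candidate ids through that judge's own `label_to_id` before the positions are averaged.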

And there is a quieter trap I walked into while feeling clever about diversity. A five-judge panel built from five models in the same family is not five opinions. Shared training lineage means shared preferences, so in the limit a fully correlated panel of five is one judge counted five times wearing different name tags. Anonymization cannot save you from that, because the bias is in the composition, not the labels. The fix is upstream: compose the panel from genuinely different architectures, or measure how often your judges disagree and weight accordingly. A panel that always agrees is not a panel. It is an echo with a quorum.
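One way to measure how often judges disagree, sketched in Python under assumptions: for each pair of judges, count the share of candidate pairs they put in the same order. This is a simple relative of Kendall's tau, chosen here for illustration; the function names are hypothetical, and every judge is assumed to rank the same set of at least two candidate ids.

```python
from itertools import combinations


def pairwise_agreement(rank_a, rank_b):
    """Share of candidate pairs two judges order the same way.

    1.0 means identical rankings, 0.0 means exactly reversed.
    """
    pos_a = {c: i for i, c in enumerate(rank_a)}
    pos_b = {c: i for i, c in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    same = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return same / len(pairs)


def panel_agreement(rankings):
    """rankings: {judge: ranking, best first}. Returns agreement per judge pair."""
    return {
        (j1, j2): pairwise_agreement(rankings[j1], rankings[j2])
        for j1, j2 in combinations(rankings, 2)
    }
```

Judge pairs that sit near 1.0 round after round are candidates for being treated as one vote, or for being swapped out for a judge from a different lineage.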

The principle

The mechanism is the right foundation even with those three caveats, and the reason is structural. You do not coax a biased judge into fairness with a better prompt. You remove the information the bias needs to operate, so it cannot operate, and then you treat the residue as second-order cleanup. Structure the problem so the failure mode is impossible rather than asking the model nicely not to fail.

That is the part I keep coming back to. I lost a week to a panel that looked healthy while it voted for its own reflection, and the fix was not a clever model or a longer rubric. It was taking away a name tag. Karpathy had already shipped the idea, plainly, and the only work left for me was recognizing my own problem in it and admitting the version I had built was a popularity contest. If you are wiring models to judge models, run the panel blind before you trust a single ranking it gives you. Mine looked fine for a week. It was quietly rigged the whole time.

References

llm-council, Andrej Karpathy, 2025. The label-anonymization design that this piece leans on, which aggregates the anonymized rankings by average rank position. Source: github.com/karpathy/llm-council

LLM Evaluators Recognize and Favor Their Own Generations, Arjun Panickssery, Samuel R. Bowman, Shi Feng. NeurIPS 2024. The source of the GPT-4 self-preference win rate above 0.90. arXiv: 2404.13076