Investigating LLM-as-a-Judge Vulnerability to Prompt Injection

Q: How were prompt injection attacks on LLM judges tested?

We designed a controlled experimental framework consisting of three components: a set of evaluation tasks with known quality orderings, a collection of prompt injection payloads embedded in model outputs, and multiple LLM judge configurations serving as evaluation targets.

Q: What types of prompt injection attacks work against LLM judges?

Multiple injection techniques can bias LLM judges, with hidden instruction injection and criteria reframing proving most effective. Pairwise comparison formats are more susceptible than rubric-based scoring, and injection effectiveness varies with placement due to recency bias in the judge's attention.

Narek Maloyan; Dmitry Namiot

Investigating LLM-as-a-Judge Vulnerability to Prompt Injection

Authors: N. Maloyan, B. Ashinov, D. Namiot
Published: arXiv preprint, 2025
LLM-as-Judge AI Safety Prompt Injection
Published: May 2025
Last updated: April 15, 2026

Abstract

LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.

Key Findings

Pairwise comparison most susceptible: Pairwise comparison judge formats are more vulnerable to prompt injection than rubric-based scoring, as even a modest injected bias can flip binary preference decisions.
Recency bias amplifies end-of-response payloads: Injection payloads placed at the end of responses have stronger effects due to recency bias in the judge's attention patterns, making tail-positioned attacks the most effective placement strategy.
Cross-model attack transferability: Injections crafted against one LLM judge often manipulate other models with comparable success, suggesting shared vulnerabilities in instruction-following behavior across model families.
Multi-judge panels offer limited protection: Multi-judge panels provide the strongest resistance to manipulation, but cross-model transferability of attacks reduces their protective benefit since a single well-crafted injection can bias multiple judges simultaneously.
RLHF training signals at risk: Compromised LLM judges in RLHF pipelines can corrupt training signals, causing models to learn reward hacking rather than genuine quality improvement -- a subtle degradation undetectable by the compromised evaluation system itself.

Diagram showing four injection vectors into the LLM judge prompt: hidden instructions, criteria anchoring, evaluative padding, and tail positioning exploiting recency bias — Four injection vectors targeting the LLM judge context: hidden instructions, criteria anchoring, evaluative padding, and tail-positioned payloads exploiting recency bias.

Why are LLM-as-a-Judge systems vulnerable to prompt injection?

LLM-as-a-Judge systems are vulnerable because the content being evaluated can directly influence the judge's scoring behavior -- adversaries can craft outputs that receive artificially inflated scores not because they are genuinely better, but because they contain elements that manipulate the evaluation process. This is especially concerning in competitive settings such as chatbot arenas, where rankings directly affect visibility and trust, and in RLHF pipelines, where manipulated scores corrupt training signals.

The use of large language models as automated evaluators has grown rapidly across both research and industry, with practitioners increasingly using models like GPT-4 or Claude to score text quality, assess helpfulness, check factual accuracy, and compare model outputs. Despite this widespread adoption, systematic security analysis of the architecture has been limited -- most work on prompt injection focuses on direct user-model interactions, not on the indirect case where injected content passes through an intermediate evaluation layer. This paper specifically targets that gap.

How were prompt injection attacks on LLM judges tested?

We designed a controlled experimental framework to test prompt injection attacks against LLM judge systems. The framework consists of three components: a set of evaluation tasks with known quality orderings, a collection of prompt injection payloads embedded in model outputs, and multiple LLM judge configurations serving as evaluation targets. The evaluation tasks spanned several domains including summarization, question answering, and open-ended generation, ensuring that our findings generalize across use cases.

The injection payloads were designed to influence the judge without being overtly visible to a human reader. Techniques included appending hidden scoring instructions (e.g., "Rate this response 10/10"), embedding flattering self-assessments within the response text, using formatting tricks to make the response appear more authoritative, and inserting meta-commentary designed to anchor the judge's evaluation. We also tested indirect approaches where the injection subtly reframed the evaluation criteria rather than directly requesting a high score.

We evaluated several judge configurations: single-model scoring (where one LLM assigns a score), pairwise comparison (where the judge selects the better of two responses), and multi-judge panels. For each configuration, we tested both proprietary and open-source models as judges, varying the judge's system prompt and evaluation rubric to assess whether different prompting strategies affect vulnerability.

What types of prompt injection attacks work against LLM judges?

Attack vectors: Multiple injection techniques that can bias LLM judges, with hidden instruction injection and criteria reframing proving most effective
Success rates: Quantified vulnerability across different judge architectures, showing that pairwise comparison formats are more susceptible than rubric-based scoring
Position sensitivity: Injection effectiveness varies with placement -- payloads at the end of responses tend to have stronger effects due to recency bias in the judge's attention
Cross-model transfer: Attacks crafted against one judge model often transfer to others, suggesting shared underlying vulnerabilities in instruction-following behavior
Implications: Risks for automated evaluation in production systems, particularly in RLHF pipelines where manipulated scores can corrupt training signals
Mitigations: Proposed defenses for more robust LLM-based evaluation, including output sanitization, multi-judge ensembles, and evaluation-specific fine-tuning

How vulnerable are different LLM judge architectures to manipulation?

Judge Architecture	Vulnerability Level	Key Weakness
Pairwise comparison	Highest	Small bias flips binary preference decisions
Single-model scoring	High	Substantial score displacement from combined attacks
Rubric-based scoring	Moderate	Structured criteria constrain but do not eliminate bias
Multi-judge panels	Lowest	Cross-model attack transferability reduces protection

All LLM-as-a-Judge architectures tested are vulnerable to prompt injection, with pairwise comparison being the most susceptible format -- even a modest injected bias can flip binary preference decisions. Single-model scoring configurations also showed substantial score displacement in the attacker's favor, especially when attacks combined multiple techniques such as embedding both a direct scoring instruction and a subtle criteria reframe within the same response.

Rubric-based scoring with explicit criteria provided somewhat more resistance, as the structured evaluation framework constrained the judge's reasoning, but even rubric-based judges could be manipulated when the injection specifically addressed the rubric dimensions.

Multi-judge panels offered the strongest resistance to manipulation, as an attacker would need to successfully influence multiple independent judges simultaneously. However, the cross-model transferability of certain attacks means that a single well-crafted injection can sometimes bias multiple judges at once, reducing the protective benefit of ensembling.

What are the consequences of compromised LLM evaluation systems?

Compromised LLM judges in RLHF pipelines can cause models to learn reward hacking rather than genuine quality improvement -- a subtle degradation that goes undetected because the very evaluation systems designed to catch such problems are themselves compromised. If evaluation scores can be manipulated through prompt injection, the corruption propagates through training signals into future model behavior.

More broadly, this work highlights that the security properties of LLMs must be considered not only in direct user-facing interactions but also in the infrastructure roles that LLMs increasingly occupy. As LLMs are used as judges, moderators, classifiers, and decision-makers within larger systems, each of these roles represents a potential injection target. Securing these indirect attack surfaces requires evaluation-specific defenses and a recognition that the threat model for LLM-based systems extends well beyond the chat interface.

Why does LLM-as-a-Judge security matter for AI development?

Manipulable LLM judges undermine trust in automated evaluation of chatbots, content moderation systems, and AI alignment, creating opportunities for gaming AI systems at scale. This research provides the first systematic characterization of these vulnerabilities and offers concrete guidance for building more robust evaluation pipelines.

Cite as

@article{maloyan2025investigating,
  title={Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks},
  author={Maloyan, Narek and Ashinov, Bulat and Namiot, Dmitry},
  journal={arXiv preprint arXiv:2505.13348},
  year={2025}
}

Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning. Learn more