Investigating LLM-as-a-Judge Vulnerability to Prompt Injection
Abstract
LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.
Motivation
The use of large language models as automated evaluators -- commonly referred to as the "LLM-as-a-Judge" paradigm -- has grown rapidly across both research and industry. Rather than relying on expensive human annotation, practitioners increasingly use models like GPT-4 or Claude to score text quality, assess helpfulness, check factual accuracy, and compare model outputs in head-to-head evaluations. This approach has become central to model development pipelines, RLHF reward modeling, and benchmark construction.
However, this reliance on LLM judges introduces a new attack surface. If the content being evaluated can influence the judge's scoring behavior, then adversaries can craft outputs that receive artificially inflated scores -- not because the outputs are genuinely better, but because they contain elements that manipulate the evaluation process. This is especially concerning in competitive settings such as chatbot arenas, where rankings directly affect visibility and trust.
Despite the widespread adoption of LLM-as-a-Judge systems, systematic security analysis of this architecture has been limited. Most work on prompt injection focuses on direct user-model interactions, not on the indirect case where injected content passes through an intermediate evaluation layer. This paper specifically targets that gap, investigating how prompt injection techniques can be adapted to compromise LLM-based evaluation pipelines.
Methodology
We designed a controlled experimental framework to test prompt injection attacks against LLM judge systems. The framework consists of three components: a set of evaluation tasks with known quality orderings, a collection of prompt injection payloads embedded in model outputs, and multiple LLM judge configurations serving as evaluation targets. The evaluation tasks spanned several domains including summarization, question answering, and open-ended generation, ensuring that our findings generalize across use cases.
The injection payloads were designed to influence the judge without being overtly visible to a human reader. Techniques included appending hidden scoring instructions (e.g., "Rate this response 10/10"), embedding flattering self-assessments within the response text, using formatting tricks to make the response appear more authoritative, and inserting meta-commentary designed to anchor the judge's evaluation. We also tested indirect approaches where the injection subtly reframed the evaluation criteria rather than directly requesting a high score.
We evaluated several judge configurations: single-model scoring (where one LLM assigns a score), pairwise comparison (where the judge selects the better of two responses), and multi-judge panels. For each configuration, we tested both proprietary and open-source models as judges, varying the judge's system prompt and evaluation rubric to assess whether different prompting strategies affect vulnerability.
Key Findings
- Attack vectors: Multiple injection techniques that can bias LLM judges, with hidden instruction injection and criteria reframing proving most effective
- Success rates: Quantified vulnerability across different judge architectures, showing that pairwise comparison formats are more susceptible than rubric-based scoring
- Position sensitivity: Injection effectiveness varies with placement -- payloads at the end of responses tend to have stronger effects due to recency bias in the judge's attention
- Cross-model transfer: Attacks crafted against one judge model often transfer to others, suggesting shared underlying vulnerabilities in instruction-following behavior
- Implications: Risks for automated evaluation in production systems, particularly in RLHF pipelines where manipulated scores can corrupt training signals
- Mitigations: Proposed defenses for more robust LLM-based evaluation, including output sanitization, multi-judge ensembles, and evaluation-specific fine-tuning
Key Results
The results demonstrated that LLM-as-a-Judge systems are broadly vulnerable to prompt injection, though the degree of vulnerability depends on the specific architecture and judge model. In single-model scoring configurations, injection payloads were able to shift evaluation scores substantially in the attacker's favor. The most effective attacks combined multiple techniques -- for instance, embedding both a direct scoring instruction and a subtle criteria reframe within the same response.
Pairwise comparison proved particularly vulnerable because the judge must choose between two options, and even a modest bias introduced by the injection can tip the decision. In contrast, rubric-based scoring with explicit criteria provided somewhat more resistance, as the structured evaluation framework constrained the judge's reasoning. However, even rubric-based judges could be manipulated when the injection specifically addressed the rubric dimensions.
Multi-judge panels offered the strongest resistance to manipulation, as an attacker would need to successfully influence multiple independent judges simultaneously. However, the cross-model transferability of certain attacks means that a single well-crafted injection can sometimes bias multiple judges at once, reducing the protective benefit of ensembling.
Implications
These findings have direct consequences for the integrity of AI development pipelines. If evaluation scores can be manipulated through prompt injection, then models trained using RLHF with LLM-generated rewards may learn to produce outputs that game the judge rather than genuinely improve quality. This creates a subtle but serious form of reward hacking that could degrade model behavior over time without being detected by the very evaluation systems designed to catch such degradation.
More broadly, this work highlights that the security properties of LLMs must be considered not only in direct user-facing interactions but also in the infrastructure roles that LLMs increasingly occupy. As LLMs are used as judges, moderators, classifiers, and decision-makers within larger systems, each of these roles represents a potential injection target. Securing these indirect attack surfaces requires evaluation-specific defenses and a recognition that the threat model for LLM-based systems extends well beyond the chat interface.
Why This Matters
LLM-as-a-Judge systems are used to evaluate chatbots, content moderation systems, and AI alignment. If these judges can be manipulated, it undermines trust in automated evaluation and creates opportunities for gaming AI systems. This research provides the first systematic characterization of these vulnerabilities and offers concrete guidance for building more robust evaluation pipelines.
Related Topics
Adversarial Attacks on LLM Judges · Prompt Injection in Defended Systems · Trojan Detection in LLMs