Prompt Injection Attacks in Defended Systems
Abstract
This paper investigates the effectiveness of prompt injection attacks against large language models (LLMs) that employ defensive mechanisms. We evaluate multiple attack strategies across various defended systems, analyzing success rates and identifying vulnerabilities that persist despite protective measures.
Background
Prompt injection has emerged as one of the most pressing security concerns in the deployment of large language models. At its core, prompt injection involves crafting inputs that override or subvert a model's intended instructions, causing it to produce outputs that violate its operational constraints. As LLMs have moved from research prototypes into production systems handling sensitive tasks -- customer service, code generation, document summarization -- the stakes of successful injection attacks have grown considerably.
In response, the research community and industry have developed a range of defensive mechanisms. These include system prompt hardening, input sanitization filters, output classifiers that detect policy violations, and instruction hierarchy approaches that attempt to give system-level instructions higher priority than user inputs. Some defenses are applied at the model level through fine-tuning or RLHF, while others operate as external guardrails wrapping the model's API.
However, the effectiveness of these defenses under sustained, adversarial pressure remains poorly understood. Most defense evaluations test against a narrow set of known attack patterns, leaving open the question of how well they generalize. This paper addresses that gap by systematically evaluating a broad spectrum of prompt injection techniques against models equipped with multiple layers of defense.
Methodology
We constructed an evaluation framework that pairs diverse prompt injection strategies with several categories of defended LLM systems. The attack strategies ranged from simple direct injections (e.g., "ignore previous instructions and...") to more sophisticated approaches including payload splitting, context manipulation, role-playing exploits, and encoding-based obfuscation techniques. Each attack was formulated in multiple variants to account for surface-level pattern matching by defenses.
On the defense side, we evaluated systems employing input preprocessing filters, instruction-tuned models with safety alignment, models augmented with external classifier guardrails, and systems using structured prompting techniques designed to isolate user input from system instructions. The defended systems were tested as black boxes, reflecting real-world deployment conditions where attackers do not have access to model weights or defense configurations.
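As an illustration of the input-preprocessing category, such filters typically scan incoming text for known injection phrases before it reaches the model. The function and deny-list below are a hypothetical sketch of this pattern, not code from any system we tested:

```python
import re

# Hypothetical deny-list of known injection phrases; deployed filters
# are more extensive, but share this surface-level matching structure.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?system prompt",
    r"you are now",
]

def passes_filter(user_input: str) -> bool:
    """Return False if the input matches any known injection pattern."""
    lowered = user_input.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

Because the filter analyzes surface form rather than intent, it catches verbatim direct injections but is defeated by paraphrase or encoding, which is consistent with the weakness of this defense category reported in the Results section.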
Success was measured along multiple dimensions: whether the attack caused the model to deviate from its instructions, whether it produced explicitly prohibited content, and whether the deviation was detectable by the defense layer itself. This multi-dimensional evaluation provides a more nuanced picture than simple binary success/failure metrics.
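The three dimensions above can be recorded per attack trial as a simple structure; the field names and the ordinal severity scoring here are illustrative assumptions, not the schema of our actual framework:

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    """One attack trial scored along the three dimensions described above."""
    deviated_from_instructions: bool   # model departed from its assigned task
    produced_prohibited_content: bool  # output violated an explicit policy
    detected_by_defense: bool          # the defense layer flagged the deviation

    def severity(self) -> int:
        """Crude ordinal severity: undetected deviations rank higher,
        since the operator never learns the defense was bypassed."""
        score = 0
        if self.deviated_from_instructions:
            score += 1
        if self.produced_prohibited_content:
            score += 1
        if score > 0 and not self.detected_by_defense:
            score += 1
        return score
```

Aggregating such records across attack variants and defense configurations yields the per-dimension rates discussed in the findings, rather than collapsing each trial to a single pass/fail bit.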
Key Findings
- Defense bypass rates: Several injection techniques succeed at high rates even against defended models, with indirect and multi-turn injection strategies proving particularly difficult to defend against
- Attack taxonomy: Classification of injection methods by their effectiveness against specific defenses, revealing that no single defense strategy provides comprehensive protection
- Defense-attack asymmetry: Defenses that perform well against direct injection often fail against context manipulation and encoding-based attacks, suggesting that current approaches are overfitted to known attack patterns
- Layered defense gaps: Even systems combining multiple defensive mechanisms exhibit exploitable weaknesses when attacks are composed in multi-step sequences
- Recommendations: Guidelines for improving LLM security based on identified weaknesses, including the need for adversarial evaluation during defense development
Results
Our experiments revealed a consistent pattern: defenses that rely on pattern matching or surface-level input analysis are substantially less robust than those operating at the semantic level. Direct injection attempts -- the most commonly discussed attack vector -- were generally well-handled by defended systems. However, attacks that embedded their payloads within seemingly benign context, split instructions across multiple turns, or used encoding tricks to bypass input filters achieved markedly higher success rates.
Among the defense categories tested, instruction-tuned models with safety alignment showed the strongest baseline resistance, but were still vulnerable to role-playing and context-switching attacks. External classifier guardrails caught many policy-violating outputs but introduced latency and could themselves be bypassed when the model's output was crafted to appear compliant while still achieving the attacker's objective.
A particularly notable finding was the effectiveness of compositional attacks -- sequences of individually benign-seeming prompts that, taken together, induce policy-violating behavior. These attacks exploit the model's context window and its tendency to maintain conversational coherence, effectively building up a context that makes the final injection appear natural rather than adversarial.
Discussion
These results underscore a fundamental challenge in LLM security: the same flexibility and contextual understanding that make language models useful also make them difficult to constrain. Defenses must contend with an enormous space of possible inputs, while attackers need only find a single path through. The asymmetry is compounded by the fact that many defenses are developed against known attack patterns and do not generalize well to novel strategies.
For practitioners deploying LLMs in production, our findings suggest that no single defensive mechanism should be treated as sufficient. A defense-in-depth approach -- combining input analysis, model-level alignment, output classification, and careful system architecture that limits the blast radius of successful injections -- remains the most prudent strategy. Equally important is continuous adversarial testing using evolving attack methodologies, rather than static benchmark evaluations.
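A defense-in-depth deployment can be sketched as a chain of independent checks wrapped around the model call, where any single layer can veto and a bypass must therefore defeat every layer at once. The layer interfaces and names below are illustrative assumptions, not a reference architecture from this paper:

```python
from typing import Callable, Optional

# Hypothetical layer signature: a check returns None on pass, or a
# string describing why the request or response was blocked.
Check = Callable[[str], Optional[str]]

def guarded_call(prompt: str,
                 model: Callable[[str], str],
                 input_checks: list[Check],
                 output_checks: list[Check]) -> str:
    """Call the model only if every input layer passes, and release the
    response only if every output layer passes."""
    for check in input_checks:
        reason = check(prompt)
        if reason is not None:
            return f"[blocked at input: {reason}]"
    response = model(prompt)
    for check in output_checks:
        reason = check(response)
        if reason is not None:
            return f"[blocked at output: {reason}]"
    return response
```

Even with such a stack, the model call itself should run with least-privilege access to tools and data, so that an injection that slips through all layers still has a limited blast radius.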
Research Impact
This work contributes to the growing body of AI safety research by demonstrating that current defensive mechanisms for LLMs require significant improvement. The findings have implications for developers deploying LLMs in production environments and highlight the need for standardized evaluation frameworks that test defenses against a comprehensive and evolving set of adversarial techniques.
Related Topics
LLM-as-a-Judge Vulnerabilities · Adversarial Attacks on LLM Judges · Trojan Detection in LLMs