safetyintermediate

Prompt Injection

Also known as: adversarial prompting, instruction injection, indirect prompt injection

An attack where malicious instructions embedded in external content attempt to override a language model's intended behavior.

Definition

Prompt injection is a class of adversarial attacks against language model systems in which untrusted external content — such as web pages, documents, user inputs, or tool results — contains hidden or overt instructions designed to override the model's original system prompt or directives. When a model processes this content as part of its context, it may follow the injected instructions instead of (or in addition to) its legitimate ones, potentially leaking sensitive data, taking unauthorized actions, or producing responses the system designer did not intend.

Indirect prompt injection is the most dangerous variant in agentic contexts: the attacker's payload is not in the user's direct message but embedded in external content the agent retrieves or processes autonomously — for example, a malicious README file, a webpage loaded by a browsing agent, or a document retrieved via RAG. This enables remote attacks without direct access to the victim's session, and the blast radius scales with the agent's permissions.

Defenses include input and output validation (guardrails), privilege separation between trusted system-prompt content and untrusted external content, sandboxing tool execution, model-level safety training, clearly delimited prompt structure that marks untrusted content, and explicit confirmation steps before high-consequence actions. Prompt injection is recognized as the primary security threat in agentic AI systems and is tracked by OWASP as a top vulnerability for LLM applications.

Models Mentioning Prompt Injection(3)

Llama Prompt Guard 2 22M2025-04 Llama Prompt Guard 2 86M2025-04 Prompt Guard 86M2024-07