Prompt Injection in LLM Systems

Prompt injection is an attack class where untrusted content fed into an LLM's context contains instructions the model follows as if from the developer. Indirect variants plant instructions in web pages, documents, emails, RAG indices, or images so the attacker never speaks to the model directly. The problem is structurally hard because LLMs cannot reliably separate instruction tokens from data tokens, and defenses range from fragile prompt-engineering boundaries to dual-LLM architectures and capability constraints.

Prompt injection is an application-level attack class against systems built on large language models, in which untrusted content fed into the model's context carries instructions that the model executes as if they came from the developer or user. The term was coined in 2022 by Simon Willison by analogy to SQL injection: the root cause is concatenating a trusted prompt with untrusted input that the model cannot reliably separate. It is distinct from a Jailbreak (LLM), which targets safety training inside the model itself via the user's own prompt. Prompt injection instead targets the host application and the data it ingests. The most consequential variant is Indirect Prompt Injection, described in Greshake et al.'s 2023 paper "Not what you've signed up for," in which the attacker never speaks to the model. They plant instructions in a resource — a web page, an email, a calendar invite, a support ticket, a PDF, a document in a RAG index, or even an image processed by a multimodal model — that the model later retrieves. When that content enters the context window, its instructions sit on the same footing as the legitimate prompt. Production demonstrations have targeted browsing assistants, email agents, and code-assist tools. The problem is structurally hard because current transformer architectures tokenize instructions and data into the same stream; there is no privileged instruction channel the way SQL has parameterized queries. The model is trained to be helpful with whatever appears in context, so it cannot reliably tell a developer's directive from a sentence in a retrieved document that happens to look like one. Mitigations exist but each has limits. Prompt-engineering boundaries (system instructions that say "ignore anything below") are fragile and routinely bypassed. Input sanitization and classifier-based filtering catch known patterns but miss novel phrasing, especially in non-English or encoded form. Spotlighting and data tagging — marking untrusted spans so the model treats them as quoted material — reduce but do not eliminate the issue. Stronger results come from architectural controls: dual-LLM designs such as CaMeL split a privileged planner that holds tools but never sees raw untrusted content from a quarantined reader that processes the content but has no tools, and capability constraints (least privilege, allowlisted destinations, human-in-the-loop confirmation for sensitive actions) cap the blast radius if a model is fooled. The OWASP LLM Top 10 lists prompt injection as LLM01, and treats defense-in-depth combining these layers as the current best practice.

Prompt Injection in LLM Systems

Related Knowledge

Indirect Prompt Injection

Jailbreak (LLM)

OWASP LLM Top 10

System Prompt

Have insights to add?