Jailbreak (LLM)
An LLM jailbreak is a user-supplied prompt that bypasses the model's own safety training, getting it to produce content it was tuned to refuse. It is distinct from prompt injection, which targets the host application rather than the model.
A jailbreak, in the context of a large language model, is a prompt crafted to bypass the model's own safety training and produce content the model was tuned to refuse: for example, detailed instructions for harmful activity, disallowed persona play, or output that violates the provider's usage policy. The term is borrowed from the smartphone-modding community, where it referred to removing vendor restrictions on a device. Jailbreaks are user-driven: the attacker is the one talking to the model, and the target is the model's own alignment rather than the surrounding application. Common categories include role-play framings, hypothetical or fictional wrappers, obfuscation via encoding or translation, and multi-turn pressure tactics. New jailbreaks continue to surface as models are updated.
Jailbreaks are often confused with Prompt Injection in LLM Systems, but the two are different attack classes. Prompt injection is an application-level vulnerability in which untrusted data carries instructions into a developer's prompt; a jailbreak is a model-level vulnerability in which the legitimate user is the adversary.
The defenses differ as well. Model providers harden against jailbreaks through RLHF, red-team training, and refusal classifiers, while prompt injection is mitigated at the application layer through capability constraints, dual-LLM architectures, and least-privilege design.
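To make the "obfuscation via encoding" category and the idea of a refusal classifier concrete, the following is a minimal sketch of why a filter that only reads the literal prompt can miss an encoded request, and how decoding before checking closes that particular gap. Everything here is an illustrative assumption: `BLOCKED_TERMS`, `expand_encoded`, and `is_flagged` are hypothetical names, the keyword check is a toy stand-in for a trained classifier, and no provider is known to work this way.

```python
import base64
import binascii
import re

# Hypothetical placeholder list; a real refusal classifier is a trained model,
# not a keyword match.
BLOCKED_TERMS = ("blocked-topic",)

# Runs of 16 or more base64-alphabet characters, optionally padded.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_encoded(prompt: str) -> list[str]:
    """Return the prompt plus any base64 runs that decode to printable text."""
    views = [prompt]
    for match in B64_RUN.finditer(prompt):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
            text = decoded.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid base64 or not text; ignore this run.
        if text.isprintable():
            views.append(text)
    return views


def is_flagged(prompt: str) -> bool:
    """Toy check: flag if any view of the prompt contains a blocked term."""
    return any(
        term in view.lower()
        for view in expand_encoded(prompt)
        for term in BLOCKED_TERMS
    )
```

This sketch handles only one encoding and one layer of wrapping. Translation, nested encodings, role-play framings, and multi-turn pressure would all pass straight through it, which is one reason the defenses named above rely on training the model itself rather than on input filtering alone.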