Constrained Decoding
Constrained decoding (also called grammar-guided or structured generation) forces an {{LLM}}'s output to match a target grammar by masking invalid tokens at each generation step. The model still chooses each token from its own probability distribution (the highest-probability token under greedy decoding), but only among tokens that keep the output a valid prefix of the grammar. This is how {{Outlines}}, {{guidance}}, XGrammar, and llguidance enforce {{JSON Schema}} or regex constraints.
Constrained decoding intervenes inside the sampling loop. At each step, before the LLM samples its next token, the engine consults a state machine derived from the target grammar (typically compiled from a JSON Schema, regex, or context-free grammar). Any token that would make the output so far an invalid prefix of the grammar has its logit set to negative infinity, so its probability after softmax is exactly zero. The model then samples normally from what remains. Provided generation runs to completion rather than being cut off by a token limit, the output is guaranteed to parse: the model cannot emit a stray markdown fence, drop a closing brace, or use a key the schema does not allow.
Implementations differ in expressive power. FSM-based engines like Outlines are generally the fastest but flatten recursive schemas to a fixed depth. CFG-based engines like XGrammar and llguidance handle arbitrary recursion at higher overhead. guidance has reported roughly 2x faster generation than competing engines in published comparisons. Commercial APIs (Structured Outputs on OpenAI, response_schema on Gemini) wrap the same technique server-side.
Constrained decoding solves syntactic compliance completely but does not solve semantic correctness: a schema-valid value can still be wrong. It is the mechanism behind reliable mitigation of Format-Following Failures in LLMs and underlies most modern Function Calling (LLM) implementations.
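The masking step in the sampling loop can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen for illustration only: the seven-token vocabulary, the hand-written DFA for the regex -?[0-9]+, and the toy_logits function standing in for a model. Real engines compile the automaton from a schema or grammar and apply precomputed masks to logit tensors, so this shows the idea rather than any library's implementation.

```python
import math

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["```", "abc", "7-", "-", "42", "1", "<eos>"]
EOS = "<eos>"

# Hand-written DFA for the regex -?[0-9]+ (an assumption for this sketch).
# States: 0 = start, 1 = after '-', 2 = inside digits (the only accepting state).
ACCEPTING = {2}

def step(state, ch):
    """Return the next DFA state, or None if `ch` is not allowed in `state`."""
    if ch in "0123456789":
        return 2
    if ch == "-" and state == 0:
        return 1
    return None

def advance(state, token):
    """Run the DFA over every character of `token`; None means the token is invalid here."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def allowed(state, token):
    """EOS is legal only in an accepting state; other tokens must keep the DFA alive."""
    if token == EOS:
        return state in ACCEPTING
    return advance(state, token) is not None

def mask_logits(logits, state):
    """Set the logit of every disallowed token to -inf (not 0)."""
    return [x if allowed(state, tok) else -math.inf for tok, x in zip(VOCAB, logits)]

def softmax(logits):
    """Numerically stable softmax; exp(-inf) is 0.0, so masked tokens get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(prefix):
    """Stand-in for a model: it prefers a markdown fence and wants to stop after 3 characters."""
    scores = {"```": 5.0, "abc": 4.0, "7-": 3.5, "-": 3.0, "42": 2.0, "1": 1.0, EOS: 0.5}
    if len(prefix) >= 3:
        scores[EOS] = 10.0
    return [scores[tok] for tok in VOCAB]

def constrained_greedy(max_steps=10):
    """Greedy decoding over masked logits, tracking the DFA state alongside the text."""
    text, state = "", 0
    for _ in range(max_steps):
        masked = mask_logits(toy_logits(text), state)
        token = VOCAB[masked.index(max(masked))]
        if token == EOS:
            break
        text += token
        state = advance(state, token)
    return text

print(constrained_greedy())  # prints -42
```

Without the mask, the toy model's first greedy choice would be the markdown fence. With it, the fence, "abc", and "7-" are excluded at every step, and the end-of-sequence token only becomes available once the DFA reaches its accepting state. Setting a logit to 0 instead of negative infinity would leave the token with nonzero probability, which is why the mask uses -inf.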