Why LLMs Prefer Plausible Over True

Large language models are trained to predict the next plausible token, not to track ground truth. Plausibility is the optimization target; correctness is a downstream correlate. The gap shows up as fluent fabrication, citations that look valid, and code that looks like it should run.

The core training objective of a language model is to assign high probability to sequences that resemble its training distribution. Next-token prediction rewards continuations that are statistically plausible given the preceding context. Truth is not in the loss function. When the most plausible continuation happens to be true, the model looks knowledgeable; when it is not, the same machinery produces the same kind of fluent text, only wrong. Bender and Koller's 2020 paper "Climbing towards NLU" argued, via the octopus test thought experiment, that meaning cannot be learned from linguistic form alone: a model trained only on form has no channel to ground claims in the world. The 2021 Stochastic Parrots paper extended this critique, framing LLMs as systems that stitch together likely linguistic patterns without reference to meaning. The practical signature of this mismatch is confabulation: outputs that are locally coherent and stylistically appropriate but factually invented.

The failure modes are remarkably patterned. Fake academic and legal citations include plausible author names, journal titles, and ID numbers; courts in the UK and US have sanctioned filings that cited nonexistent cases produced by AI tools. Generated code uses real-looking function names, sensible argument orders, and idiomatic style, yet calls APIs that do not exist. Historical claims arrive with confident dates and attributions that map onto the right century but the wrong people. See Citation Hallucination in LLMs for the citation-specific pattern.

RLHF can make the picture worse. Pretrained base models tend to be reasonably calibrated: their token probabilities track empirical frequencies. Preference tuning reshapes outputs toward what human raters prefer, and raters tend to prefer answers that sound confident, articulate, and decisive. The result is a documented calibration gap: preference-tuned models are often most overconfident exactly where they are wrong. Related effects include sycophancy, where the model agrees with whatever premise the user supplies, and mode collapse toward majority-preferred phrasings.

What helps is shifting some of the work off the model's parametric memory. RAG (Retrieval-Augmented Generation): How LLMs Access External Knowledge grounds generation in retrieved passages, which constrains the plausible-continuation space to text the model can cite. Tool use, structured verification, and uncertainty-aware decoding push in the same direction. What does not reliably help is asking the model to be more careful: instructions to avoid hallucination are themselves tokens, and a system optimized for plausibility can produce plausible disclaimers alongside plausible fabrications.
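As one illustration of structured verification, the check for nonexistent APIs can be moved outside the model entirely. The sketch below is a minimal example, not a complete validator: the helper name `api_exists` is hypothetical, and it assumes the generated code refers to Python packages that are installed locally and safe to import. It only confirms that a dotted name resolves; it says nothing about whether the call is used correctly.

```python
import importlib


def api_exists(dotted_name: str) -> bool:
    """Return True if a dotted name such as "json.loads" resolves to a real object.

    Hypothetical helper for illustration. Importing a module executes its
    top-level code, so this should only be run against trusted packages.
    """
    parts = dotted_name.split(".")
    # Try the longest importable module prefix first, then walk the
    # remaining components as attributes of that module.
    for i in range(len(parts), 0, -1):
        module_name = ".".join(parts[:i])
        try:
            obj = importlib.import_module(module_name)
        except (ImportError, ValueError):
            continue
        for attr in parts[i:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


# A real function and a plausible-looking one that does not exist.
print(api_exists("json.loads"))
print(api_exists("json.parse"))
```

The point of the design is that the verdict comes from the interpreter's actual module state rather than from another round of text generation, so a plausible but invented name fails the check no matter how idiomatic it looks.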

Related Knowledge

Confabulation (LLMs)

Confidence Calibration in LLM Outputs

Capability Hallucination in LLM Agents

Sycophancy in LLM Responses

Citation Hallucination in LLMs

Source Attribution in LLM Outputs
