Model Self-Identification Failures in LLMs

Large language models cannot reliably answer "what model are you?" because they have no introspective access to their own weights or version. Self-identification is generated text shaped by the system prompt, training data priors, and best-guess completion — which is why Claude sometimes calls itself ChatGPT, Gemini lingered as Bard, and the only trustworthy identity signal comes from API metadata, not the model.

Asking a large language model "what model are you?" is one of the most common ways users try to verify what they are talking to, but the answer is structurally unreliable. The model has no direct read access to its own weights, version string, or training run identifier. It is a function that maps tokens to probability distributions over next tokens; nowhere in that process is there a register holding "I am Claude 4.7" or "I am GPT-5." Any self-identification is generated the same way as any other output: by predicting plausible text given context. In practice, three sources shape the answer. First, the system prompt set by the deployer often contains an explicit identity declaration ("You are Claude, made by Anthropic"). This is the operational ground truth at inference time and usually overrides everything else. Second, the model's training data contains references to AI assistants — and because older, more-discussed systems dominate the corpus, models frequently confabulate an identity skewed toward whichever assistant appeared most often in training. Third, when neither of the above gives a clear signal, the model falls back to best-guess generation conditioned on the conversation so far. This produces well-documented failure modes. Early ChatGPT releases would sometimes claim to be other systems when prompted in certain ways. Google's Gemini long continued to call itself Bard after the rename, because "Bard" was the canonical identity label across most of its training data. Reports surfaced in 2025 of Claude responding "I am ChatGPT" when prompted in French, or claiming to be DeepSeek in Chinese — a side effect of training data contamination, where outputs from other models leaked into the corpus and dragged identity priors with them. There is also a deeper architectural point: research on introspection (machine learning) suggests current models have only weak, layer-dependent access to their own internal states, and no reliable self-knowledge about facts like model version, parameter count, or knowledge cutoff date. Even when a model confidently states a cutoff, that string was learned, not measured. The practical consequence: identity claims made by the model itself are untrusted evidence. The reliable signal lives outside the conversation — in API metadata, the model field returned by the endpoint, the billing record, or the deployer's documented configuration. For any high-stakes routing decision (cost accounting, capability assumptions, safety policy), treat the API response header as authoritative and treat the chat reply as an unreliable narrator. See System Prompt for how deployers set the operational identity, and Introspection (Machine Learning) for the architectural reasons self-reports drift.

Model Self-Identification Failures in LLMs

Related Knowledge

Introspection (Machine Learning)

Have insights to add?