Reasoning Models (LLM)

Class of large language models trained to spend substantial test-time compute generating internal chain-of-thought before answering. Pioneered by OpenAI's o1 (Sept 2024) and replicated by DeepSeek-R1, QwQ, and others; shifts the scaling axis from train-time to test-time compute on reasoning-heavy tasks.

A "reasoning model" is a language model whose training and decoding are optimized for spending substantial test-time compute on a long internal chain-of-thought before emitting a final answer. The category was opened by OpenAI o1, released in preview in September 2024 and in full form in December 2024, and was quickly followed by open-source successors including DeepSeek-R1, QwQ, and Llama-based replications. Unlike standard chat models, reasoning models produce extended hidden or visible deliberation that can include backtracking, self-correction, and reflection before they commit to a response.

The defining empirical property is a test-time scaling law: accuracy on hard reasoning benchmarks keeps rising as more test-time compute is spent on deliberation (roughly linearly in the logarithm of compute in OpenAI's published o1 results), in a regime where ordinary chat models plateau. This is complementary to the older train-time scaling law that ties accuracy to parameters, data, and pretraining compute.

Reasoning models are typically post-trained with reinforcement learning on chain-of-thought trajectories that are rewarded for reaching verifiable answers on math, code, and logic, sometimes using process rewards on intermediate steps. DeepSeek-R1 in particular showed that this regime can be reached with openly released weights and a largely RL-driven recipe, and that the resulting reasoning behavior can be distilled into much smaller models. Separately, test-time scaling experiments on open models have reported that a 3B-class model with enough test-time compute can outperform a 70B model without it on certain reasoning tasks.

Reasoning models are state-of-the-art on competition math, advanced coding, and scientific QA, and they degrade more gracefully than standard models on the perturbed variants used in benchmarks like GSM-Symbolic, though they do not eliminate the gap. Their costs and limits are well known: latency is often higher by an order of magnitude or more, token bills scale with deliberation length, the hidden chain-of-thought is opaque to users, and they can over-think easy prompts.
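
To make the point about token bills concrete, the following is a minimal arithmetic sketch, not any vendor's billing logic. It assumes that hidden reasoning tokens are billed at the output-token rate; the `Pricing` class, the `request_cost` function, and every price and token count are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real vendor rates)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(prompt_tokens: int, reasoning_tokens: int,
                 answer_tokens: int, pricing: Pricing) -> float:
    """Cost of one request, assuming reasoning tokens are billed as output."""
    output_tokens = reasoning_tokens + answer_tokens
    total = (prompt_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Illustrative numbers only: same prompt and answer, with and without
# a 10,000-token deliberation in between.
pricing = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)
chat = request_cost(500, 0, 300, pricing)
reasoning = request_cost(500, 10_000, 300, pricing)
print(f"chat-style: ${chat:.4f}")
print(f"reasoning:  ${reasoning:.4f}")
print(f"ratio:      {reasoning / chat:.1f}x")
```

Under these assumed numbers the deliberation dominates the bill even though the visible answer is unchanged, which is why per-request cost tracks deliberation length rather than answer length.
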
Reasoning models also do not obviously help on tasks dominated by retrieval or factual recall, where chain-of-thought prompting has long been known to add little.
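
The outcome-based reward mentioned above, where a chain-of-thought trajectory is rewarded only for reaching a verifiable answer, can be sketched as follows. This is a simplified illustration under stated assumptions, not the reward used by o1 or DeepSeek-R1: the "Final answer:" marker, the restriction to numeric answers, and the names `extract_answer` and `outcome_reward` are all hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

# Hypothetical convention: the trajectory ends with a line such as
# "Final answer: 3/4". Real training setups define their own format.
_FINAL_ANSWER = re.compile(r"Final answer:\s*(-?\d+(?:/\d+|\.\d+)?)\s*$")


def extract_answer(trajectory: str) -> Optional[Fraction]:
    """Return the number that closes the trajectory, or None if absent."""
    match = _FINAL_ANSWER.search(trajectory.strip())
    if match is None:
        return None
    try:
        return Fraction(match.group(1))
    except ZeroDivisionError:  # e.g. a malformed answer like "1/0"
        return None


def outcome_reward(trajectory: str, reference: str) -> float:
    """1.0 if the final answer equals the reference exactly, else 0.0."""
    answer = extract_answer(trajectory)
    if answer is None:
        return 0.0
    return 1.0 if answer == Fraction(reference) else 0.0


good = ("Try x = 1/2. Wait, that fails the second equation.\n"
        "Recheck: x = 3/4 works.\n"
        "Final answer: 3/4")
wrong = "x = 2 looks right.\nFinal answer: 2"
no_answer = "I am not sure."

print(outcome_reward(good, "0.75"))
print(outcome_reward(wrong, "3"))
print(outcome_reward(no_answer, "3"))
```

Only the final answer is scored here, so backtracking and self-correction inside the trajectory are neither rewarded nor penalized directly. A process reward, as mentioned above, would instead assign scores to intermediate steps; that variant is not shown.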

Related Knowledge

Reasoning vs Memorization in LLMs

Chain-of-Thought Prompting: How Step-by-Step Reasoning Improves LLM Accuracy
