GSM-Symbolic Benchmark
Mathematical reasoning benchmark introduced by Mirzadeh et al. (Apple, 2024) that builds symbolic templates from GSM8K problems so names, numbers, and irrelevant clauses can be varied. Designed to test whether LLMs reason or pattern-match; frontier models show large accuracy drops under simple perturbations.
GSM-Symbolic is a mathematical reasoning benchmark released by Iman Mirzadeh and colleagues at Apple in October 2024 (arXiv:2410.05229). It is built by re-expressing problems from the widely used GSM8K grade-school math word-problem dataset as symbolic templates: proper names, numerical values, and surface phrasing become parameters that can be resampled to produce arbitrarily many variants of the same underlying problem, each requiring the same reasoning steps. This helps isolate whether a language model is solving the math or matching against a memorized template.

The headline empirical finding is that every frontier model evaluated, open and closed alike, shows non-trivial accuracy drops when only the surface features change, and that accuracy varies across regenerated test sets by more than would be expected from a model with robust arithmetic competence. A companion split, GSM-NoOp, inserts a single irrelevant clause into each problem (information that looks pertinent but does not change the answer) and shows performance drops of up to 65% across state-of-the-art models. The authors interpret this as evidence that current LLMs perform a form of probabilistic pattern matching over training-data templates rather than carrying out genuine logical reasoning.

GSM-Symbolic has become a reference point in the memorization-versus-reasoning literature and a model for contamination-resistant evaluation: because each test instance is freshly sampled from a template, it is unlikely to appear verbatim in a pretraining corpus. Follow-up work has applied the same template approach to other math, code, and logic benchmarks, and has explored whether reasoning-tuned models (see Reasoning Models (LLM)) close the gap on the perturbed variants. Critiques of the paper note that some accuracy drops shrink when models are prompted carefully or allowed more test-time compute, but the qualitative effect, namely that frontier LLMs are sensitive to perturbations of grade-school math problems, is broadly reproduced.
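The template mechanism described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Template` class, the `sample` function, the apple problem, the value ranges, and the distractor sentence are all hypothetical and chosen only for illustration. It shows the two ideas in the entry: resampling names and numbers under a validity constraint while recomputing the ground-truth answer, and optionally inserting a GSM-NoOp-style irrelevant clause that leaves the answer unchanged.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Template:
    """Hypothetical symbolic template; field names are illustrative."""
    context: str                            # problem setup with {placeholders}
    question: str                           # final question sentence
    names: List[str]                        # candidate proper names
    ranges: Dict[str, range]                # sampling domain per numeric variable
    answer: Callable[[Dict[str, int]], int]  # ground truth as a function of values
    valid: Callable[[Dict[str, int]], bool]  # constraint keeping the problem sensible


def sample(t: Template, rng: random.Random,
           noop: Optional[str] = None) -> Tuple[str, int]:
    """Draw one problem instance and its answer from the template."""
    # Rejection-sample numeric values until the constraint holds.
    while True:
        values = {k: rng.choice(r) for k, r in t.ranges.items()}
        if t.valid(values):
            break
    name = rng.choice(t.names)
    parts = [t.context.format(name=name, **values)]
    if noop is not None:
        # NoOp-style distractor: looks relevant, does not affect the answer.
        parts.append(noop)
    parts.append(t.question.format(name=name, **values))
    return " ".join(parts), t.answer(values)


# Illustrative template, not taken from GSM8K or the paper.
TEMPLATE = Template(
    context=("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
             "{name} gives away {z} apples."),
    question="How many apples does {name} have left?",
    names=["Sofia", "Liam", "Amara", "Kenji"],
    ranges={"x": range(5, 50), "y": range(5, 50), "z": range(1, 60)},
    answer=lambda v: v["x"] + v["y"] - v["z"],
    valid=lambda v: v["z"] <= v["x"] + v["y"],  # cannot give away more than picked
)

NOOP = "Five of the apples are slightly smaller than average."

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        text, ans = sample(TEMPLATE, rng)
        print(text, "->", ans)
    print(sample(TEMPLATE, random.Random(0), noop=NOOP))
```

Evaluating a model on many test sets drawn this way, each with a different seed, gives a distribution of accuracies rather than a single number, which is how variance across regenerated sets can be measured.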