Why Unicode Has Four Normalization Forms

Unicode permits more than one code point sequence to represent the same visible string, so the standard defines four normalization forms (NFC, NFD, NFKC, and NFKD) that collapse those alternatives to a single agreed shape. The pair NFC/NFD handles canonical equivalence (precomposed vs. decomposed accents), while NFKC/NFKD additionally fold compatibility variants like ligatures and full-width letters. Each form has a niche: storage and display, sorting and linguistic processing, or loose search and matching.
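As a rough illustration of the two pairs, the sketch below runs Python's standard unicodedata.normalize over two samples chosen here for illustration: the precomposed "é" (U+00E9) and the ligature "ﬃ" (U+FB03). The helper names code_points and all_forms are hypothetical, not part of any library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(s: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

def all_forms(s: str) -> dict[str, str]:
    """Return the code points of s under each of the four normalization forms."""
    return {form: code_points(unicodedata.normalize(form, s)) for form in FORMS}

if __name__ == "__main__":
    # Illustrative samples: a precomposed accent and a compatibility ligature.
    for sample in ("\u00e9", "\ufb03"):
        print(f"input {code_points(sample)}")
        for form, cps in all_forms(sample).items():
            print(f"  {form:<4} {cps}")
```

The accent changes only between the composed and decomposed forms, while the ligature is untouched by NFC and NFD and is rewritten to three plain letters by NFKC and NFKD.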

A single visible character in Unicode often has more than one valid encoding. The letter "é" can be the single code point U+00E9, or it can be "e" (U+0065) followed by the combining character U+0301 COMBINING ACUTE ACCENT. Both render identically and are declared canonically equivalent, but the code point sequences, and therefore the encoded bytes, differ, which breaks naive string comparison, hashing, regex matching, and filename lookups. Unicode Standard Annex #15 defines four normalization forms so that systems can agree on one shape before comparing strings.

The four forms split along two axes. The first axis is canonical vs. compatibility decomposition. Canonical equivalence covers sequences that look and behave identically, such as precomposed vs. decomposed accents. Compatibility equivalence is looser: it also folds the ligature "ﬃ" (U+FB03) into "ffi", maps full-width "Ａ" (U+FF21) to plain "A", and rewrites superscripts and Roman numeral glyphs to their plain digit and letter spellings. The second axis is composed vs. decomposed: NFC and NFKC end by recomposing into precomposed code points where possible, while NFD and NFKD leave the result fully decomposed.

Each form has a preferred use. NFC is the recommended interchange and storage form because it stays compact, preserves visual identity, and round-trips cleanly with legacy encodings; the W3C recommends it for HTML and XML content. NFD is convenient for internal processing, collation, and stripping diacritics, since each base letter and each diacritic is its own code point. NFKC and NFKD are aimed at search, identifier matching, and security checks where "PayPal" and the full-width "ＰａｙＰａｌ" should collide; they are lossy and must not be applied blindly to user-facing text.

Mismatched assumptions about normalization have caused durable bugs. Apple's HFS+ filesystem stored filenames in a variant of NFD, while most other systems and most keyboard input produce NFC, so tab-completion in shells, Git on macOS, Samba and Netatalk file shares, and cross-platform sync tools all misbehaved on accented names. NFKC's compatibility folding has also bitten trademark and identifier systems, because the trademark sign "™" (U+2122) becomes the two letters "TM" under NFKC, silently merging strings that were distinct before normalization.

See Unicode and UTF-8 for the broader encoding context, and Unicode Standard Annex #15 for the formal specification.
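
The typical uses of each form can be sketched with Python's standard unicodedata module. This is a minimal sketch, not a complete matching or security routine: the helper names canonically_equal, strip_diacritics, and loose_key are hypothetical, and real identifier or spoofing checks usually need more than normalization alone (case folding and confusable detection, for example).

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def strip_diacritics(s: str) -> str:
    """Decompose to NFD, then drop characters with a nonzero combining class.

    This is lossy and only removes marks that NFD actually splits off.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def loose_key(s: str) -> str:
    """Build a lossy matching key with NFKC; never display or store this as the text."""
    return unicodedata.normalize("NFKC", s)

if __name__ == "__main__":
    precomposed = "\u00e9"      # "é" as one code point
    decomposed = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                   # False: raw comparison fails
    print(canonically_equal(precomposed, decomposed))  # True

    print(strip_diacritics("caf\u00e9"))               # cafe

    full_width = "\uff30\uff41\uff59\uff30\uff41\uff4c"  # full-width "ＰａｙＰａｌ"
    print(loose_key(full_width) == loose_key("PayPal"))  # True

    # The lossy side of NFKC: the trademark sign collapses to two letters.
    print(loose_key("\u2122") == "TM")                 # True
```

Keeping the NFKC result as a separate key, rather than overwriting the stored string, preserves the original NFC text for display while still allowing loose matching.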
