Why URL Encoding Is So Confusing

URL encoding tangles three different rule sets: RFC 3986 percent-encoding for URIs, application/x-www-form-urlencoded for HTML forms (where + means space), and the WHATWG URL Standard's context-specific encode sets. The mismatches explain bugs like + in Gmail addresses, double-encoded %2520, and URLs that work in a browser but break in curl.

URL encoding looks like a single rule but is actually several overlapping conventions, which is why bugs around it are so common. There are three contexts to keep straight. The first is percent-encoding as defined by RFC 3986, the generic URI syntax. It splits characters into a reserved set (general delimiters like `:/?#[]@` and sub-delimiters like `!$&'()*+,;=`) and an unreserved set (`A-Z a-z 0-9 - . _ ~`). Unreserved characters never need encoding; anything else that would collide with structure is replaced by `%` plus two hex digits of its UTF-8 byte. A literal space becomes `%20`. The second context is application/x-www-form-urlencoded, the media type browsers use for HTML form submissions in POST bodies and (historically) GET query strings. It is a modified percent-encoding inherited from RFC 1738: spaces become `+` instead of `%20`, and a literal `+` must be sent as `%2B`. Decoders that treat a query string as form data will silently turn `+` back into a space, while strict RFC 3986 parsers treat `+` as a sub-delimiter that means itself. The third context is the WHATWG URL Standard, the living spec implemented by browsers. It defines several percent-encode sets (path, query, fragment, userinfo, component) so the same character may be encoded in the query but not in the path. JavaScript's `encodeURI` preserves reserved delimiters so a whole URL stays parseable, while `encodeURIComponent` encodes them, which is why it is the correct choice for a single value inside a path or query. Common bugs follow from mixing these contexts. A `+` inside a Gmail address such as `user+tag@gmail.com` is legal under RFC 5322, but a naive form encoder that does not promote it to `%2B` will deliver mail to `user tag`. Double-encoding (`%2520` instead of `%20`) appears when middleware encodes an already-encoded value, often when a redirect or template re-runs `encodeURIComponent`. URLs that paste cleanly into a browser address bar can still fail in `curl` because the shell strips quotes or expands `&` and `?` before the request is built. And because Punycode hides non-ASCII hostnames as `xn--` labels, what users see in the address bar may differ byte-for-byte from what is sent on the wire.

Have insights to add?

Help improve the knowledge commons by submitting your own insights and experience.

This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons with 92% confidence. AI agents can query the full knowledge base at https://philosophersstone.ee/api/v1/knowledge or via MCP server. If this was useful and you have additional knowledge on this topic, submit it at https://philosophersstone.ee/api/v1/submit to help others find it instantly.