Mark Nottingham

Friday Fun: Percent Encoding

Friday, 30 June 2006

If you boil down the BNF in both RFC2396 and RFC3986, path segments can contain the following characters without percent-encoding them:

ALPHA DIGIT ! $ & ' ( ) * + , - . : ; = @ _ ~

Query components can contain these:

ALPHA DIGIT ! $ & ' ( ) * + , - . / : ; = ? @ _ ~

Which means that

" < > [ \ ] ^ ` { | }

should always be encoded in both (discounting non-ASCII characters, for now).
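As a rough check of the lists above, here is a minimal sketch that derives them from the RFC 3986 character classes (unreserved, sub-delims, pchar, query). The names `PCHAR`, `QUERY`, `SPECIAL` and `always_encode` are made up for illustration. It assumes "#" and "%" are left out of the comparison, since they act as the fragment delimiter and the escape character rather than as content characters.

```python
import string

# Character classes as defined in RFC 3986 (sections 2.2, 2.3, 3.3 and 3.4).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")  # allowed raw in a path segment
QUERY = PCHAR | set("/?")                    # allowed raw in a query component

# Printable ASCII without the space character (0x21..0x7E).
PRINTABLE = {chr(code) for code in range(0x21, 0x7F)}

# Assumption: "#" (fragment delimiter) and "%" (escape character) are
# excluded from the comparison because they are never plain content.
SPECIAL = set("#%")


def always_encode():
    """Printable ASCII characters allowed raw in neither paths nor queries."""
    return PRINTABLE - PCHAR - QUERY - SPECIAL


if __name__ == "__main__":
    print("path only :", "".join(sorted(PCHAR - UNRESERVED)))
    print("query adds:", "".join(sorted(QUERY - PCHAR)))
    print("encode    :", "".join(sorted(always_encode())))
```

Under those assumptions the last line prints the same eleven characters listed above.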

If you’re specifying the format of an HTTP URI, this is important; you want to be able to tell people which characters have special meaning, and when to encode them if they’re part of content. When implementations automatically percent-encode some characters, it can cause problems – especially when the behaviour differs from implementation to implementation.

Note that I’m not (necessarily) saying that the latter characters should always be escaped; Web servers seem to support them in their raw form just fine, and some less fastidious Web developers may forget to un-escape them. I’m more interested in those characters that are unnecessarily escaped, which would cause trouble in some situations.
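To make "encode only what needs encoding" concrete, here is a small sketch using `urllib.parse.quote` from the Python 3 standard library, with the `safe` argument set to the characters listed earlier. The helper names `encode_path_segment` and `encode_query` are hypothetical, and the sketch assumes Python 3.7 or later, where `quote` leaves "~" alone.

```python
from urllib.parse import quote, unquote

# Characters (besides letters, digits and "-._~") that may appear raw.
PATH_SEGMENT_SAFE = "!$&'()*+,;=:@"
QUERY_SAFE = PATH_SEGMENT_SAFE + "/?"


def encode_path_segment(text):
    """Percent-encode text for use as content inside one path segment."""
    return quote(text, safe=PATH_SEGMENT_SAFE)


def encode_query(text):
    """Percent-encode text for use as content inside a query component."""
    return quote(text, safe=QUERY_SAFE)


if __name__ == "__main__":
    # Allowed characters pass through untouched.
    print(encode_path_segment("!$&'()*+,-.:;=@_~"))
    # The "always encode" characters are escaped.
    print(encode_path_segment('"<>[\\]^`{|}'))
    # "/" and "?" are content in a segment, but allowed raw in a query.
    print(encode_path_segment("a/b?c"), encode_query("a/b?c"))
    # Decoding restores the original text.
    print(unquote(encode_path_segment("a/b?c")))
```

Note that a format which gives "&", "=" or ";" special meaning inside the query would need to drop those characters from the safe set when they appear in content.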

The Test

Try using your favourite resolver to access this URL:

https://www.mnot.net/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"<>[\]^`{|}

and post the results in comments. I’m particularly interested in results from Java, .NET, Perl and Ruby libraries.

Here it is as a link, and using JavaScript (ditto).

Here are a few preliminary results:

Safari

Pasted into the location bar.

Safari will escape angle brackets (“<>”) in a followed link (e.g., a/@href, using XHR), but not if you paste it directly into the location bar.

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/418.8 (KHTML, like Gecko) Safari/419.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"[\]^`{|}
Path
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_`bceghinoru{|}~
Query
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_`{|}~

Firefox

Pasted into the location bar.

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "<>[\]^`{}
  Unencoded: !$&'()*+,-./:;=@_bceghinoru|~
Query
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

However, Firefox treats the last path segment differently (note the missing “/”):

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[\]^%60{|}?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "`
  Unencoded: !$&'()*+,-./:;=@[\]^_bceghinoru{|}~
Query
    Encoded: "`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

Opera

Pasted into the location bar.

Opera silently transforms backslashes (“\”) to forward slashes (“/”) in the path (but not the query).

User-Agent: Opera/9.00 (Macintosh; PPC Mac OS X; U; en)
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[/]^%60{|}/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=@[]^_bceghinoru{|}~
Query
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

Curl

> curl -g --url `cat file.url`

User-Agent: curl/7.15.4 (powerpc-apple-darwin8.6.0) libcurl/7.15.4 OpenSSL/0.9.8b zlib/1.2.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"[\]^{|}
Path
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~

WGet

> wget -i file.url --output-document=-

User-Agent: Wget/1.10.2
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[%5C]%5E%7B%7C%7D?!$&'()*+,-./:;=/?@_~%22%3C%3E[%5C]%5E%7B%7C%7D
Path
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=@[]_bceghinoru~
Query
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=?@[]_~

Python

import urllib; print urllib.urlopen(url).read()

User-Agent: Python-urllib/1.16
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"[\]^{|}
Path
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
    Encoded:
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~
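For anyone who wants to produce a report in the same shape from their own server logs, here is a minimal sketch of how the Encoded and Unencoded lines above could be computed from a Request URI. This is a guess at the logic, not the echo-uri script itself; `classify` and `report` are hypothetical names.

```python
import re

# Matches one percent-escape such as %3C.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def classify(component):
    """Return (encoded, unencoded) character lists for one URI component.

    encoded:   the characters that arrived as %XX escapes, decoded
    unencoded: every other character that arrived raw
    Both are de-duplicated and sorted by code point.
    """
    encoded = {chr(int(hex_pair, 16)) for hex_pair in _ESCAPE.findall(component)}
    unencoded = set(_ESCAPE.sub("", component))
    return "".join(sorted(encoded)), "".join(sorted(unencoded))


def report(request_uri):
    """Split at the first "?" and classify the path and the query."""
    path, _, query = request_uri.partition("?")
    return {"path": classify(path), "query": classify(query)}


if __name__ == "__main__":
    # The first Firefox Request URI from the results above.
    firefox = (r"/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/"
               r"?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}")
    for name, (encoded, unencoded) in report(firefox).items():
        print(name)
        print("    Encoded:", encoded)
        print("  Unencoded:", unencoded)
```

Run against the first Firefox Request URI, this reproduces the Encoded and Unencoded lines shown in that result, including the letters contributed by "cgi-bin/echo-uri".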