Page Speed Optimization Libraries  1.4.26.1
net/instaweb/util/public/url_to_filename_encoder.h File Reference
#include <cstddef>
#include "net/instaweb/util/public/string.h"


Classes

class  net_instaweb::UrlToFilenameEncoder
 Helper class for converting a URL into a filename.

Namespaces

namespace  net_instaweb
 




Detailed Description

jmarantz@google.com (Joshua Marantz)

URL filename encoder goals:

1. Allow URLs with arbitrary path-segment length, generating filenames with a maximum of 128 characters.
2. Provide somewhat human-readable filenames, for easy debugging.
3. Provide reverse-mapping from filenames back to URLs.
4. Be able to distinguish http://x from http://x/ from http://x/index.html. Those can all be different URLs.
5. Be able to represent http://a/b/c and http://a/b/c/d, a pattern seen with Facebook Connect.

We need an escape-character for representing characters that are legal in URL paths, but not in filenames, such as '?'.

We can pick any legal character as the escape, as long as we escape it too. Because we want filenames that humans can correlate with URLs, we should pick one that does not show up frequently in URLs. Candidates are ~`!#$%^&()-=_+{}[],. but we would prefer to avoid characters that are shell escapes or that blaze or g4 do not like.

.#&%-=_+ occur frequently in URLs.
<>:"/\|?* are illegal in Windows. See http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
~`!$^&(){}[]'; are special to Unix shells.
In addition, blaze does not like ^@
Perforce does not like #%
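
The exclusion lists above can be applied mechanically to the candidate set. The following is a minimal sketch, not code from the library: the function name RemainingEscapeCandidates is hypothetical, and the character lists are copied from the text above.

```cpp
#include <string>

// Hypothetical helper (not part of UrlToFilenameEncoder): filters the
// candidate escape characters against every exclusion list in the text.
std::string RemainingEscapeCandidates() {
  const std::string candidates = "~`!#$%^&()-=_+{}[],.";
  const std::string rejected =
      ".#&%-=_+"         // occur frequently in URLs
      "<>:\"/\\|?*"      // illegal in Windows filenames
      "~`!$^&(){}[]';"   // special to Unix shells
      "^@"               // disliked by blaze
      "#%";              // disliked by Perforce
  std::string remaining;
  for (char c : candidates) {
    if (rejected.find(c) == std::string::npos) {
      remaining.push_back(c);  // Survives every exclusion list.
    }
  }
  return remaining;
}
```

With these lists, the only candidate that survives is ','.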

Josh took a quick look at the frequency of some special characters in Sadeesh's slurped directory from Fall 09 and found the following occurrences:

Char  Count  Notes
^         3  blaze doesn't like ^ in testdata filenames
@        10  blaze doesn't like @ in testdata filenames
.      1676  too frequent in URLs
,        76  THE WINNER
#         0  g4 doesn't like it
&       487  Prefer to avoid shell escapes
%       374  g4 doesn't like it
=       579  very frequent in URLs -- leave unmodified

The escaping algorithm is:

1) Escape all unfriendly symbols as ,XX where XX is the hex code.
2) Add a ',' at the end (we do not allow ',' at the end of any directory name, so this ensures that e.g. /a and /a/b can coexist in the filesystem).
3) Go through the path segment by segment (where a segment is one directory or leaf in the path) and
   3a) If the segment is empty, escape the second slash, i.e. if it was www.foo.com//a then we escape the second / like www.foo.com/,2Fa,
   3b) If it is "." or "..", prepend it with ',' (so that we have a non-empty and non-reserved filename).
   3c) If it is over 128 characters, break it up into smaller segments by inserting ,-/ (Windows limits paths to 128 chars; other OSes also have limits that would restrict us).
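
The steps above can be sketched as follows. This is an illustrative sketch, not the UrlToFilenameEncoder implementation: the names IsSafeChar, AppendEscaped, FlushSegment and EncodePathSketch are hypothetical, the set of characters passed through unescaped (alphanumerics and _ . = + -) is an assumption, and the splitting of segments longer than 128 characters is left out for brevity.

```cpp
#include <cstddef>
#include <string>

// Assumption: only these characters are copied through unescaped.
bool IsSafeChar(unsigned char ch) {
  return (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') ||
         (ch >= 'a' && ch <= 'z') || ch == '_' || ch == '.' ||
         ch == '=' || ch == '+' || ch == '-';
}

// Appends ch as ",XX", where XX is its upper-case hex code.
void AppendEscaped(unsigned char ch, std::string* out) {
  static const char kHex[] = "0123456789ABCDEF";
  out->push_back(',');
  out->push_back(kHex[ch >> 4]);
  out->push_back(kHex[ch & 0x0F]);
}

// Writes a finished segment, guarding "." and ".." with a leading ','.
void FlushSegment(std::string* segment, std::string* out) {
  if (*segment == "." || *segment == "..") out->push_back(',');
  out->append(*segment);
  segment->clear();
}

// Hypothetical sketch of the escaping steps (no long-segment splitting).
std::string EncodePathSketch(const std::string& path) {
  std::string out, segment;
  std::size_t i = 0;
  // Copy a leading '/' as-is so it is never treated as an empty segment.
  if (!path.empty() && path[0] == '/') {
    out.push_back('/');
    ++i;
  }
  for (; i < path.size(); ++i) {
    const unsigned char ch = static_cast<unsigned char>(path[i]);
    if (ch == '/' && !segment.empty()) {
      FlushSegment(&segment, &out);  // End of a non-empty segment.
      out.push_back('/');
    } else if (IsSafeChar(ch)) {
      segment.push_back(static_cast<char>(ch));
    } else {
      // Unfriendly symbol, ',' itself, or a '/' that would otherwise
      // produce an empty segment.
      AppendEscaped(ch, &segment);
    }
  }
  segment.push_back(',');  // Trailing ',' lets /a and /a/b coexist.
  FlushSegment(&segment, &out);
  return out;
}
```

Under these assumptions, EncodePathSketch("/u?foo=bar") returns "/u,3Ffoo=bar," and EncodePathSketch("www.foo.com//a") returns "www.foo.com/,2Fa,".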

For example:

URL                File
/                  /,
/index.html        /index.html,
/.                 /.,
/a/b               /a/b,
/a/b/              /a/b/,
/a/b/c             /a/b/c,   Note: no prefix problem
/u?foo=bar         /u,3Ffoo=bar,
//                 /,2F,
/./                /,./,
/../               /,../,
/,                 /,2C,
/,./               /,2C./,
/very...longname/  /very...long,-/name   If very...long is about 126 long.
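
The reverse mapping named in the goals can be illustrated by undoing each escape in the examples above. This is a minimal sketch under stated assumptions, not the library's decoder: the names HexValue and DecodePathSketch are hypothetical, hex codes are assumed to be upper-case, and a ',' followed by '.' is assumed to always be the guard for a "." or ".." segment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of an upper-case hex digit, or -1.
int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical sketch: decodes an encoded filename back into a URL path.
// Returns false if the input does not follow the encoding described above.
bool DecodePathSketch(const std::string& name, std::string* url) {
  url->clear();
  // Every encoded name ends with the ',' appended by the encoder.
  if (name.empty() || name.back() != ',') return false;
  const std::size_t end = name.size() - 1;  // Index of the trailing ','.
  std::size_t i = 0;
  while (i < end) {
    const char ch = name[i];
    if (ch != ',') {
      url->push_back(ch);  // Ordinary character: copy through.
      ++i;
    } else if (i + 1 < end && name[i + 1] == '.') {
      ++i;  // ',' guarding a "." or ".." segment: drop the ','.
    } else if (i + 2 < end && name[i + 1] == '-' && name[i + 2] == '/') {
      i += 3;  // ",-/" marks a split long segment: drop all three.
    } else if (i + 2 < end && HexValue(name[i + 1]) >= 0 &&
               HexValue(name[i + 2]) >= 0) {
      // ",XX" escape: restore the original byte.
      url->push_back(static_cast<char>(HexValue(name[i + 1]) * 16 +
                                       HexValue(name[i + 2])));
      i += 3;
    } else {
      return false;  // Stray ',' that matches no known escape.
    }
  }
  return true;
}
```

A name without the trailing ',' (for example "/a/b") is rejected by this sketch, since the encoding described above always appends one.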
