purell

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 24, 2012 License: BSD-3-Clause Imports: 7 Imported by: 0

README

Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the wikipedia paper and the RFC 3986 document.

build status

Install

go get github.com/PuerkitoBio/purell

Examples

From example_test.go (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its methods and constants with "purell."):

package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}

API

As seen in the examples above, purell offers three methods, NormalizeURLString(string, NormalizationFlags) (string, error), MustNormalizeURLString(string, NormalizationFlags) (string) and NormalizeURL(*url.URL, NormalizationFlags) (string). They all normalize the provided URL based on the specified flags. Here are the available flags:

const (
  // Safe normalizations
  FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host
  FlagLowercaseHost                                            // http://HOST -> http://host
  FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
  FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
  FlagRemoveDefaultPort                                        // http://host:80 -> http://host
  FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

  // Usually safe normalizations
  FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
  FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
  FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

  // Unsafe normalizations
  FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
  FlagRemoveFragment         // http://host/path#fragment -> http://host/path
  FlagForceHTTP              // https://host -> http://host
  FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
  FlagRemoveWWW              // http://www.host/ -> http://host/
  FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
  FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

  // Normalizations not in the wikipedia article, required to cover tests cases
  // submitted by jehiah
  FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
  FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
  FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
  FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
  FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

  // Convenience set of safe normalizations
  FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

  // For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
  // while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

  // Convenience set of usually safe normalizations (includes FlagsSafe)
  FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
  FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

  // Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
  FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
  FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

  // Convenience set of all available flags
  FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
  FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)

For convenience, the set of flags FlagsSafe, FlagsUsuallySafe[Greedy|NonGreedy], FlagsUnsafe[Greedy|NonGreedy] and FlagsAll[Greedy|NonGreedy] are provided for the similarly grouped normalizations on wikipedia's URL normalization page. You can add (using the bitwise OR | operator) or remove (using the bitwise AND NOT &^ operator) individual flags from the sets if required, to build your own custom set.

The full godoc reference is available on gopkgdoc.

Some things to note:

  • FlagDecodeUnnecessaryEscapes, FlagEncodeNecessaryEscapes, FlagUppercaseEscapes and FlagRemoveEmptyQuerySeparator are always implicitly set, because internally, the URL string is parsed as an URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary ? at the end of the url). So this operation cannot not be done. For this reason, FlagRemoveEmptyQuerySeparator (as well as the other three) has been included in the FlagsSafe convenience set, instead of FlagsUnsafe, where Wikipedia puts it (strangely?).

  • The FlagDecodeUnnecessaryEscapes decodes the following escapes (from -> to): - %24 -> $ - %26 -> & - %2B-%3B -> +,-./0123456789:; - %3D -> = - %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ - %5F -> _ - %61-%7A -> abcdefghijklmnopqrstuvwxyz - %7E -> ~ - When the test runs on Travis-ci, it fails because more escapes are decoded than on my machine, so it is either machine/OS-dependent or it is because they run a different Go version (which they do, they run go1 while I run go1.0.2 so this is very likely the cause). To avoid a failing build banner on GitHub, I commented-out these specific tests (TestDecodeUnnecessaryEscapesAll(), TestEncodeNecessaryEscapesAll(), TestGoVersion()).

  • When the NormalizeURL function is used (passing an URL object), this source URL object is modified (that is, after the call, the URL object will be modified to reflect the normalization).

  • The replace IP with domain name normalization (http://208.77.188.166/ → http://www.example.com/) is obviously not possible for a library without making some network requests. This is not implemented in purell.

  • The remove unused query string parameters and remove default query parameters are also not implemented, since this is a very case-specific normalization, and it is quite trivial to do with an URL object.

Safe vs Usually Safe vs Unsafe

Purell allows you to control the level of risk you take while normalizing an URL. You can aggressively normalize, play it totally safe, or anything in between.

Consider the following URL:

HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

Normalizing with the FlagsSafe gives:

https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

With the FlagsUsuallySafeGreedy:

https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid

And with FlagsUnsafeGreedy:

http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3

TODOs

  • Try to make jehiah's tests pass (more exactly, support IDNA domains/utf8/unicode encoding/escaping). For now theses tests are commented out.
  • Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes (^|/)((?:default|index)\.\w{1,4})$.

Thanks / Contributions

@rogpeppe @jehiah

License

The BSD 3-Clause license.

Documentation

Overview

Package purell offers URL normalization as described on the wikipedia page: http://en.wikipedia.org/wiki/URL_normalization

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func MustNormalizeURLString

func MustNormalizeURLString(u string, f NormalizationFlags) string

MustNormalizeURLString returns the normalized string, and panics if an error occurs. It takes an URL string as input, as well as the normalization flags.

Example
normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
	FlagsUnsafeGreedy)
fmt.Print(normalized)
Output:

http://somewebsite.com/Amazing%FA/url

func NormalizeURL

func NormalizeURL(u *url.URL, f NormalizationFlags) string

NormalizeURL returns the normalized string. It takes a parsed URL object as input, as well as the normalization flags.

Example
if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
	panic(err)
} else {
	normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
	fmt.Print(normalized)
}
Output:

http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0

func NormalizeURLString

func NormalizeURLString(u string, f NormalizationFlags) (string, error)

NormalizeURLString returns the normalized string, or an error if it can't be parsed into an URL object. It takes an URL string as input, as well as the normalization flags.

Example
if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
	FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
	panic(err)
} else {
	fmt.Print(normalized)
}
Output:

http://somewebsite.com:80/Amazing%3F/url/

Types

type NormalizationFlags

type NormalizationFlags uint

A set of normalization flags determines how a URL will be normalized.

const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL