urlresolverapi

module
v0.0.0-...-f68f6f9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 6, 2024 License: MIT

README ¶

urlresolverapi

GoDoc Build status Coverage

A golang HTTP server that uses github.com/mccutchen/urlresolver to "resolve" a URL into its canonical form by following any redirects, normalizing query parameters, and attempting to fetch its title.

It is used by Thresholderbot to resolve URLs found in tweets, which tend to be wrapped in one or more URL shorteners (t.co, bit.ly, etc).

A publicly-available instance is available at https://api.urlresolver.com, (deployed on fly.io).

API

There is a single API endpoint, /resolve, seen here resolving a t.co URL that redirects a few times before ending up at the New York Times:

GET https://api.urlresolver.com/resolve?url=https://t.co/1AuEh8FMK0?amp=1

{
  "given_url": "https://t.co/1AuEh8FMK0?amp=1",
  "resolved_url": "https://www.nytimes.com/2021/08/25/style/lil-nas-x.html",
  "title": "Some Said Lil Nas X Was a One-Hit Wonder. They Were Wrong. - The New York Times",
  "intermediate_urls": [
    "https://t.co/1AuEh8FMK0?amp=1",
    "https://nyti.ms/3BlpRIR",
    "https://trib.al/4OR3gtI"
  ]
}

A 200 OK status code means that the API successfully resolved the URL and extracted a title.

A 203 Non-Authoritative Information status code means that the API ran into an error while resolving the URL and/or extracting a title (e.g. a request timeout). In this case, the API returns as much information as it can; it will at least return a normalized/canonicalized and potentially partially-resolved resolved_url value.

🔒 Access control

Because this server can be used to generate load on arbitrary other web sites, it should not be deployed without some form of authentication or rate limiting in place.

👉 TL;DR:

  • The server allows both authenticated and unauthenticated requests by default
  • Rate limits are applied only to unauthenticated requests
  • To require authentication for all requests, set the rate limit to 0
Authentication

We currently support an extremely simple token-based authentication mechanism. Specify valid tokens as a list of comma-separated client-id:token-value pairs:

AUTH_TOKENS="client-a:abcd1234,client-a:efgh56789,client-b:1234abcd"

(This format allows for token rotation while keeping the client ID consistent for observability purposes.)

Authentication information is used to determine whether or not to rate limit requests, and recorded in the server's instrumentation to identify known clients.

Rate limiting

Rate limits are applied per-instance, using an in-process token bucket rate limiting implementation. So, if you set the rate limit to 10 req/sec and scale the server up to 5 instances, the effective rate limit for will be 50 req/sec.

Rate limits are applied only to unauthenticated clients, and all unauthenticated clients share one global rate limit.

Configuration

This server can be configured via CLI arguments or environment variables. See the --help output for all of the configurable parameters:

urlresolverapi --help
Usage of urlresolverapi:
  -auth-tokens string
      Comma-separated list of valid auth tokens in "client-id:token-value" format for which rate limiting is disabled
  -burst-limit int
      Allowed bursts over rate limit (if rate limit >= 0) (default 2)
  -cache-ttl duration
      TTL for cached results (if caching enabled) (default 120h0m0s)
  -client-patience duration
      How long to wait for slow clients to write requests or read responses (default 1s)
  -debug-port int
      Port on which to expose pprof/expvar debugging endpoints (disabled if == 0) (default 6060)
  -honeycomb-api-key string
      Honeycomb API key (enables sending telemetry data to honeycomb)
  -honeycomb-dataset string
      Honeycomb dataset for telemetry data (default "urlresolverapi")
  -honeycomb-sample-rate uint
      Sample rate for telemetry data (1/N events will be submitted) (default 1)
  -honeycomb-service-name string
      Service name for telemetry data (default "urlresolverapi")
  -idle-cx-ttl duration
      TTL for idle connections (default 1m30s)
  -max-idle-cx-per-host int
      Max idle connections per host (default 10)
  -port int
      Port to listen on (default 8080)
  -rate-limit float
      Per-second rate limit for anonymous clients (disabled if <= 0)
  -redis-timeout duration
      Timeout for redis operations (if caching enabled) (default 150ms)
  -redis-url string
      Redis connection URL (enables caching)
  -request-timeout duration
      Overall timeout on a single resolve request, including any redirects (default 10s)

Profiling notes

The app exposes Go's expvar and net/http/pprof endpoints on port 6060, which should not be exposed to the outside world.

When deployed on fly.io, you can access those endpoints via a secure WireGuard VPN connection. First, follow their Private Network VPN setup guide.

With that VPN enabled, you can now access the internal endpoints by hostname (which will connect to 1/N instances of the app):

go tool pprof urlresolverapi-production.internal:6060/debug/pprof/allocs

Or you can connect to a specific instance by first getting its private IPv6 address and then connecting directly to it:

$ dig +short aaaa urlresolverapi-production.internal
fdaa:0:2530:a7b:ab2:5f9e:d6b1:2
fdaa:0:2530:a7b:ab3:97d6:9ae8:2

$ go tool pprof "[$(dig +short aaaa urlresolverapi-production.internal | sort | head -1)]:6060/debug/pprof/allocs"

Directories ¶

Path Synopsis
cmd
pkg
httphandler
Package httphandler provides a basic net/http http.Handler implementation that resolves URLs.
Package httphandler provides a basic net/http http.Handler implementation that resolves URLs.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL