progszy

Version: v0.0.12
Published: Feb 3, 2022
License: BSD-3-Clause

README


progszy is a hard-caching HTTP(S) proxy server (with programmatic cache management), designed for use as part of a data-scraping pipeline.

  • Brings stable reproducibility to web data-scraping pipelines.
  • Improves web scraper development workflow, via fast controlled caching of HTTP responses.
  • Improves debugging of failed live scrapes: download the cached data for consistent, reproducible local re-runs.
  • Improves scraper performance when resuming partial/incomplete scrapes.
  • Fast and compact data caching: SQLite database, Zstandard compressed HTTP body.

It is both a standalone executable CLI program, and a Go package.

It is not suitable for use as a regular HTTP(S) caching proxy for humans surfing with web browsers.

progszy should work with any HTTP client, but currently has only been tested with Go's http.Client.

Caching

Cached content is persisted in an SQLite database, using Zstandard compression, enabling cached content to be retrieved faster than regular file system reads, while also providing convenient packaging of cached content and saving storage space.

A separate single-file database is created per domain, to cache its respective content (that is: content is 'binned' according to the root domain name). Database filenames also contain a creation timestamp.

For example, request responses for http://www.example.com/index.html and http://foo.bar.example.com/index.html will both get cached in the same database, having a filename like example.com-2020-03-20-1640.sqlite.
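The binning described above can be sketched roughly as follows. This is an illustration only, not progszy's implementation: the helper names naiveBaseDomain and binFilename are hypothetical, the timestamp layout is inferred from the example filename, and the domain logic simply keeps the last two hostname labels (which is wrong for suffixes such as co.uk).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"time"
)

// exampleTime is an arbitrary creation time used for the demonstration.
var exampleTime = time.Date(2020, time.March, 20, 16, 40, 0, 0, time.UTC)

// naiveBaseDomain keeps only the last two labels of the hostname.
// This is a deliberate simplification (an assumption of this sketch):
// it gives the wrong answer for multi-part suffixes such as "co.uk".
func naiveBaseDomain(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		return "", fmt.Errorf("no hostname in %q", rawURL)
	}
	labels := strings.Split(host, ".")
	if len(labels) > 2 {
		labels = labels[len(labels)-2:]
	}
	return strings.Join(labels, "."), nil
}

// binFilename builds a name shaped like the example in the text:
// <base domain>-<YYYY-MM-DD-HHMM>.sqlite
func binFilename(rawURL string, created time.Time) (string, error) {
	base, err := naiveBaseDomain(rawURL)
	if err != nil {
		return "", err
	}
	return base + "-" + created.Format("2006-01-02-1504") + ".sqlite", nil
}

func main() {
	for _, raw := range []string{
		"http://www.example.com/index.html",
		"http://foo.bar.example.com/index.html",
	} {
		name, err := binFilename(raw, exampleTime)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", name)
	}
}
```

Both URLs map to example.com-2020-03-20-1640.sqlite in this sketch, matching the example above.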

We may review/change this binning/naming strategy at a later date.

Caching Strategy

progszy intentionally makes no use of HTTP headers relating to cached content control that are normally utilised by browsers and other caching proxies.

The body content and appropriate headers for all 200 OK responses are hard-cached, unless the body matches a given filter (see X-Cache-Reject, below).

Content exceeding an arbitrary maximum body size of 128 MB is neither cached nor proxied; instead, a 412 Precondition Failed response is returned to the client. We may review this decision/behaviour at a later date.

Cache eviction/management is manual-only at present. Later we will add a REST API for programmatic cache management.

HTTP(S) Proxy

The CLI version of progszy operates as a standalone HTTP(S) proxy server. By default it listens on port 5595, for which the client's proxy configuration URL would be http://127.0.0.1:5595. Currently progszy binds only to 127.0.0.1, so it cannot be reached from a remote machine without something like an SSH tunnel.

Incoming requests can be either vanilla HTTP, or can be HTTPS (using CONNECT protocol).

When proxying HTTPS requests, the connection is intercepted by a man-in-the-middle (MITM) hijack, to allow both caching and the application of rules, and the resulting outbound stream is then re-encrypted using a private certificate, before being passed to the client. Note that clients wishing to proxy HTTPS requests using progszy will need specific configuration to prevent/ignore the resulting certificate mismatch errors caused by this process. See tests for an example of how this is done in Go.
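As a rough illustration of the client-side configuration mentioned above, the following sketch builds a Go http.Client that routes requests through a local progszy instance and skips certificate verification, so that the re-encrypted stream is accepted. It is not taken from the project's tests; the helper name newProxyClient and the 30 second timeout are assumptions. Disabling verification is only reasonable for a scraping client pointed at a proxy you control.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// newProxyClient returns a client that sends all requests via the proxy at
// proxyAddr and does not verify TLS certificates. (Hypothetical helper.)
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	transport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
		// The proxy presents its own certificate for intercepted HTTPS
		// connections, so normal verification would fail.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}, nil
}

func main() {
	// Default listen address from the text above.
	client, err := newProxyClient("http://127.0.0.1:5595")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	if len(os.Args) < 2 {
		fmt.Println("usage: pass a URL to fetch through the proxy")
		return
	}
	resp, err := client.Get(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	n, _ := io.Copy(io.Discard, resp.Body)
	fmt.Println(resp.Status, "X-Cache:", resp.Header.Get("X-Cache"), n, "bytes")
}
```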

Outgoing HTTP requests utilise automatic retries with exponential backoff. Internal HTTP clients use a shared transport with pooling, and support upstream proxy chaining. Connections are not explicitly rate-limited.
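For readers unfamiliar with the term, retrying with exponential backoff can be sketched as below. This is a generic illustration and not progszy's actual retry code: the function names, the doubling factor, and the delay constants are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative defaults only; progszy's real values are not documented here.
const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 5 * time.Second
)

// backoff returns the wait before retry number attempt (0-based):
// base doubled attempt times, capped at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	if d > max {
		return max
	}
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

// retry calls fn until it succeeds or attempts (at least 1) are used up,
// sleeping for an exponentially growing delay between tries.
func retry(attempts int, base, max time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoff(i, base, max))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Short delays keep the demonstration fast.
	err := retry(4, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```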

Currently, progszy only supports the HTTP GET, HEAD and CONNECT methods. Support for HEAD is of little practical use in this context, and exists mainly for spec compliance.

HTTP Headers

progszy makes use of custom HTTP X-* headers to both control features and report status to the client.

Request Headers
  • X-Cache-Reject headers control early rejection/filtering of incoming content. Each header value is compiled into a regexp reject rule: if the content body matches any filter, the request response is not cached, and instead a 412 Precondition Failed is returned to the client. See tests for example usage. Note that cache hits (requests for already cached content) are not currently affected by the use of this header.
  • X-Cache-SSL: INSECURE forces use of an internal HTTP client configured to skip SSL certificate validation during the upstream/outbound request. See tests for example usage.
  • X-Cache-Flush: TRUE forces the creation of a new cache database bin for the requested URL.

Incoming X-* headers are not copied to outgoing requests.
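The request headers listed above might be set from Go along these lines. This is a sketch rather than the project's own test code: the helper newScrapeRequest and its parameters are hypothetical, and it assumes the reject patterns use Go regexp syntax (progszy is written in Go, but the exact dialect is not stated here).

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// newScrapeRequest builds a GET request carrying progszy control headers.
// One X-Cache-Reject header is added per pattern. (Hypothetical helper.)
func newScrapeRequest(target string, insecure, flush bool, rejectPatterns ...string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	for _, p := range rejectPatterns {
		// Compile locally first so that a malformed pattern fails fast,
		// before any request is sent. Assumes Go regexp syntax.
		if _, err := regexp.Compile(p); err != nil {
			return nil, fmt.Errorf("bad reject pattern %q: %w", p, err)
		}
		req.Header.Add("X-Cache-Reject", p)
	}
	if insecure {
		req.Header.Set("X-Cache-SSL", "INSECURE")
	}
	if flush {
		req.Header.Set("X-Cache-Flush", "TRUE")
	}
	return req, nil
}

func main() {
	req, err := newScrapeRequest(
		"http://www.example.com/index.html",
		true,  // skip upstream certificate validation
		false, // do not force a new cache bin
		`(?i)access denied`, `captcha`,
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	fmt.Println("X-Cache-Reject:", req.Header.Values("X-Cache-Reject"))
	fmt.Println("X-Cache-SSL:", req.Header.Get("X-Cache-SSL"))
}
```

When such a request is sent through the proxy, a response status of http.StatusPreconditionFailed (412) indicates that the body matched one of the reject rules.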

Response Headers
  • X-Cache value will be HIT, MISS or FLUSHED accordingly. For cache hits and misses, the following headers are also present:
  • X-Cache-Timestamp indicates when the content was originally cached (RFC3339 format with nanosecond precision).
  • Content-Length value is set accordingly.
  • Content-Type, Content-Language, ETag and Last-Modified headers from incoming responses all have their value persisted to the cache, and restored appropriately on outgoing responses to the client.
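A client could interpret the status headers above roughly as follows. The type cacheInfo and function parseCacheHeaders are hypothetical names for this sketch, and the sample timestamp is made up; only the header names, the three X-Cache values and the RFC3339 nanosecond format come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// cacheInfo summarises the cache-related response headers.
type cacheInfo struct {
	Status   string    // "HIT", "MISS" or "FLUSHED"
	CachedAt time.Time // zero when no timestamp header was sent
}

// parseCacheHeaders validates the X-Cache value and, when present, parses
// the X-Cache-Timestamp value as RFC3339 with nanosecond precision.
func parseCacheHeaders(xCache, xTimestamp string) (cacheInfo, error) {
	switch xCache {
	case "HIT", "MISS", "FLUSHED":
	default:
		return cacheInfo{}, fmt.Errorf("unexpected X-Cache value %q", xCache)
	}
	info := cacheInfo{Status: xCache}
	if xTimestamp == "" {
		return info, nil
	}
	t, err := time.Parse(time.RFC3339Nano, xTimestamp)
	if err != nil {
		return cacheInfo{}, fmt.Errorf("bad X-Cache-Timestamp: %w", err)
	}
	info.CachedAt = t
	return info, nil
}

func main() {
	// Sample values; a real client would read them from the response headers.
	info, err := parseCacheHeaders("HIT", "2020-03-20T16:40:00.123456789Z")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(info.Status, info.CachedAt.Format(time.RFC3339Nano))
}
```

With a real response, the two arguments would be resp.Header.Get("X-Cache") and resp.Header.Get("X-Cache-Timestamp").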

Installation

Binary Executable

Pre-built binary executables for Linux and Windows are available for download from the latest release page.

Build From Source

First, ensure you have a working Go environment. See Go 'Getting Started' documentation.

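Then fetch the code, build and install the binary:

go get github.com/jimsmart/progszy/cmd/progszy

On Go 1.18 and later, go get no longer builds or installs binaries; use go install instead:

go install github.com/jimsmart/progszy/cmd/progszy@latest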

By default, the resulting binary executable will be ~/go/bin/progszy (assuming no customisation has been made to $GOPATH or $GOBIN).

Usage Examples

Once built/installed, progszy can be invoked via the command line, as follows...

Get help / usage instructions:

$ ./progszy --help
Usage of ./progszy:
  -cache string
        Cache location (default "./cache")
  -port int
        Port number to listen on (default 5595)
  -proxy string
        Upstream HTTP(S) proxy URL (e.g. "http://10.0.0.1:8080")

Run progszy with default settings:

$ ./progszy
Cache location /<path-to-current-folder>/cache
Listening on port 5595

Run using custom configuration:

$ ./progszy -port=8080 -cache=/foo/bar/store -proxy=http://10.10.0.1:9000
Cache location /foo/bar/store
Upstream proxy http://10.10.0.1:9000
Listening on port 8080

Press control+c to halt execution — progszy will attempt to cleanly complete any in-flight connections before exiting.

Package Documentation

GoDocs https://godoc.org/github.com/jimsmart/progszy

Local GoDocs

Change folder to project root, and run:

godoc -http=:6060 -notes="BUG|TODO"

Open a web browser and navigate to http://127.0.0.1:6060/pkg/github.com/jimsmart/progszy/

Testing

To run the tests, execute go test inside the project root folder.

For a full coverage report, try:

go test -coverprofile=coverage.out && go tool cover -html=coverage.out

GitHub Build Automation

This repo uses the following GitHub Action workflow automations:

GitHub Actions

Documentation https://docs.github.com/en/actions

  • .github/workflows/build.yml - Automatically runs on all push actions to this repo. Builds project, runs go vet & golint, runs tests, reports coverage (coverage reporting is currently disabled, due to this repo currently being private).
  • .github/workflows/dummy-release.yml - Manually run pre-release workflow. Runs the same actions as the 'release' action (below), but skips publishing. Use this as a dry run, before pushing a version-tagged commit to the repo to trigger publication of a release.
  • .github/workflows/release.yml - Automatically runs on all push actions to this repo that specify a tag of format "v*.*.**". Installs cross-compilers, runs GoReleaser to build all target binaries, package tars/zips, and create a draft release using the resulting assets. Publication of this release must then be manually confirmed on GitHub (by choosing to edit the release, and pressing the green 'Publish release' button).

GoReleaser

Website https://goreleaser.com/

.goreleaser.yml contains GoReleaser configuration for release builds, handling cross-compilation, packaging and creation of a (draft) release on GitHub. It is invoked by the above mentioned GitHub Actions, 'release' and 'dummy release'.

Release Publication

1. Dry Run of Release Build Workflow

First, go to this repo's Actions page, and manually run the 'dummy release' action workflow, addressing any issues that may arise.

2. Tag Version & Push

Once the 'dummy release' action workflow completes successfully, make a version-tagged push to the repo, using a command similar to:

git tag v0.0.1 && git push origin v0.0.1

(Amending the version number accordingly)

On completion of the push, the 'release' action workflow will automatically begin execution. Wait for it to complete.

3. Confirm Publication

GoReleaser is configured here to only publish draft releases.

On successful completion of the 'release' workflow execution, go to the repo's releases page, find the new draft release, edit it (by clicking the pencil icon), check all is well, then click the green 'Publish release' button.

Project Dependencies

Packages used by progszy (and their licensing):

— Many thanks to the authors and contributors of these packages.

License

progszy is copyright 2020–2022 by Jim Smart and released under the BSD 3-Clause License.

History

  • v0.0.11 (2022-01-27) Test fixup. Updated dependencies.
  • v0.0.10 (2021-06-21) Require Go 1.15 instead of 1.16.
  • v0.0.9 (2021-06-21) Updated dependencies.
  • v0.0.3 (2021-04-21) Automated releases.
  • v0.0.1 (2021-04-20) Work in progress. Initial test release.

Documentation

Overview

Package progszy is a hard-caching HTTP(S) proxy server, using SQLite & Zstd.

Constants

This section is empty.

Variables

var ErrCacheMiss = errors.New("progszy: cache miss")

ErrCacheMiss occurs when a given URL is not in the cache.

Functions

func BaseDomainName

func BaseDomainName(u *url.URL) (string, error)

func NormalisePath

func NormalisePath(u *url.URL)

func NormaliseQuery

func NormaliseQuery(u *url.URL) error

func ProxyHandlerWith

func ProxyHandlerWith(cache Cache, proxy *url.URL) http.Handler

func Run

func Run(addr, cachePath string, proxy *url.URL) error

Run a server, blocking until we receive OS interrupt (ctrl-C).

Types

type Cache

type Cache interface {
	Get(uri string) (*CacheRecord, error)
	Put(cr *CacheRecord) error
	CloseAll() error
	Flush(uri string) error
}
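To show what satisfying this interface involves, here is a minimal in-memory sketch, which might be handy as a test double. So that it compiles on its own, it declares local stand-ins for Cache, CacheRecord (reduced to two fields) and ErrCacheMiss instead of importing progszy. The memCache type is hypothetical, it performs no URL normalisation, and its Flush simply drops one entry, which is an assumption and not how the SQLite-backed cache behaves.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Local stand-ins mirroring the declarations in this package documentation.
var ErrCacheMiss = errors.New("progszy: cache miss")

// CacheRecord is reduced to the fields this sketch needs.
type CacheRecord struct {
	Key    string
	Status int
}

type Cache interface {
	Get(uri string) (*CacheRecord, error)
	Put(cr *CacheRecord) error
	CloseAll() error
	Flush(uri string) error
}

// memCache is a hypothetical map-backed Cache, safe for concurrent use.
type memCache struct {
	mu      sync.RWMutex
	records map[string]*CacheRecord
}

func newMemCache() *memCache {
	return &memCache{records: make(map[string]*CacheRecord)}
}

// Get returns the stored record, or ErrCacheMiss when the key is unknown.
func (c *memCache) Get(uri string) (*CacheRecord, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cr, ok := c.records[uri]
	if !ok {
		return nil, ErrCacheMiss
	}
	return cr, nil
}

// Put stores the record under its Key, replacing any previous entry.
func (c *memCache) Put(cr *CacheRecord) error {
	if cr == nil || cr.Key == "" {
		return errors.New("memCache: record needs a key")
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[cr.Key] = cr
	return nil
}

// Flush removes the single entry for uri (an assumption of this sketch).
func (c *memCache) Flush(uri string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.records, uri)
	return nil
}

// CloseAll has nothing to release for an in-memory map.
func (c *memCache) CloseAll() error { return nil }

// Compile-time check that memCache satisfies the interface.
var _ Cache = (*memCache)(nil)

func main() {
	var cache Cache = newMemCache()
	const uri = "http://www.example.com/index.html"
	if _, err := cache.Get(uri); errors.Is(err, ErrCacheMiss) {
		fmt.Println("miss:", uri)
	}
	if err := cache.Put(&CacheRecord{Key: uri, Status: 200}); err != nil {
		fmt.Println("error:", err)
		return
	}
	if cr, err := cache.Get(uri); err == nil {
		fmt.Println("hit:", cr.Key, cr.Status)
	}
}
```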

type CacheRecord

type CacheRecord struct {
	// Key is the normalised URL.
	Key string
	// URL is the originally requested URL.
	URL string
	// BaseDomain is the friendly domain name.
	BaseDomain string
	// Status code of response.
	Status int
	// Protocol originally used for response.
	Protocol string
	// ContentLanguage value (or empty string).
	ContentLanguage string
	// ContentType is the MIME type.
	ContentType string
	// ETag value (or empty string).
	ETag string
	// LastModified value (or empty string).
	LastModified string
	// ZstdBody holds the Zstd compressed HTTP body.
	ZstdBody []byte
	// CompressedLength is the length of ZstdBody.
	CompressedLength int64
	// ContentLength is the original content length.
	ContentLength int64
	// ResponseTime is the duration of the original request, in ms.
	ResponseTime float64
	// MD5 is the md5 sum of the uncompressed body.
	MD5 string
	// Created is the time this record was created.
	Created time.Time
}

func NewCacheRecord

func NewCacheRecord(uri string, status int, proto, lang, mime, etag, lastMod string, body []byte, responseTime float64, created time.Time) (*CacheRecord, error)

func (*CacheRecord) Body

func (r *CacheRecord) Body() (io.ReadCloser, error)

func (*CacheRecord) SetBody

func (r *CacheRecord) SetBody(body []byte) error

type SqliteCache

type SqliteCache struct {
	// contains filtered or unexported fields
}

func NewSqliteCache

func NewSqliteCache(cachePath string) *SqliteCache

NewSqliteCache initialises and returns a new SqliteCache.

func (*SqliteCache) CloseAll

func (c *SqliteCache) CloseAll() error

func (*SqliteCache) Flush

func (c *SqliteCache) Flush(uri string) error

func (*SqliteCache) Get

func (c *SqliteCache) Get(uri string) (*CacheRecord, error)

Get the cached response for the given URL. If the given URL does not exist in the cache, error ErrCacheMiss is returned.

func (*SqliteCache) Put

func (c *SqliteCache) Put(cr *CacheRecord) error

Put adds the given URL/response pair to the cache.

Directories

Path Synopsis
cmd
