limpet

v0.2.4
Published: Mar 18, 2026 | License: MIT, Unlicense | Imports: 25 | Imported by: 0

README

limpet - fetch, cache, and reuse web requests

A Go library and CLI for fetching web pages with automatic caching. Supports plain HTTP and headless browser (Playwright/Chromium) requests. Can run as a caching HTTP proxy with HTTPS CONNECT tunneling.

Features

  • HTTP + headless browser: fetch pages via standard HTTP or Playwright-driven Chromium
  • Blob storage: cache fetched pages to local filesystem or S3
  • Deterministic cache keys: normalized URL+method+headers+body maps to a SHA-256 blob key, with options to exclude headers and query params
  • Conditional requests: automatic ETag/If-Modified-Since revalidation in Transport avoids re-downloading unchanged content
  • Request deduplication: concurrent Transport requests for the same URL coalesce via singleflight
  • Version history: archive timestamped snapshots and diff pages to detect changes
  • Staleness hints: check HTTP cache headers or time-based age via Page.Stale() / Page.StaleAfter()
  • Per-request cache TTL: override the default TTL per request via context
  • Rate limiting: configurable per-request rate limits with exponential backoff
  • Silent throttle detection: detect and retry when a site silently serves captcha/block pages
  • HTTP proxy mode: caching HTTP proxy with HTTPS CONNECT tunneling (SSRF-safe)
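
The deterministic cache key feature can be illustrated with a small sketch. The helper below is hypothetical and is not limpet's actual key algorithm: the name `cacheKey`, the normalization steps, and the byte layout fed to the hash are all assumptions chosen for illustration. It shows how dropping ignored headers and query params lets equivalent requests map to the same SHA-256 key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// cacheKey is a hypothetical helper (not limpet's implementation). It hashes a
// normalized view of the request so that equivalent requests share one key.
func cacheKey(method, rawURL string, header http.Header, body []byte, ignoreHeaders, ignoreParams []string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""

	// Drop ignored query params; Encode() sorts the remaining keys.
	q := u.Query()
	for _, p := range ignoreParams {
		q.Del(p)
	}
	u.RawQuery = q.Encode()

	// Copy the headers, drop ignored ones, and emit the rest in sorted order.
	h := header.Clone()
	for _, name := range ignoreHeaders {
		h.Del(name)
	}
	names := make([]string, 0, len(h))
	for name := range h {
		names = append(names, name)
	}
	sort.Strings(names)

	sum := sha256.New()
	fmt.Fprintf(sum, "%s\n%s\n", strings.ToUpper(method), u.String())
	for _, name := range names {
		fmt.Fprintf(sum, "%s: %s\n", name, strings.Join(h[name], ","))
	}
	sum.Write([]byte("\n"))
	sum.Write(body)
	return hex.EncodeToString(sum.Sum(nil)), nil
}

func main() {
	h1 := http.Header{}
	h1.Set("User-Agent", "curl/8.0")
	h2 := http.Header{}
	h2.Set("User-Agent", "Mozilla/5.0")

	ignoreH := []string{"User-Agent"}
	ignoreP := []string{"utm_source"}
	k1, _ := cacheKey("GET", "https://Example.com/a?utm_source=x&id=1", h1, nil, ignoreH, ignoreP)
	k2, _ := cacheKey("get", "https://example.com/a?id=1", h2, nil, ignoreH, ignoreP)
	fmt.Println("same key:", k1 == k2)
}
```

Sorting header names and query keys before hashing is what makes the key independent of the order in which the caller supplied them.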

CLI Usage

# Install
go install github.com/arclabs561/limpet/cmd/limpet@latest

# Fetch a URL (cached on subsequent calls)
limpet do https://example.com

# Force re-fetch, ignoring cache
limpet do -f https://example.com

# Include response headers in output
limpet do -i https://example.com

# Use headless browser (Playwright)
limpet do -B https://example.com

# HEAD request (implies -i)
limpet do -I https://example.com

# Custom HTTP method
limpet do -X POST https://example.com/api

# Run as a caching HTTP proxy
limpet proxy -a localhost:8080

# List cached entries
limpet cache ls

# List cached entries for a specific host
limpet cache ls example.com

# Read a cached page's response body
limpet cache get example.com/abc123.json

# Show page metadata (URL, status, fetch time)
limpet cache get --meta example.com/abc123.json

# Delete a cached entry
limpet cache rm example.com/abc123.json

# Purge all cached entries (or by host prefix)
limpet cache purge
limpet cache purge example.com

Global flags

Flag              Default                 Description
-b, --bucket-url  file://<config>/bucket  Blob storage URL (file:// or s3://)
--cache-dir       <config>/cache          Local cache directory
--no-cache        false                   Disable local caching
--cache-ttl       24h                     Cache TTL (0 or forever for no expiry)
-L, --log-level   fatal                   Log level: trace, debug, info, warn, error, fatal
-F, --log-format  auto                    Log format: auto, console
-c, --log-color   auto                    Log color: auto, always, never

Rate Limiting

Set the LIMPET_RATE_LIMIT environment variable:

LIMPET_RATE_LIMIT=100        # 100 requests/second (default)
LIMPET_RATE_LIMIT=10/1m      # 10 requests/minute
LIMPET_RATE_LIMIT=none       # Unlimited

Format: <count>[/<duration>]. Duration uses Go syntax (1s, 1m, 1h).
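
As an illustration of the `<count>[/<duration>]` format, here is a minimal parser sketch. It is not limpet's own parsing code: `rateSpec` and `parseRateLimit` are hypothetical names, and the handling of invalid input is an assumption. A bare count is treated as requests per second, matching the examples above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// rateSpec is a hypothetical parsed form of "<count>[/<duration>]".
type rateSpec struct {
	Unlimited bool
	Count     int
	Per       time.Duration
}

// parseRateLimit is an illustrative parser, not limpet's own code.
func parseRateLimit(s string) (rateSpec, error) {
	s = strings.TrimSpace(s)
	if s == "none" {
		return rateSpec{Unlimited: true}, nil
	}
	countStr, durStr, hasDur := strings.Cut(s, "/")
	count, err := strconv.Atoi(countStr)
	if err != nil || count <= 0 {
		return rateSpec{}, fmt.Errorf("invalid count %q", countStr)
	}
	per := time.Second // a bare count means requests per second
	if hasDur {
		per, err = time.ParseDuration(durStr)
		if err != nil || per <= 0 {
			return rateSpec{}, fmt.Errorf("invalid duration %q", durStr)
		}
	}
	return rateSpec{Count: count, Per: per}, nil
}

func main() {
	for _, s := range []string{"100", "10/1m", "none"} {
		spec, err := parseRateLimit(s)
		fmt.Printf("%-6s -> %+v (err: %v)\n", s, spec, err)
	}
}
```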

Page Schema

Each fetched page is stored as JSON with three sections:

Page
+-- Meta
|   +-- Version      (uint16, currently 1)
|   +-- FetchedAt    (timestamp)
|   +-- FetchDur     (duration)
+-- Request
|   +-- URL
|   +-- RedirectedURL (if redirected)
|   +-- Method
|   +-- Header
|   +-- Body
+-- Response
    +-- StatusCode
    +-- ProtoMajor, ProtoMinor
    +-- Header
    +-- Body
    +-- ContentLength
    +-- TransferEncoding
    +-- Trailer
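
To make the stored shape concrete, the sketch below round-trips a trimmed-down page through `encoding/json`. The lowercase types are local mirrors of a few documented fields, not the library's types, and the sketch assumes limpet uses the standard encoder with no custom marshalers. Under that assumption, `[]byte` bodies appear as base64 strings and `time.Duration` values as integer nanoseconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Local mirrors of the documented Page types, trimmed to a few fields
// so the sketch runs without importing limpet.
type pageMeta struct {
	Version   uint16        `json:"version"`
	FetchedAt time.Time     `json:"fetched_at"`
	FetchDur  time.Duration `json:"fetch_dur"`
}

type pageRequest struct {
	URL    string `json:"url"`
	Method string `json:"method"`
}

type pageResponse struct {
	StatusCode int         `json:"status_code"`
	Header     http.Header `json:"header"`
	Body       []byte      `json:"body"`
}

type page struct {
	Meta     pageMeta     `json:"meta"`
	Request  pageRequest  `json:"request"`
	Response pageResponse `json:"response"`
}

// encodePage serializes a page with the standard library encoder.
func encodePage(p page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

// decodePage parses the JSON produced by encodePage.
func decodePage(s string) (page, error) {
	var p page
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	p := page{
		Meta: pageMeta{
			Version:   1,
			FetchedAt: time.Date(2026, 3, 18, 0, 0, 0, 0, time.UTC),
			FetchDur:  150 * time.Millisecond,
		},
		Request:  pageRequest{URL: "https://example.com", Method: "GET"},
		Response: pageResponse{StatusCode: 200, Header: http.Header{"Content-Type": {"text/html"}}, Body: []byte("<html></html>")},
	}
	out, err := encodePage(p)
	fmt.Println(out, err)
}
```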

Library Usage

import (
    "context"
    "fmt"

    "github.com/arclabs561/limpet"
    "github.com/arclabs561/limpet/blob"
)

ctx := context.Background()
bucket, _ := blob.NewBucket(ctx, "file:///tmp/limpet-cache", nil)
defer bucket.Close()

cl, _ := limpet.NewClient(ctx, bucket)
defer cl.Close()

// Simple GET with convenience method
page, _ := cl.Get(ctx, "https://example.com")
fmt.Println(string(page.Response.Body))

// Second call returns cached result
page, _ = cl.Get(ctx, "https://example.com")

Client options (construction time)

  • limpet.WithBrowser() -- always use headless browser
  • limpet.WithChromiumSandbox(false) -- disable Chromium OS sandbox (for CI containers)
  • limpet.WithRateLimit(10) -- set programmatic rate limit
  • limpet.WithRequestBodyLimit(10e6) -- max request body for cache key (default 10 MB, 0 = no limit)
  • limpet.WithResponseBodyLimit(100e6) -- max response body to cache (default 100 MB, 0 = no limit)
  • limpet.WithIgnoreHeaders("User-Agent", "Accept-Encoding") -- exclude headers from cache key (different browsers, same cache entry)
  • limpet.WithIgnoreParams("_t", "token", "utm_source") -- exclude query params from cache key (auth tokens, tracking params)
  • limpet.WithUserAgent("limpet/0.1") -- default User-Agent header (applied if not already set)
  • limpet.WithCacheStatuses(200, 301, 404) -- cache non-200 responses (default: 200 only)
  • limpet.WithRetry(limpet.RetryConfig{Attempts: 3, MinWait: 2 * time.Second}) -- configure retry (zero fields keep defaults: 5 attempts, 1s min, 1m max, 1s jitter)

Per-request options (DoConfig)

page, _ := cl.Do(ctx, req, limpet.DoConfig{
    Replace:        true,                // force re-fetch, bypassing cache
    Browser:        true,                // use headless browser for this request
    Archive:        true,                // store a timestamped snapshot for version history
    SilentThrottle: regexp.MustCompile(`captcha`), // detect and retry throttled responses
    Limiter:        rateLimiter,         // per-request rate limiter
})

Batch fetching

// Fetch multiple URLs concurrently (up to 5 at a time)
err := cl.GetMany(ctx, urls, 5, limpet.DoConfig{}, func(url string, page *limpet.Page, err error) error {
    if err != nil {
        return err // stops remaining fetches
    }
    fmt.Printf("%s: %d bytes\n", url, len(page.Response.Body))
    return nil
})

Error types

Do and Get return typed errors for non-200 responses and throttling:

page, err := cl.Get(ctx, url)
var statusErr *limpet.StatusError
if errors.As(err, &statusErr) {
    fmt.Printf("HTTP %d\n", statusErr.Page.Response.StatusCode)
}

var throttleErr *limpet.ThrottledError
if errors.As(err, &throttleErr) {
    // site returned a captcha/block page matching SilentThrottle pattern
}

Version history and change detection

// Fetch with archive to build version history
page, _ := cl.Do(ctx, req, limpet.DoConfig{Archive: true, Replace: true})

// List all archived snapshots for a request
versions, _ := cl.Versions(ctx, req)
for _, v := range versions {
    fmt.Printf("%s  %s\n", v.FetchedAt, v.Key)
}

// Read the oldest and newest versions and compare
// (named oldPage/newPage to avoid shadowing the builtin new)
oldPage, _ := cl.Version(ctx, versions[0].Key)
newPage, _ := cl.Version(ctx, versions[len(versions)-1].Key)
diff := limpet.Diff(oldPage, newPage)
fmt.Printf("changed=%v old_size=%d new_size=%d\n", diff.Changed, diff.OldSize, diff.NewSize)

Staleness

// Check HTTP cache headers (Cache-Control, Expires)
if page.Stale() {
    page, _ = cl.Get(ctx, url, limpet.DoConfig{Replace: true})
}

// Check time-based age (for targets with no cache headers)
if page.StaleAfter(24 * time.Hour) {
    page, _ = cl.Get(ctx, url, limpet.DoConfig{Replace: true})
}

Transport (http.RoundTripper)

For integrating caching into any http.Client:

bucket, _ := blob.NewBucket(ctx, "file:///tmp/cache", nil)
defer bucket.Close()

tr := limpet.NewTransport(bucket,
    limpet.TransportWithRateLimit(10),
    limpet.TransportWithRequestBodyLimit(10e6),   // 10 MB (default)
    limpet.TransportWithResponseBodyLimit(100e6), // 100 MB (default)
    limpet.TransportWithIgnoreHeaders("User-Agent", "Accept-Encoding"),
)

client := &http.Client{Transport: tr}

// First call fetches and caches. Second call returns from cache.
resp, _ := client.Get("https://example.com")
// resp.Header.Get("X-Limpet-Source") == "fetch", "cache", or "revalidated"

Per-request cache control via context:

// Skip cache read, force fresh fetch (still caches the result)
ctx = limpet.WithCachePolicy(ctx, limpet.CachePolicyReplace)

// Bypass cache entirely (no read, no write)
ctx = limpet.WithCachePolicy(ctx, limpet.CachePolicySkip)

// Per-request cache TTL (overrides bucket default)
ctx = limpet.WithCacheTTL(ctx, 7*24*time.Hour) // weekly

Client vs Transport

Use Transport when you want transparent caching as a drop-in http.RoundTripper for any http.Client. Use Client when you also need retry with backoff, headless browser rendering, version history, or silent throttle detection. Transport is the caching primitive; Client composes it with higher-level scraping features.

License

Dual-licensed under the MIT License or the Unlicense.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func WithCachePolicy

func WithCachePolicy(ctx context.Context, p CachePolicy) context.Context

WithCachePolicy returns a context that carries the given cache policy.

func WithCacheTTL added in v0.2.0

func WithCacheTTL(ctx context.Context, ttl time.Duration) context.Context

WithCacheTTL returns a context that overrides the default cache TTL for writes made with this context. Works with both Client and Transport. Use 0 for no expiry, or a positive duration for a custom TTL.

Types

type CachePolicy

type CachePolicy int

CachePolicy controls per-request caching behavior.

const (
	// CachePolicyDefault reads from cache on hit, writes on miss (status 200).
	CachePolicyDefault CachePolicy = iota
	// CachePolicyReplace skips cache read but still writes on status 200.
	CachePolicyReplace
	// CachePolicySkip bypasses cache entirely (no read, no write).
	CachePolicySkip
)

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client fetches web pages with automatic caching.

func NewClient

func NewClient(
	ctx context.Context,
	bucket *blob.Bucket,
	opts ...Option,
) (*Client, error)

NewClient creates a new Client with the given blob bucket and options.

func (*Client) Close

func (c *Client) Close()

Close shuts down the browser (if started) and releases resources.

func (*Client) Do

func (c *Client) Do(
	ctx context.Context,
	req *http.Request,
	cfgs ...DoConfig,
) (page *Page, err error)

Do fetches the given request, returning a cached result if available. Pass a DoConfig to control caching, browser mode, and rate limiting.

func (*Client) Get

func (c *Client) Get(ctx context.Context, url string, cfgs ...DoConfig) (*Page, error)

Get is a convenience method for fetching a URL with GET.

func (*Client) GetMany

func (c *Client) GetMany(
	ctx context.Context,
	urls []string,
	concurrency int,
	cfg DoConfig,
	fn func(url string, page *Page, err error) error,
) error

GetMany fetches multiple URLs concurrently with the given concurrency limit. The callback fn is called for each completed fetch (in arbitrary order). Stops early if ctx is cancelled. Returns the first non-nil error from fn, or nil if all callbacks return nil.

func (*Client) Version

func (c *Client) Version(ctx context.Context, key string) (*Page, error)

Version reads a specific archived page snapshot by its key.

func (*Client) Versions

func (c *Client) Versions(ctx context.Context, req *http.Request) ([]PageVersion, error)

Versions lists all archived snapshots for the given request, ordered by fetch time (oldest first). Returns nil if no archive entries exist.

type DoConfig

type DoConfig struct {
	// Replace skips cache read, forcing a fresh fetch (still caches the result).
	Replace bool
	// Browser uses headless Chromium instead of plain HTTP.
	Browser bool
	// Archive stores a timestamped snapshot alongside the latest cache entry.
	// Use Client.Versions to list snapshots and detect changes over time.
	Archive bool
	// SilentThrottle detects and retries when a site silently serves
	// captcha/block pages matching this regexp.
	SilentThrottle *regexp.Regexp
	// Limiter applies a per-request rate limiter instead of the client default.
	Limiter Limiter
}

DoConfig controls per-request behavior for Client.Do and Client.Get.

type Limiter

type Limiter interface {
	Take() time.Time
}

Limiter is a rate limiter interface compatible with go.uber.org/ratelimit.
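
Any type with a `Take() time.Time` method satisfies this interface. As a minimal sketch, the hypothetical `intervalLimiter` below spaces calls a fixed interval apart. It is illustrative only and is neither limpet's default limiter nor the algorithm used by go.uber.org/ratelimit.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter mirrors the documented interface.
type Limiter interface {
	Take() time.Time
}

// intervalLimiter is a hypothetical implementation that spaces calls
// at least `every` apart.
type intervalLimiter struct {
	mu    sync.Mutex
	every time.Duration
	next  time.Time
}

func newIntervalLimiter(every time.Duration) *intervalLimiter {
	return &intervalLimiter{every: every}
}

// Take blocks until the next slot is available and returns the slot's time.
func (l *intervalLimiter) Take() time.Time {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.Before(l.next) {
		time.Sleep(l.next.Sub(now))
		now = l.next
	}
	l.next = now.Add(l.every)
	return now
}

func main() {
	var lim Limiter = newIntervalLimiter(20 * time.Millisecond)
	first := lim.Take()
	second := lim.Take()
	fmt.Println("spaced by at least 20ms:", second.Sub(first) >= 20*time.Millisecond)
}
```

A value like this could be passed as the `Limiter` field of `DoConfig`, which the documentation describes as a per-request override of the client default.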

type Option

type Option func(*Client)

Option configures a Client at construction time.
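
`Option` is an instance of the functional options pattern. The sketch below shows the pattern with a hypothetical `client` type, since the real Client's fields are unexported. The default values are assumptions, apart from the 100 requests/second rate limit mentioned in the README.

```go
package main

import "fmt"

// client is a hypothetical stand-in with two configurable fields.
type client struct {
	userAgent string
	rateLimit int
}

// option mirrors the documented shape: a function that mutates the client.
type option func(*client)

func withUserAgent(ua string) option {
	return func(c *client) { c.userAgent = ua }
}

func withRateLimit(rps int) option {
	return func(c *client) { c.rateLimit = rps }
}

// newClient sets defaults first, then applies options in order,
// so later options override earlier ones.
func newClient(opts ...option) *client {
	c := &client{userAgent: "example/0.0", rateLimit: 100}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := newClient(withRateLimit(10), withUserAgent("limpet/0.1"))
	fmt.Println(c.userAgent, c.rateLimit)
}
```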

func WithBrowser

func WithBrowser() Option

WithBrowser configures the client to always use headless browser.

func WithCacheStatuses added in v0.2.0

func WithCacheStatuses(codes ...int) Option

WithCacheStatuses sets which HTTP status codes are eligible for caching. By default only 200 is cached. Use this to also cache redirects, 404s, etc.

func WithChromiumSandbox added in v0.2.0

func WithChromiumSandbox(enabled bool) Option

WithChromiumSandbox controls whether the headless Chromium browser runs with OS-level sandboxing. Defaults to true. Set to false in environments where sandboxing is unsupported (e.g. CI containers without suid sandbox).

func WithIgnoreHeaders added in v0.2.0

func WithIgnoreHeaders(names ...string) Option

WithIgnoreHeaders excludes the named headers from cache key computation. Useful for scraping where User-Agent or Accept-Encoding vary between requests but should map to the same cache entry.

func WithIgnoreParams added in v0.2.0

func WithIgnoreParams(names ...string) Option

WithIgnoreParams excludes the named query parameters from cache key computation. Useful for stripping auth tokens, timestamps, or tracking params (utm_source, etc.) that vary between requests to the same resource.

func WithRateLimit

func WithRateLimit(rps int, opts ...ratelimit.Option) Option

WithRateLimit sets a programmatic rate limit, overriding the env var.

func WithRequestBodyLimit added in v0.2.0

func WithRequestBodyLimit(n int64) Option

WithRequestBodyLimit sets the maximum request body size used for cache key computation. 0 means no limit. Default: 10 MB.

func WithResponseBodyLimit added in v0.2.0

func WithResponseBodyLimit(n int64) Option

WithResponseBodyLimit sets the maximum response body size to read and cache. 0 means no limit. Default: 100 MB.

func WithRetry added in v0.2.0

func WithRetry(cfg RetryConfig) Option

WithRetry configures retry behavior. Zero-value fields keep defaults.

func WithUserAgent added in v0.2.0

func WithUserAgent(ua string) Option

WithUserAgent sets a default User-Agent header on all HTTP requests. The header is added before each request if not already set by the caller.

type Page

type Page struct {
	Meta     PageMeta     `json:"meta"`
	Request  PageRequest  `json:"request"`
	Response PageResponse `json:"response"`
}

Page is a cached HTTP request/response pair with metadata.

func (*Page) HTTPResponse

func (p *Page) HTTPResponse() *http.Response

HTTPResponse reconstructs a standard *http.Response from the cached page. The returned response has its own header map (safe for concurrent use).
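
A minimal sketch of this kind of reconstruction, assuming only a status code, headers, and a buffered body are available. `toHTTPResponse` is a hypothetical helper, not the library's method. Cloning the header map is what gives each returned response its own copy.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// toHTTPResponse is a hypothetical helper that builds an *http.Response
// from buffered data. The header map is cloned so callers can mutate it
// without affecting the cached copy.
func toHTTPResponse(status int, h http.Header, body []byte) *http.Response {
	return &http.Response{
		Status:        fmt.Sprintf("%d %s", status, http.StatusText(status)),
		StatusCode:    status,
		Proto:         "HTTP/1.1",
		ProtoMajor:    1,
		ProtoMinor:    1,
		Header:        h.Clone(),
		Body:          io.NopCloser(bytes.NewReader(body)),
		ContentLength: int64(len(body)),
	}
}

func main() {
	cached := http.Header{"Content-Type": {"text/html"}}
	resp := toHTTPResponse(200, cached, []byte("<html></html>"))
	defer resp.Body.Close()

	resp.Header.Set("X-Extra", "1") // does not touch the cached header map
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), cached.Get("X-Extra") == "")
}
```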

func (*Page) Stale

func (p *Page) Stale() bool

Stale reports whether this cached page is stale according to HTTP cache semantics (Cache-Control max-age, Expires header). Returns false if the response has no cache directives (treat as fresh). This is informational -- the caller decides whether to re-fetch.

func (*Page) StaleAfter

func (p *Page) StaleAfter(maxAge time.Duration) bool

StaleAfter reports whether this cached page was fetched more than maxAge ago. Use this for scraping targets that send no HTTP cache headers.

type PageDiff

type PageDiff struct {
	Changed    bool
	OldSize    int
	NewSize    int
	OldFetched time.Time
	NewFetched time.Time
}

PageDiff describes the difference between two page snapshots.

func Diff

func Diff(a, b *Page) PageDiff

Diff compares two pages and returns whether the response body changed.
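
A body-level comparison like this can be sketched as follows. The `snapshot` type and both functions are hypothetical. The real `Diff` may compare bodies differently, for example via the `BodyHash` digest that `PageVersion` exposes, so the hash helper is shown alongside the byte comparison.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// snapshot is a hypothetical stand-in for the parts of a Page that matter here.
type snapshot struct {
	Body      []byte
	FetchedAt time.Time
}

// pageDiff mirrors the documented PageDiff fields.
type pageDiff struct {
	Changed    bool
	OldSize    int
	NewSize    int
	OldFetched time.Time
	NewFetched time.Time
}

// diff reports whether the body changed, plus sizes and fetch times.
func diff(a, b snapshot) pageDiff {
	return pageDiff{
		Changed:    !bytes.Equal(a.Body, b.Body),
		OldSize:    len(a.Body),
		NewSize:    len(b.Body),
		OldFetched: a.FetchedAt,
		NewFetched: b.FetchedAt,
	}
}

// bodyHash returns the SHA-256 hex digest of a body.
func bodyHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

func main() {
	t0 := time.Date(2026, 3, 17, 0, 0, 0, 0, time.UTC)
	oldSnap := snapshot{Body: []byte("price: 10"), FetchedAt: t0}
	newSnap := snapshot{Body: []byte("price: 12"), FetchedAt: t0.Add(24 * time.Hour)}

	d := diff(oldSnap, newSnap)
	fmt.Printf("changed=%v old_size=%d new_size=%d\n", d.Changed, d.OldSize, d.NewSize)
	fmt.Println(bodyHash(oldSnap.Body) == bodyHash(newSnap.Body))
}
```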

type PageMeta

type PageMeta struct {
	Version   uint16        `json:"version"`
	Source    string        `json:"-"`
	FetchedAt time.Time     `json:"fetched_at"`
	FetchDur  time.Duration `json:"fetch_dur"`
}

PageMeta contains cache metadata for a fetched page.

type PageRequest

type PageRequest struct {
	URL           string      `json:"url"`
	RedirectedURL string      `json:"redirected_url,omitempty"`
	Method        string      `json:"method"`
	Header        http.Header `json:"header,omitempty"`
	Body          []byte      `json:"body,omitempty"`
}

PageRequest stores the original HTTP request details.

type PageResponse

type PageResponse struct {
	StatusCode       int         `json:"status_code"`
	ProtoMajor       int         `json:"proto_major"`
	ProtoMinor       int         `json:"proto_minor"`
	TransferEncoding []string    `json:"transfer_encoding,omitempty"`
	ContentLength    int64       `json:"content_length"`
	Header           http.Header `json:"header"`
	Body             []byte      `json:"body"`
	Trailer          http.Header `json:"trailer,omitempty"`
}

PageResponse stores the HTTP response details including the body.

type PageVersion

type PageVersion struct {
	Key       string    // Cache key for this snapshot.
	FetchedAt time.Time // When this snapshot was fetched.
	BodyHash  string    // SHA-256 hex digest of the response body.
}

PageVersion describes a single archived snapshot of a cached page.

type RetryConfig added in v0.2.0

type RetryConfig struct {
	// Attempts is the maximum number of tries (including the first). Default: 5.
	Attempts int
	// MinWait is the base wait duration for exponential backoff. Default: 1s.
	MinWait time.Duration
	// MaxWait caps the backoff duration. Default: 1m.
	MaxWait time.Duration
	// Jitter adds random jitter up to this duration per attempt. Default: 1s.
	Jitter time.Duration
}

RetryConfig controls retry behavior for failed HTTP requests.

type StatusError

type StatusError struct {
	Page *Page
}

StatusError is returned when the HTTP status is not 200 OK. The Page contains the response and status.

func (*StatusError) Error

func (e *StatusError) Error() string

type ThrottledError

type ThrottledError struct{}

ThrottledError is returned when the fetch is throttled.

func (*ThrottledError) Error

func (e *ThrottledError) Error() string

type Transport

type Transport struct {
	// Base is the underlying RoundTripper. Nil means http.DefaultTransport.
	Base http.RoundTripper
	// contains filtered or unexported fields
}

Transport is an http.RoundTripper that caches responses in a blob.Bucket. Responses are fully buffered (no streaming). By default, only HTTP 200 responses are cached; TransportWithCacheStatuses changes which status codes are eligible.

Use WithCachePolicy on the request context to control per-request caching.

func NewTransport

func NewTransport(bucket *blob.Bucket, opts ...TransportOption) *Transport

NewTransport creates a caching Transport backed by the given bucket.

func (*Transport) RoundTrip

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip executes a single HTTP transaction with caching.

func (*Transport) Stats added in v0.2.0

func (t *Transport) Stats() TransportStatsSnapshot

Stats returns a snapshot of the transport's cache performance counters.

type TransportOption

type TransportOption func(*Transport)

TransportOption configures a Transport.

func TransportWithCacheStatuses added in v0.2.0

func TransportWithCacheStatuses(codes ...int) TransportOption

TransportWithCacheStatuses sets which HTTP status codes are eligible for caching. By default only 200 is cached.

func TransportWithIgnoreHeaders added in v0.2.0

func TransportWithIgnoreHeaders(names ...string) TransportOption

TransportWithIgnoreHeaders excludes the named headers from cache key computation. Useful when User-Agent or Accept-Encoding vary between requests but should map to the same cache entry.

func TransportWithIgnoreParams added in v0.2.0

func TransportWithIgnoreParams(names ...string) TransportOption

TransportWithIgnoreParams excludes the named query parameters from cache key computation. Useful for stripping auth tokens or tracking params.

func TransportWithRateLimit

func TransportWithRateLimit(rps int, opts ...ratelimit.Option) TransportOption

TransportWithRateLimit sets a rate limit on outgoing requests.

func TransportWithRequestBodyLimit

func TransportWithRequestBodyLimit(n int64) TransportOption

TransportWithRequestBodyLimit sets the maximum request body size used for cache key computation. 0 means no limit.

func TransportWithResponseBodyLimit

func TransportWithResponseBodyLimit(n int64) TransportOption

TransportWithResponseBodyLimit sets the maximum response body size to cache. 0 means no limit.

func TransportWithUserAgent added in v0.2.0

func TransportWithUserAgent(ua string) TransportOption

TransportWithUserAgent sets a default User-Agent header on all requests.

type TransportStats added in v0.2.0

type TransportStats struct {
	// Hits counts cache hits (served from cache without fetch).
	Hits atomic.Int64
	// Misses counts cache misses (required a fetch).
	Misses atomic.Int64
	// Revalidated counts conditional requests that returned 304.
	Revalidated atomic.Int64
	// Coalesced counts requests served by singleflight coalescing.
	Coalesced atomic.Int64
}

TransportStats tracks cache performance counters. All fields are safe for concurrent access. Read via Transport.Stats().

type TransportStatsSnapshot added in v0.2.0

type TransportStatsSnapshot struct {
	Hits        int64
	Misses      int64
	Revalidated int64
	Coalesced   int64
}

TransportStatsSnapshot is a point-in-time snapshot of cache stats.
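
The split between live atomic counters and a plain snapshot struct can be sketched as below. The types are hypothetical, trimmed to two counters, and `hitRate` is an illustrative derived metric that the documented API does not list. `atomic.Int64` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats holds live counters that are safe for concurrent updates.
type stats struct {
	Hits   atomic.Int64
	Misses atomic.Int64
}

// statsSnapshot is a plain copy that can be passed around freely.
type statsSnapshot struct {
	Hits   int64
	Misses int64
}

// snapshot loads each counter once. The counters are read independently,
// so the result is not an atomic view across fields.
func (s *stats) snapshot() statsSnapshot {
	return statsSnapshot{Hits: s.Hits.Load(), Misses: s.Misses.Load()}
}

// hitRate returns hits / (hits + misses), or 0 when nothing was recorded.
func (s statsSnapshot) hitRate() float64 {
	total := s.Hits + s.Misses
	if total == 0 {
		return 0
	}
	return float64(s.Hits) / float64(total)
}

func main() {
	var s stats
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%4 == 0 {
				s.Misses.Add(1)
			} else {
				s.Hits.Add(1)
			}
		}(i)
	}
	wg.Wait()

	snap := s.snapshot()
	fmt.Printf("hits=%d misses=%d rate=%.2f\n", snap.Hits, snap.Misses, snap.hitRate())
}
```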

Directories

Path        Synopsis
cmd/limpet  (command)
