limpet

package module

Version: v0.2.4 (latest) | Published: Mar 18, 2026 | License: MIT, Unlicense | Imports: 25 | Imported by: 0

Repository

github.com/arclabs561/limpet

README ¶

limpet - fetch, cache, and reuse web requests

A Go library and CLI for fetching web pages with automatic caching. Supports plain HTTP and headless browser (Playwright/Chromium) requests. Can run as a caching HTTP proxy with HTTPS CONNECT tunneling.

Features

HTTP + headless browser: fetch pages via standard HTTP or Playwright-driven Chromium
Blob storage: cache fetched pages to local filesystem or S3
Deterministic cache keys: normalized URL+method+headers+body maps to a SHA-256 blob key, with options to exclude headers and query params
Conditional requests: automatic ETag/If-Modified-Since revalidation in Transport avoids re-downloading unchanged content
Request deduplication: concurrent Transport requests for the same URL coalesce via singleflight
Version history: archive timestamped snapshots and diff pages to detect changes
Staleness hints: check HTTP cache headers or time-based age via Page.Stale() / Page.StaleAfter()
Per-request cache TTL: override the default TTL per request via context
Rate limiting: configurable per-request rate limits with exponential backoff
Silent throttle detection: detect and retry when a site silently serves captcha/block pages
HTTP proxy mode: caching HTTP proxy with HTTPS CONNECT tunneling (SSRF-safe)
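
The "Deterministic cache keys" feature can be pictured with a small stand-alone sketch. This is not limpet's actual key derivation: the helper name cacheKey, the normalization steps, and the byte layout fed to SHA-256 are all assumptions chosen for illustration. It only shows how ignoring selected headers and query parameters lets two superficially different requests share one key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// cacheKey is a hypothetical helper, not part of limpet. It hashes the
// method, a normalized URL, the non-ignored headers, and the body.
func cacheKey(method, rawURL string, header http.Header, body []byte, ignoreHeaders, ignoreParams []string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""

	// Drop ignored query params; Encode sorts the remaining keys.
	q := u.Query()
	for _, p := range ignoreParams {
		q.Del(p)
	}
	u.RawQuery = q.Encode()

	// Collect the names of headers that are not ignored, in sorted order.
	skip := map[string]bool{}
	for _, h := range ignoreHeaders {
		skip[http.CanonicalHeaderKey(h)] = true
	}
	var names []string
	for name := range header {
		if !skip[http.CanonicalHeaderKey(name)] {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	h := sha256.New()
	fmt.Fprintf(h, "%s\n%s\n", strings.ToUpper(method), u.String())
	for _, name := range names {
		fmt.Fprintf(h, "%s: %s\n", name, strings.Join(header[name], ","))
	}
	h.Write([]byte{0}) // separator between headers and body
	h.Write(body)
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	ignoreH := []string{"User-Agent"}
	ignoreP := []string{"utm_source"}
	a, _ := cacheKey("GET", "https://Example.com/p?utm_source=x&id=1", http.Header{"User-Agent": {"A"}}, nil, ignoreH, ignoreP)
	b, _ := cacheKey("get", "https://example.com/p?id=1", http.Header{"User-Agent": {"B"}}, nil, ignoreH, ignoreP)
	fmt.Println(a == b, a)
}
```

In this sketch the two requests differ only in host case, User-Agent, and a tracking parameter, so they hash to the same key once those are normalized or ignored.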

CLI Usage

# Install
go install github.com/arclabs561/limpet/cmd/limpet@latest

# Fetch a URL (cached on subsequent calls)
limpet do https://example.com

# Force re-fetch, ignoring cache
limpet do -f https://example.com

# Include response headers in output
limpet do -i https://example.com

# Use headless browser (Playwright)
limpet do -B https://example.com

# HEAD request (implies -i)
limpet do -I https://example.com

# Custom HTTP method
limpet do -X POST https://example.com/api

# Run as a caching HTTP proxy
limpet proxy -a localhost:8080

# List cached entries
limpet cache ls

# List cached entries for a specific host
limpet cache ls example.com

# Read a cached page's response body
limpet cache get example.com/abc123.json

# Show page metadata (URL, status, fetch time)
limpet cache get --meta example.com/abc123.json

# Delete a cached entry
limpet cache rm example.com/abc123.json

# Purge all cached entries (or by host prefix)
limpet cache purge
limpet cache purge example.com

Global flags

Flag	Default	Description
`-b`, `--bucket-url`	`file://<config>/bucket`	Blob storage URL (`file://` or `s3://`)
`--cache-dir`	`<config>/cache`	Local cache directory
`--no-cache`	`false`	Disable local caching
`--cache-ttl`	`24h`	Cache TTL (`0` or `forever` for no expiry)
`-L`, `--log-level`	`fatal`	Log level: `trace`, `debug`, `info`, `warn`, `error`, `fatal`
`-F`, `--log-format`	`auto`	Log format: `auto`, `console`
`-c`, `--log-color`	`auto`	Log color: `auto`, `always`, `never`

Rate Limiting

Set the LIMPET_RATE_LIMIT environment variable:

LIMPET_RATE_LIMIT=100        # 100 requests/second (default)
LIMPET_RATE_LIMIT=10/1m      # 10 requests/minute
LIMPET_RATE_LIMIT=none       # Unlimited

Format: <count>[/<duration>]. Duration uses Go syntax (1s, 1m, 1h).
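
As an illustration of the format above, here is a minimal parser sketch. The names rateLimit and parseRateLimit are hypothetical and this is not limpet's own parsing code; it assumes that a missing duration means "per second", as the first example line suggests.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// rateLimit is a hypothetical parsed form of a LIMPET_RATE_LIMIT value.
type rateLimit struct {
	Unlimited bool
	Count     int
	Per       time.Duration
}

// parseRateLimit parses "<count>[/<duration>]" or "none".
func parseRateLimit(s string) (rateLimit, error) {
	s = strings.TrimSpace(s)
	if s == "none" {
		return rateLimit{Unlimited: true}, nil
	}
	countStr, durStr, hasDur := strings.Cut(s, "/")
	count, err := strconv.Atoi(countStr)
	if err != nil || count <= 0 {
		return rateLimit{}, fmt.Errorf("invalid count in %q", s)
	}
	per := time.Second // assumed default window when no duration is given
	if hasDur {
		per, err = time.ParseDuration(durStr)
		if err != nil || per <= 0 {
			return rateLimit{}, fmt.Errorf("invalid duration in %q", s)
		}
	}
	return rateLimit{Count: count, Per: per}, nil
}

func main() {
	for _, s := range []string{"100", "10/1m", "none"} {
		r, err := parseRateLimit(s)
		fmt.Println(s, r, err)
	}
}
```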

Page Schema

Each fetched page is stored as JSON with three sections:

Page
+-- Meta
|   +-- Version      (uint16, currently 1)
|   +-- FetchedAt    (timestamp)
|   +-- FetchDur     (duration)
+-- Request
|   +-- URL
|   +-- RedirectedURL (if redirected)
|   +-- Method
|   +-- Header
|   +-- Body
+-- Response
    +-- StatusCode
    +-- ProtoMajor, ProtoMinor
    +-- Header
    +-- Body
    +-- ContentLength
    +-- TransferEncoding
    +-- Trailer
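
To make the layout concrete, the sketch below round-trips a trimmed-down page through encoding/json. The lowercase types are local mirrors of a few documented fields and JSON tags, not the library's own types, and the sample values are made up. It assumes the default encoding/json behavior, under which a []byte body is written as a base64 string and a time.Duration as an integer count of nanoseconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Local mirrors of a subset of the documented Page fields (same JSON tags).
type pageMeta struct {
	Version   uint16        `json:"version"`
	FetchedAt time.Time     `json:"fetched_at"`
	FetchDur  time.Duration `json:"fetch_dur"`
}

type pageRequest struct {
	URL    string `json:"url"`
	Method string `json:"method"`
}

type pageResponse struct {
	StatusCode int         `json:"status_code"`
	Header     http.Header `json:"header"`
	Body       []byte      `json:"body"`
}

type page struct {
	Meta     pageMeta     `json:"meta"`
	Request  pageRequest  `json:"request"`
	Response pageResponse `json:"response"`
}

// encodeSample builds an illustrative page and marshals it to JSON.
func encodeSample() ([]byte, error) {
	p := page{
		Meta: pageMeta{
			Version:   1,
			FetchedAt: time.Date(2026, 3, 18, 12, 0, 0, 0, time.UTC),
			FetchDur:  1500 * time.Millisecond,
		},
		Request: pageRequest{URL: "https://example.com", Method: "GET"},
		Response: pageResponse{
			StatusCode: 200,
			Header:     http.Header{"Content-Type": {"text/html"}},
			Body:       []byte("hi"),
		},
	}
	return json.Marshal(p)
}

// decode parses JSON produced by encodeSample back into a page.
func decode(data []byte) (page, error) {
	var p page
	err := json.Unmarshal(data, &p)
	return p, err
}

func main() {
	data, err := encodeSample()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
	p, err := decode(data)
	fmt.Println(p.Response.StatusCode, string(p.Response.Body), err)
}
```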

Library Usage

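package main

import (
	"context"
	"fmt"
	"log"

	"github.com/arclabs561/limpet"
	"github.com/arclabs561/limpet/blob"
)

func main() {
	ctx := context.Background()

	// Open a blob bucket on the local filesystem to hold cached pages.
	bucket, err := blob.NewBucket(ctx, "file:///tmp/limpet-cache", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	cl, err := limpet.NewClient(ctx, bucket)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// Simple GET with the convenience method.
	page, err := cl.Get(ctx, "https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(page.Response.Body))

	// A second call for the same URL returns the cached result.
	page, err = cl.Get(ctx, "https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(page.Meta.FetchedAt)
}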

Client options (construction time)

limpet.WithBrowser() -- always use headless browser
limpet.WithChromiumSandbox(false) -- disable Chromium OS sandbox (for CI containers)
limpet.WithRateLimit(10) -- set programmatic rate limit
limpet.WithRequestBodyLimit(10e6) -- max request body for cache key (default 10 MB, 0 = no limit)
limpet.WithResponseBodyLimit(100e6) -- max response body to cache (default 100 MB, 0 = no limit)
limpet.WithIgnoreHeaders("User-Agent", "Accept-Encoding") -- exclude headers from cache key (different browsers, same cache entry)
limpet.WithIgnoreParams("_t", "token", "utm_source") -- exclude query params from cache key (auth tokens, tracking params)
limpet.WithUserAgent("limpet/0.1") -- default User-Agent header (applied if not already set)
limpet.WithCacheStatuses(200, 301, 404) -- cache non-200 responses (default: 200 only)
limpet.WithRetry(limpet.RetryConfig{Attempts: 3, MinWait: 2 * time.Second}) -- configure retry (zero fields keep defaults: 5 attempts, 1s min, 1m max, 1s jitter)

Per-request options (DoConfig)

// req is an *http.Request; rateLimiter is any value implementing limpet.Limiter.
page, _ := cl.Do(ctx, req, limpet.DoConfig{
    Replace:        true,                          // force re-fetch, bypassing cache
    Browser:        true,                          // use headless browser for this request
    Archive:        true,                          // store a timestamped snapshot for version history
    SilentThrottle: regexp.MustCompile(`captcha`), // detect and retry throttled responses
    Limiter:        rateLimiter,                   // per-request rate limiter
})

Batch fetching

// Fetch multiple URLs concurrently (up to 5 at a time)
err := cl.GetMany(ctx, urls, 5, limpet.DoConfig{}, func(url string, page *limpet.Page, err error) error {
    if err != nil {
        return err // stops remaining fetches
    }
    fmt.Printf("%s: %d bytes\n", url, len(page.Response.Body))
    return nil
})

Error types

Do and Get return typed errors for non-200 responses and throttling:

page, err := cl.Get(ctx, url)
var statusErr *limpet.StatusError
if errors.As(err, &statusErr) {
    fmt.Printf("HTTP %d\n", statusErr.Page.Response.StatusCode)
}

var throttleErr *limpet.ThrottledError
if errors.As(err, &throttleErr) {
    // site returned a captcha/block page matching SilentThrottle pattern
}

Version history and change detection

// Fetch with archive to build version history
page, _ := cl.Do(ctx, req, limpet.DoConfig{Archive: true, Replace: true})

// List all archived snapshots for a request
versions, _ := cl.Versions(ctx, req)
for _, v := range versions {
    fmt.Printf("%s  %s\n", v.FetchedAt, v.Key)
}

// Read the oldest and newest snapshots and compare (Versions returns oldest first)
if len(versions) >= 2 {
    oldest, _ := cl.Version(ctx, versions[0].Key)
    latest, _ := cl.Version(ctx, versions[len(versions)-1].Key)
    diff := limpet.Diff(oldest, latest)
    fmt.Printf("changed=%v old_size=%d new_size=%d\n", diff.Changed, diff.OldSize, diff.NewSize)
}

Staleness

// Check HTTP cache headers (Cache-Control, Expires)
if page.Stale() {
    page, _ = cl.Get(ctx, url, limpet.DoConfig{Replace: true})
}

// Check time-based age (for targets with no cache headers)
if page.StaleAfter(24 * time.Hour) {
    page, _ = cl.Get(ctx, url, limpet.DoConfig{Replace: true})
}

Transport (http.RoundTripper)

For integrating caching into any http.Client:

bucket, _ := blob.NewBucket(ctx, "file:///tmp/cache", nil)
defer bucket.Close()

tr := limpet.NewTransport(bucket,
    limpet.TransportWithRateLimit(10),
    limpet.TransportWithRequestBodyLimit(10e6),   // 10 MB (default)
    limpet.TransportWithResponseBodyLimit(100e6), // 100 MB (default)
    limpet.TransportWithIgnoreHeaders("User-Agent", "Accept-Encoding"),
)

client := &http.Client{Transport: tr}

// First call fetches and caches. Second call returns from cache.
resp, _ := client.Get("https://example.com")
// resp.Header.Get("X-Limpet-Source") == "fetch", "cache", or "revalidated"

Per-request cache control via context:

// Skip cache read, force fresh fetch (still caches the result)
ctx = limpet.WithCachePolicy(ctx, limpet.CachePolicyReplace)

// Bypass cache entirely (no read, no write)
ctx = limpet.WithCachePolicy(ctx, limpet.CachePolicySkip)

// Per-request cache TTL (overrides bucket default)
ctx = limpet.WithCacheTTL(ctx, 7*24*time.Hour) // weekly

Client vs Transport

Use Transport when you want transparent caching as a drop-in http.RoundTripper for any http.Client. Use Client when you also need retry with backoff, headless browser rendering, version history, or silent throttle detection. Transport is the caching primitive; Client composes it with higher-level scraping features.

License

Dual-licensed under the MIT License or the Unlicense.

Documentation ¶

Index ¶

func WithCachePolicy(ctx context.Context, p CachePolicy) context.Context
func WithCacheTTL(ctx context.Context, ttl time.Duration) context.Context
type CachePolicy
type Client
- func NewClient(ctx context.Context, bucket *blob.Bucket, opts ...Option) (*Client, error)
- func (c *Client) Close()
- func (c *Client) Do(ctx context.Context, req *http.Request, cfgs ...DoConfig) (page *Page, err error)
- func (c *Client) Get(ctx context.Context, url string, cfgs ...DoConfig) (*Page, error)
- func (c *Client) GetMany(ctx context.Context, urls []string, concurrency int, cfg DoConfig, ...) error
- func (c *Client) Version(ctx context.Context, key string) (*Page, error)
- func (c *Client) Versions(ctx context.Context, req *http.Request) ([]PageVersion, error)
type DoConfig
type Limiter
type Option
- func WithBrowser() Option
- func WithCacheStatuses(codes ...int) Option
- func WithChromiumSandbox(enabled bool) Option
- func WithIgnoreHeaders(names ...string) Option
- func WithIgnoreParams(names ...string) Option
- func WithRateLimit(rps int, opts ...ratelimit.Option) Option
- func WithRequestBodyLimit(n int64) Option
- func WithResponseBodyLimit(n int64) Option
- func WithRetry(cfg RetryConfig) Option
- func WithUserAgent(ua string) Option
type Page
- func (p *Page) HTTPResponse() *http.Response
- func (p *Page) Stale() bool
- func (p *Page) StaleAfter(maxAge time.Duration) bool
type PageDiff
- func Diff(a, b *Page) PageDiff
type PageMeta
type PageRequest
type PageResponse
type PageVersion
type RetryConfig
type StatusError
- func (e *StatusError) Error() string
type ThrottledError
- func (e *ThrottledError) Error() string
type Transport
- func NewTransport(bucket *blob.Bucket, opts ...TransportOption) *Transport
- func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)
- func (t *Transport) Stats() TransportStatsSnapshot
type TransportOption
- func TransportWithCacheStatuses(codes ...int) TransportOption
- func TransportWithIgnoreHeaders(names ...string) TransportOption
- func TransportWithIgnoreParams(names ...string) TransportOption
- func TransportWithRateLimit(rps int, opts ...ratelimit.Option) TransportOption
- func TransportWithRequestBodyLimit(n int64) TransportOption
- func TransportWithResponseBodyLimit(n int64) TransportOption
- func TransportWithUserAgent(ua string) TransportOption
type TransportStats
type TransportStatsSnapshot

Functions ¶

func WithCachePolicy ¶

func WithCachePolicy(ctx context.Context, p CachePolicy) context.Context

WithCachePolicy returns a context that carries the given cache policy.
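
For readers unfamiliar with carrying per-request settings in a context, the following stand-alone sketch shows the general pattern. It is not limpet's implementation: cachePolicy, policyKey, withPolicy, policyFrom, plan, and planFor are hypothetical names, and the read/write behavior per policy is taken from the CachePolicy constant comments in this documentation.

```go
package main

import (
	"context"
	"fmt"
)

type cachePolicy int

const (
	policyDefault cachePolicy = iota // read on hit, write on miss
	policyReplace                    // skip read, still write
	policySkip                       // no read, no write
)

// policyKey is an unexported type so no other package can collide with it.
type policyKey struct{}

// withPolicy returns a child context carrying p.
func withPolicy(ctx context.Context, p cachePolicy) context.Context {
	return context.WithValue(ctx, policyKey{}, p)
}

// policyFrom extracts the policy, falling back to the default when unset.
func policyFrom(ctx context.Context) cachePolicy {
	if p, ok := ctx.Value(policyKey{}).(cachePolicy); ok {
		return p
	}
	return policyDefault
}

// plan reports whether a request made with ctx would read and write the cache.
func plan(ctx context.Context) (read, write bool) {
	switch policyFrom(ctx) {
	case policyReplace:
		return false, true
	case policySkip:
		return false, false
	default:
		return true, true
	}
}

// planFor is a convenience wrapper used by the demo below.
func planFor(p cachePolicy) (read, write bool) {
	return plan(withPolicy(context.Background(), p))
}

func main() {
	read, write := plan(context.Background()) // no policy set: default
	fmt.Println("default:", read, write)
	read, write = planFor(policyReplace)
	fmt.Println("replace:", read, write)
	read, write = planFor(policySkip)
	fmt.Println("skip:   ", read, write)
}
```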

func WithCacheTTL ¶ added in v0.2.0

func WithCacheTTL(ctx context.Context, ttl time.Duration) context.Context

WithCacheTTL returns a context that overrides the default cache TTL for writes made with this context. Works with both Client and Transport. Use 0 for no expiry, or a positive duration for a custom TTL.

Types ¶

type CachePolicy ¶

type CachePolicy int

CachePolicy controls per-request caching behavior.

const (
	// CachePolicyDefault reads from cache on hit, writes on miss (status 200).
	CachePolicyDefault CachePolicy = iota
	// CachePolicyReplace skips cache read but still writes on status 200.
	CachePolicyReplace
	// CachePolicySkip bypasses cache entirely (no read, no write).
	CachePolicySkip
)

type Client ¶

type Client struct {
	// contains filtered or unexported fields
}

Client fetches web pages with automatic caching.

func NewClient ¶

func NewClient(
	ctx context.Context,
	bucket *blob.Bucket,
	opts ...Option,
) (*Client, error)

NewClient creates a new Client with the given blob bucket and options.

func (*Client) Close ¶

func (c *Client) Close()

Close shuts down the browser (if started) and releases resources.

func (*Client) Do ¶

func (c *Client) Do(
	ctx context.Context,
	req *http.Request,
	cfgs ...DoConfig,
) (page *Page, err error)

Do fetches the given request, returning a cached result if available. Pass a DoConfig to control caching, browser mode, and rate limiting.

func (*Client) Get ¶

func (c *Client) Get(ctx context.Context, url string, cfgs ...DoConfig) (*Page, error)

Get is a convenience method for fetching a URL with GET.

func (*Client) GetMany ¶

func (c *Client) GetMany(
	ctx context.Context,
	urls []string,
	concurrency int,
	cfg DoConfig,
	fn func(url string, page *Page, err error) error,
) error

GetMany fetches multiple URLs concurrently with the given concurrency limit. The callback fn is called for each completed fetch (in arbitrary order). Stops early if ctx is cancelled. Returns the first non-nil error from fn, or nil if all callbacks return nil.
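
The bounded-concurrency behavior described above can be sketched with a semaphore channel. This is an illustration of the general pattern, not GetMany's actual code: forEachConcurrent and sumLengths are hypothetical helpers, the work function stands in for a fetch, and serializing callbacks under a mutex is an assumption of this sketch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// forEachConcurrent runs work on each item with at most `concurrency`
// goroutines in flight and reports each result to fn. It returns the first
// non-nil error from fn and stops scheduling new work after that.
func forEachConcurrent(
	ctx context.Context,
	items []string,
	concurrency int,
	work func(context.Context, string) (int, error),
	fn func(item string, n int, err error) error,
) error {
	if concurrency < 1 {
		concurrency = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	sem := make(chan struct{}, concurrency) // one slot per in-flight worker
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
loop:
	for _, item := range items {
		select {
		case <-ctx.Done():
			break loop // cancelled by the caller or by a failed callback
		case sem <- struct{}{}:
		}
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }()
			n, err := work(ctx, item)
			mu.Lock()
			defer mu.Unlock()
			if firstErr != nil {
				return // a callback already failed; drop late results
			}
			if cbErr := fn(item, n, err); cbErr != nil {
				firstErr = cbErr
				cancel()
			}
		}(item)
	}
	wg.Wait()
	return firstErr
}

// sumLengths uses len(item) as a stand-in for a fetch result.
func sumLengths(items []string, concurrency int) (int, error) {
	total := 0
	err := forEachConcurrent(context.Background(), items, concurrency,
		func(_ context.Context, s string) (int, error) { return len(s), nil },
		func(_ string, n int, err error) error {
			if err != nil {
				return err
			}
			total += n // safe: callbacks run one at a time in this sketch
			return nil
		})
	return total, err
}

func main() {
	total, err := sumLengths([]string{"a", "bb", "ccc", "dddd"}, 2)
	fmt.Println(total, err)
}
```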

func (*Client) Version ¶

func (c *Client) Version(ctx context.Context, key string) (*Page, error)

Version reads a specific archived page snapshot by its key.

func (*Client) Versions ¶

func (c *Client) Versions(ctx context.Context, req *http.Request) ([]PageVersion, error)

Versions lists all archived snapshots for the given request, ordered by fetch time (oldest first). Returns nil if no archive entries exist.

type DoConfig ¶

type DoConfig struct {
	// Replace skips cache read, forcing a fresh fetch (still caches the result).
	Replace bool
	// Browser uses headless Chromium instead of plain HTTP.
	Browser bool
	// Archive stores a timestamped snapshot alongside the latest cache entry.
	// Use Client.Versions to list snapshots and detect changes over time.
	Archive bool
	// SilentThrottle detects and retries when a site silently serves
	// captcha/block pages matching this regexp.
	SilentThrottle *regexp.Regexp
	// Limiter applies a per-request rate limiter instead of the client default.
	Limiter Limiter
}

DoConfig controls per-request behavior for Client.Do and Client.Get.

type Limiter ¶

type Limiter interface {
	Take() time.Time
}

Limiter is a rate limiter interface compatible with go.uber.org/ratelimit.
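
Any type with a Take() time.Time method satisfies this interface. The sketch below is one hypothetical implementation that spaces calls a fixed interval apart; intervalLimiter and newIntervalLimiter are illustrative names, and the Limiter interface is redeclared locally so the example compiles without the library.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter mirrors the documented interface.
type Limiter interface {
	Take() time.Time
}

// intervalLimiter allows one Take per `every`, blocking callers as needed.
type intervalLimiter struct {
	mu    sync.Mutex
	every time.Duration
	next  time.Time // earliest time the next Take may return
}

// Compile-time check that intervalLimiter satisfies Limiter.
var _ Limiter = (*intervalLimiter)(nil)

func newIntervalLimiter(every time.Duration) *intervalLimiter {
	return &intervalLimiter{every: every}
}

// Take blocks until the next slot is available and returns the slot time.
// The mutex is held while sleeping so concurrent callers queue up in turn.
func (l *intervalLimiter) Take() time.Time {
	l.mu.Lock()
	defer l.mu.Unlock()
	t := time.Now()
	if t.Before(l.next) {
		time.Sleep(l.next.Sub(t))
		t = l.next
	}
	l.next = t.Add(l.every)
	return t
}

func main() {
	l := newIntervalLimiter(10 * time.Millisecond)
	for i := 0; i < 3; i++ {
		fmt.Println(l.Take().Format(time.StampMilli))
	}
}
```

A value like this could be passed as DoConfig.Limiter, assuming the library only requires the Take method shown in the interface.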

type Option ¶

type Option func(*Client)

Option configures a Client at construction time.

func WithBrowser ¶

func WithBrowser() Option

WithBrowser configures the client to always use headless browser.

func WithCacheStatuses ¶ added in v0.2.0

func WithCacheStatuses(codes ...int) Option

WithCacheStatuses sets which HTTP status codes are eligible for caching. By default only 200 is cached. Use this to also cache redirects, 404s, etc.

func WithChromiumSandbox ¶ added in v0.2.0

func WithChromiumSandbox(enabled bool) Option

WithChromiumSandbox controls whether the headless Chromium browser runs with OS-level sandboxing. Defaults to true. Set to false in environments where sandboxing is unsupported (e.g. CI containers without suid sandbox).

func WithIgnoreHeaders ¶ added in v0.2.0

func WithIgnoreHeaders(names ...string) Option

WithIgnoreHeaders excludes the named headers from cache key computation. Useful for scraping where User-Agent or Accept-Encoding vary between requests but should map to the same cache entry.

func WithIgnoreParams ¶ added in v0.2.0

func WithIgnoreParams(names ...string) Option

WithIgnoreParams excludes the named query parameters from cache key computation. Useful for stripping auth tokens, timestamps, or tracking params (utm_source, etc.) that vary between requests to the same resource.

func WithRateLimit ¶

func WithRateLimit(rps int, opts ...ratelimit.Option) Option

WithRateLimit sets a programmatic rate limit, overriding the env var.

func WithRequestBodyLimit ¶ added in v0.2.0

func WithRequestBodyLimit(n int64) Option

WithRequestBodyLimit sets the maximum request body size used for cache key computation. 0 means no limit. Default: 10 MB.

func WithResponseBodyLimit ¶ added in v0.2.0

func WithResponseBodyLimit(n int64) Option

WithResponseBodyLimit sets the maximum response body size to read and cache. 0 means no limit. Default: 100 MB.

func WithRetry ¶ added in v0.2.0

func WithRetry(cfg RetryConfig) Option

WithRetry configures retry behavior. Zero-value fields keep defaults.

func WithUserAgent ¶ added in v0.2.0

func WithUserAgent(ua string) Option

WithUserAgent sets a default User-Agent header on all HTTP requests. The header is added before each request if not already set by the caller.

type Page ¶

type Page struct {
	Meta     PageMeta     `json:"meta"`
	Request  PageRequest  `json:"request"`
	Response PageResponse `json:"response"`
}

Page is a cached HTTP request/response pair with metadata.

func (*Page) HTTPResponse ¶

func (p *Page) HTTPResponse() *http.Response

HTTPResponse reconstructs a standard *http.Response from the cached page. The returned response has its own header map (safe for concurrent use).

func (*Page) Stale ¶

func (p *Page) Stale() bool

Stale reports whether this cached page is stale according to HTTP cache semantics (Cache-Control max-age, Expires header). Returns false if the response has no cache directives (treat as fresh). This is informational -- the caller decides whether to re-fetch.
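
A rough sketch of this kind of check is shown below. It is an approximation under stated assumptions, not the library's logic: staleByHeaders and check are hypothetical helpers, only max-age and Expires are considered, and max-age takes precedence when both are present.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// staleByHeaders reports whether a response fetched at fetchedAt is stale at
// now, based on Cache-Control max-age or, failing that, the Expires header.
// With no usable directive it returns false (treat as fresh).
func staleByHeaders(h http.Header, fetchedAt, now time.Time) bool {
	for _, part := range strings.Split(h.Get("Cache-Control"), ",") {
		name, value, _ := strings.Cut(strings.TrimSpace(part), "=")
		if strings.EqualFold(name, "max-age") {
			secs, err := strconv.Atoi(strings.Trim(value, `"`))
			if err == nil {
				return now.Sub(fetchedAt) > time.Duration(secs)*time.Second
			}
		}
	}
	if exp := h.Get("Expires"); exp != "" {
		if t, err := http.ParseTime(exp); err == nil {
			return now.After(t)
		}
	}
	return false
}

// check evaluates a Cache-Control value at a given age since a fixed,
// made-up fetch time.
func check(cacheControl string, age time.Duration) bool {
	fetched := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	h := http.Header{}
	if cacheControl != "" {
		h.Set("Cache-Control", cacheControl)
	}
	return staleByHeaders(h, fetched, fetched.Add(age))
}

func main() {
	fmt.Println(check("public, max-age=60", 30*time.Second))
	fmt.Println(check("public, max-age=60", 61*time.Second))
	fmt.Println(check("", 24*time.Hour))
}
```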

func (*Page) StaleAfter ¶

func (p *Page) StaleAfter(maxAge time.Duration) bool

StaleAfter reports whether this cached page was fetched more than maxAge ago. Use this for scraping targets that send no HTTP cache headers.

type PageDiff ¶

type PageDiff struct {
	Changed    bool
	OldSize    int
	NewSize    int
	OldFetched time.Time
	NewFetched time.Time
}

PageDiff describes the difference between two page snapshots.

func Diff ¶

func Diff(a, b *Page) PageDiff

Diff compares two pages and returns whether the response body changed.
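
One simple way to detect a body change is to compare SHA-256 digests, which matches the BodyHash field documented on PageVersion. Whether Diff itself compares hashes or raw bytes is not stated here, so the sketch below is an assumption; bodyDiff, bodyHash, and diffBodies are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// bodyDiff is a trimmed-down stand-in for the documented PageDiff.
type bodyDiff struct {
	Changed bool
	OldSize int
	NewSize int
}

// bodyHash returns the SHA-256 hex digest of b.
func bodyHash(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// diffBodies reports whether two response bodies differ, plus their sizes.
func diffBodies(oldBody, newBody []byte) bodyDiff {
	return bodyDiff{
		Changed: bodyHash(oldBody) != bodyHash(newBody),
		OldSize: len(oldBody),
		NewSize: len(newBody),
	}
}

func main() {
	d := diffBodies([]byte("<h1>v1</h1>"), []byte("<h1>v2!</h1>"))
	fmt.Printf("changed=%v old_size=%d new_size=%d\n", d.Changed, d.OldSize, d.NewSize)
}
```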

type PageMeta ¶

type PageMeta struct {
	Version   uint16        `json:"version"`
	Source    string        `json:"-"`
	FetchedAt time.Time     `json:"fetched_at"`
	FetchDur  time.Duration `json:"fetch_dur"`
}

PageMeta contains cache metadata for a fetched page.

type PageRequest ¶

type PageRequest struct {
	URL           string      `json:"url"`
	RedirectedURL string      `json:"redirected_url,omitempty"`
	Method        string      `json:"method"`
	Header        http.Header `json:"header,omitempty"`
	Body          []byte      `json:"body,omitempty"`
}

PageRequest stores the original HTTP request details.

type PageResponse ¶

type PageResponse struct {
	StatusCode       int         `json:"status_code"`
	ProtoMajor       int         `json:"proto_major"`
	ProtoMinor       int         `json:"proto_minor"`
	TransferEncoding []string    `json:"transfer_encoding,omitempty"`
	ContentLength    int64       `json:"content_length"`
	Header           http.Header `json:"header"`
	Body             []byte      `json:"body"`
	Trailer          http.Header `json:"trailer,omitempty"`
}

PageResponse stores the HTTP response details including the body.

type PageVersion ¶

type PageVersion struct {
	Key       string    // Cache key for this snapshot.
	FetchedAt time.Time // When this snapshot was fetched.
	BodyHash  string    // SHA-256 hex digest of the response body.
}

PageVersion describes a single archived snapshot of a cached page.

type RetryConfig ¶ added in v0.2.0

type RetryConfig struct {
	// Attempts is the maximum number of tries (including the first). Default: 5.
	Attempts int
	// MinWait is the base wait duration for exponential backoff. Default: 1s.
	MinWait time.Duration
	// MaxWait caps the backoff duration. Default: 1m.
	MaxWait time.Duration
	// Jitter adds random jitter up to this duration per attempt. Default: 1s.
	Jitter time.Duration
}

RetryConfig controls retry behavior for failed HTTP requests.
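
To show how these fields could interact, here is a sketch of capped exponential backoff with additive jitter. The doubling schedule and the way jitter is applied are assumptions, not the library's documented algorithm; retryConfig, withDefaults, backoff, and jittered are hypothetical names. The defaults mirror those listed in the field comments.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryConfig is a local mirror of the documented RetryConfig fields.
type retryConfig struct {
	Attempts int
	MinWait  time.Duration
	MaxWait  time.Duration
	Jitter   time.Duration
}

// withDefaults fills zero-value fields with the documented defaults.
func (c retryConfig) withDefaults() retryConfig {
	if c.Attempts == 0 {
		c.Attempts = 5
	}
	if c.MinWait == 0 {
		c.MinWait = time.Second
	}
	if c.MaxWait == 0 {
		c.MaxWait = time.Minute
	}
	if c.Jitter == 0 {
		c.Jitter = time.Second
	}
	return c
}

// backoff returns the base wait before retry n (1-based), doubling from
// MinWait and capped at MaxWait. Jitter is not included.
func backoff(c retryConfig, n int) time.Duration {
	d := c.MinWait
	for i := 1; i < n; i++ {
		d *= 2
		if d >= c.MaxWait {
			return c.MaxWait
		}
	}
	if d > c.MaxWait {
		d = c.MaxWait
	}
	return d
}

// jittered adds a random duration in [0, jitter] to d.
func jittered(d, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return d
	}
	return d + time.Duration(rand.Int63n(int64(jitter)+1))
}

func main() {
	c := retryConfig{Attempts: 3, MinWait: 2 * time.Second}.withDefaults()
	// The first attempt has no wait; waits apply before each retry.
	for n := 1; n < c.Attempts; n++ {
		base := backoff(c, n)
		fmt.Printf("retry %d: base=%s with jitter=%s\n", n, base, jittered(base, c.Jitter))
	}
}
```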

type StatusError ¶

type StatusError struct {
	Page *Page
}

StatusError is returned when the HTTP status is not 200 OK. The Page contains the response and status.

func (*StatusError) Error ¶

func (e *StatusError) Error() string

type ThrottledError ¶

type ThrottledError struct{}

ThrottledError is returned when the fetch is throttled, for example when the response matches the DoConfig.SilentThrottle pattern.

func (*ThrottledError) Error ¶

func (e *ThrottledError) Error() string

type Transport ¶

type Transport struct {
	// Base is the underlying RoundTripper. Nil means http.DefaultTransport.
	Base http.RoundTripper
	// contains filtered or unexported fields
}

Transport is an http.RoundTripper that caches responses in a blob.Bucket. Responses are fully buffered (no streaming). By default only HTTP 200 responses are cached; use TransportWithCacheStatuses to make other status codes eligible.

Use WithCachePolicy on the request context to control per-request caching.
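
The general shape of a caching http.RoundTripper can be sketched with an in-memory map in place of a blob bucket. This is a simplified illustration, not limpet's Transport: memTransport, roundTripFunc, the X-Source header name, and the method-plus-URL key are all assumptions, and it omits revalidation, TTLs, and request coalescing. A fake base transport is used so the demo makes no network calls.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

// roundTripFunc adapts a function to http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

type cached struct {
	status int
	header http.Header
	body   []byte
}

// memTransport caches fully buffered 200 responses in memory.
type memTransport struct {
	Base  http.RoundTripper // nil means http.DefaultTransport
	mu    sync.Mutex
	store map[string]cached
}

func (t *memTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	key := req.Method + " " + req.URL.String()
	t.mu.Lock()
	c, ok := t.store[key]
	t.mu.Unlock()
	if ok {
		return build(req, c, "cache"), nil
	}
	base := t.Base
	if base == nil {
		base = http.DefaultTransport
	}
	resp, err := base.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	body, err := io.ReadAll(resp.Body) // buffer the whole body (no streaming)
	resp.Body.Close()
	if err != nil {
		return nil, err
	}
	c = cached{status: resp.StatusCode, header: resp.Header.Clone(), body: body}
	if resp.StatusCode == http.StatusOK {
		t.mu.Lock()
		if t.store == nil {
			t.store = map[string]cached{}
		}
		t.store[key] = c
		t.mu.Unlock()
	}
	return build(req, c, "fetch"), nil
}

// build reconstructs a response with its own header map and body reader.
func build(req *http.Request, c cached, source string) *http.Response {
	h := c.header.Clone()
	if h == nil {
		h = http.Header{}
	}
	h.Set("X-Source", source)
	return &http.Response{
		StatusCode:    c.status,
		Proto:         "HTTP/1.1",
		ProtoMajor:    1,
		ProtoMinor:    1,
		Header:        h,
		Body:          io.NopCloser(bytes.NewReader(c.body)),
		ContentLength: int64(len(c.body)),
		Request:       req,
	}
}

// demo issues the same GET twice through a fake base transport and returns
// the X-Source value of each response plus the number of base calls.
func demo() (first, second string, calls int, err error) {
	base := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		calls++
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     http.Header{},
			Body:       io.NopCloser(bytes.NewReader([]byte("hello"))),
			Request:    r,
		}, nil
	})
	client := &http.Client{Transport: &memTransport{Base: base}}
	get := func() (string, error) {
		resp, err := client.Get("http://example.invalid/")
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		return resp.Header.Get("X-Source"), nil
	}
	if first, err = get(); err != nil {
		return
	}
	second, err = get()
	return
}

func main() {
	fmt.Println(demo())
}
```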

func NewTransport ¶

func NewTransport(bucket *blob.Bucket, opts ...TransportOption) *Transport

NewTransport creates a caching Transport backed by the given bucket.

func (*Transport) RoundTrip ¶

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip executes a single HTTP transaction with caching.

func (*Transport) Stats ¶ added in v0.2.0

func (t *Transport) Stats() TransportStatsSnapshot

Stats returns a snapshot of the transport's cache performance counters.

type TransportOption ¶

type TransportOption func(*Transport)

TransportOption configures a Transport.

func TransportWithCacheStatuses ¶ added in v0.2.0

func TransportWithCacheStatuses(codes ...int) TransportOption

TransportWithCacheStatuses sets which HTTP status codes are eligible for caching. By default only 200 is cached.

func TransportWithIgnoreHeaders ¶ added in v0.2.0

func TransportWithIgnoreHeaders(names ...string) TransportOption

TransportWithIgnoreHeaders excludes the named headers from cache key computation. Useful when User-Agent or Accept-Encoding vary between requests but should map to the same cache entry.

func TransportWithIgnoreParams ¶ added in v0.2.0

func TransportWithIgnoreParams(names ...string) TransportOption

TransportWithIgnoreParams excludes the named query parameters from cache key computation. Useful for stripping auth tokens or tracking params.

func TransportWithRateLimit ¶

func TransportWithRateLimit(rps int, opts ...ratelimit.Option) TransportOption

TransportWithRateLimit sets a rate limit on outgoing requests.

func TransportWithRequestBodyLimit ¶

func TransportWithRequestBodyLimit(n int64) TransportOption

TransportWithRequestBodyLimit sets the maximum request body size used for cache key computation. 0 means no limit.

func TransportWithResponseBodyLimit ¶

func TransportWithResponseBodyLimit(n int64) TransportOption

TransportWithResponseBodyLimit sets the maximum response body size to cache. 0 means no limit.

func TransportWithUserAgent ¶ added in v0.2.0

func TransportWithUserAgent(ua string) TransportOption

TransportWithUserAgent sets a default User-Agent header on all requests.

type TransportStats ¶ added in v0.2.0

type TransportStats struct {
	// Hits counts cache hits (served from cache without fetch).
	Hits atomic.Int64
	// Misses counts cache misses (required a fetch).
	Misses atomic.Int64
	// Revalidated counts conditional requests that returned 304.
	Revalidated atomic.Int64
	// Coalesced counts requests served by singleflight coalescing.
	Coalesced atomic.Int64
}

TransportStats tracks cache performance counters. All fields are safe for concurrent access. Read via Transport.Stats().

type TransportStatsSnapshot ¶ added in v0.2.0

type TransportStatsSnapshot struct {
	Hits        int64
	Misses      int64
	Revalidated int64
	Coalesced   int64
}

TransportStatsSnapshot is a point-in-time snapshot of cache stats.

Directories ¶

Path	Synopsis
blob
cmd
limpet command
limpet/cmd
