scraper

package module

Version: v0.0.4 (latest) | Published: Jun 10, 2026 | License: MIT | Imports: 0 | Imported by: 0


Repository

github.com/go-go-golems/scraper


scraper is a durable workflow-driven scraping engine.

Go owns:

workflow persistence
scheduling and leases
retries and queue policies
worker runners (js, http/fetch)
CLI and HTTP API hosts

JavaScript owns most site-specific behavior:

submit verbs under sites/<site>/verbs/
durable op scripts under sites/<site>/scripts/
site-specific SQL projections under sites/<site>/migrations/

Site definitions are loaded from filesystem manifest directories during bootstrap, before the Cobra command tree is built. As a result, commands such as site js-demo run seed exist only when scraper knows where the site manifests live.
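To illustrate why the command set depends on the manifest directories, here is a minimal Go sketch that derives command paths from a directory on disk. It is not scraper's actual bootstrap code: the real loader reads site.yaml through pkg/sites/manifest and registers Cobra commands. The one-file-per-verb naming (<verb>.js) and the helper names commandPath and discoverCommands are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// commandPath renders the CLI path for one submit verb of one site.
func commandPath(site, verb string) string {
	return fmt.Sprintf("site %s run %s", site, verb)
}

// discoverCommands scans <manifestDir>/<site>/verbs/ and returns one command
// path per .js file found. Assumption: each verb lives in a file named
// <verb>.js; the real engine reads site.yaml instead.
func discoverCommands(manifestDir string) ([]string, error) {
	sites, err := os.ReadDir(manifestDir)
	if err != nil {
		return nil, err
	}
	var cmds []string
	for _, site := range sites {
		if !site.IsDir() {
			continue
		}
		verbs, err := os.ReadDir(filepath.Join(manifestDir, site.Name(), "verbs"))
		if err != nil {
			continue // a site without a verbs/ directory contributes no commands
		}
		for _, v := range verbs {
			if v.IsDir() || filepath.Ext(v.Name()) != ".js" {
				continue
			}
			verb := strings.TrimSuffix(v.Name(), ".js")
			cmds = append(cmds, commandPath(site.Name(), verb))
		}
	}
	return cmds, nil
}

func main() {
	dir := "./sites"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	cmds, err := discoverCommands(dir)
	if err != nil {
		// No manifest directory means no site commands at all.
		fmt.Fprintln(os.Stderr, "no manifests loaded:", err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

If the directory cannot be read, the sketch produces no commands at all, which mirrors the symptom of a missing site <name> run <verb> command.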

Repository layout

cmd/scraper/ — main CLI entrypoint
pkg/cmd/ — root command, bootstrap config, worker/api/site commands
pkg/engine/ — durable engine model, scheduler, runner registry, SQLite store
pkg/js/runtime/ — goja runtime and JS host APIs
pkg/sites/manifest/ — site.yaml loading and validation
sites/ — default site manifests, JS verbs/scripts, migrations, fixtures
pkg/doc/ — embedded help pages
web/ — frontend

Bootstrap site loading

scraper discovers site manifests from three sources, in this order:

app config file (~/.scraper/config.yaml)
environment variable (SCRAPER_SITES_MANIFEST_DIRS)
bootstrap CLI flags (--sites-manifest-dir)

Example config:

sitesManifestDirs:
  - /absolute/path/to/sites
  - /another/path/to/sites

Example environment variable:

export SCRAPER_SITES_MANIFEST_DIRS="/path/to/sites-a:/path/to/sites-b"
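A variable of this shape can be split in Go with the standard library. The sketch below assumes the value is a list joined by the OS path-list separator (":" on Unix, as in the example above); the helper name splitManifestDirs is hypothetical and scraper's own parsing may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// splitManifestDirs splits a path list such as "/a:/b" into its entries,
// dropping empty elements. filepath.SplitList uses the OS-specific list
// separator (':' on Unix, ';' on Windows).
func splitManifestDirs(raw string) []string {
	var dirs []string
	for _, d := range filepath.SplitList(raw) {
		if d != "" {
			dirs = append(dirs, d)
		}
	}
	return dirs
}

func main() {
	for _, d := range splitManifestDirs(os.Getenv("SCRAPER_SITES_MANIFEST_DIRS")) {
		fmt.Println(d)
	}
}
```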

Example CLI usage:

go run ./cmd/scraper --sites-manifest-dir ./sites site js-demo run seed --help
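The README states the order of the three sources but not how they are combined. The Go sketch below shows one plausible reading, in which the lists are concatenated in order and duplicates are dropped. That merge policy and the helper name mergeManifestDirs are assumptions, not confirmed behavior; the embedded help page scraper-bootstrap-config-and-site-manifest-loading is the authoritative description.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// mergeManifestDirs concatenates directory lists in the given order and drops
// duplicates, comparing cleaned paths. Assumption: later sources add to
// earlier ones instead of replacing them.
func mergeManifestDirs(sources ...[]string) []string {
	seen := make(map[string]bool)
	var merged []string
	for _, src := range sources {
		for _, dir := range src {
			clean := filepath.Clean(dir)
			if seen[clean] {
				continue
			}
			seen[clean] = true
			merged = append(merged, clean)
		}
	}
	return merged
}

func main() {
	config := []string{"/absolute/path/to/sites", "/another/path/to/sites"}
	env := []string{"/path/to/sites-a", "/another/path/to/sites/"} // duplicate after cleaning
	flags := []string{"./sites"}
	for _, dir := range mergeManifestDirs(config, env, flags) {
		fmt.Println(dir)
	}
}
```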

Dev environment with devctl

From the scraper/ directory, start the full local development stack with:

devctl up

This launches:

redis via Docker on 127.0.0.1:6379 for cross-process runtime event transport;
the scraper API on http://127.0.0.1:8080;
a scraper worker connected to the same engine/site state;
the Vite frontend on http://127.0.0.1:5173.
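To check that the listening services are reachable after devctl up, a small TCP probe is enough. This Go sketch is not part of devctl; the addresses come from the list above, while the names devServices and reachable are hypothetical. The worker is left out because no listening address is given for it.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// service pairs a label with the TCP address it is expected to listen on.
type service struct {
	Name string
	Addr string
}

// devServices mirrors the addresses listed for the local development stack.
var devServices = []service{
	{Name: "redis", Addr: "127.0.0.1:6379"},
	{Name: "api", Addr: "127.0.0.1:8080"},
	{Name: "web", Addr: "127.0.0.1:5173"},
}

// reachable reports whether a TCP connection to addr succeeds within timeout.
func reachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	for _, s := range devServices {
		state := "down"
		if reachable(s.Addr, 500*time.Millisecond) {
			state = "up"
		}
		fmt.Printf("%-5s %-16s %s\n", s.Name, s.Addr, state)
	}
}
```

A successful TCP connection only shows that something is listening on the port; devctl status remains the better source for process-level state.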

Useful commands:

devctl plugins list
devctl validate
devctl plan
devctl status --tail-lines 10
devctl logs --service api --follow
devctl logs --service web --follow
devctl down

Devctl stores local runtime databases under state/devctl/ and process logs under .devctl/logs/; both are ignored by git.

Quickstart

1. Run the test suite

go test ./... -count=1

2. Submit a simple workflow

tmpdir=$(mktemp -d)

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  site js-demo run seed \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --workflow-id demo-1 \
  --count 3 \
  --multiplier 4 \
  --prefix smoke

3. Run the worker

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  worker run \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --max-cycles 16 \
  --poll-interval 5ms

4. Inspect engine state

go run ./cmd/scraper engine status --engine-db "$tmpdir/engine.db"

HTTP API quickstart

Start the API server:

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  api serve \
  --address 127.0.0.1:8080 \
  --engine-db /tmp/scraper-http-api/engine.db \
  --sites-dir /tmp/scraper-http-api/sites

Submit a workflow:

curl -X POST http://127.0.0.1:8080/api/v1/sites/js-demo/verbs/seed:submit \
  -H 'Content-Type: application/json' \
  -d '{
    "workflowID": "demo-http-001",
    "values": {
      "count": 3,
      "multiplier": 4,
      "prefix": "http"
    }
  }'
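The same request can be sent from Go with net/http. The sketch below mirrors the curl call: the URL path and the workflowID and values fields are taken from it, while the type submitBody and the helper newSubmitRequest are hypothetical names. The response format is not documented here, so the sketch only prints the HTTP status.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitBody mirrors the JSON payload used in the curl example.
type submitBody struct {
	WorkflowID string         `json:"workflowID"`
	Values     map[string]any `json:"values"`
}

// newSubmitRequest builds the POST request for a site's submit verb.
func newSubmitRequest(baseURL, site, verb, workflowID string, values map[string]any) (*http.Request, error) {
	payload, err := json.Marshal(submitBody{WorkflowID: workflowID, Values: values})
	if err != nil {
		return nil, err
	}
	endpoint := fmt.Sprintf("%s/api/v1/sites/%s/verbs/%s:submit", baseURL, site, verb)
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newSubmitRequest("http://127.0.0.1:8080", "js-demo", "seed", "demo-http-001",
		map[string]any{"count": 3, "multiplier": 4, "prefix": "http"})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Typically means the API server from the previous step is not running.
		fmt.Println("submit failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```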

Then run the worker against the same engine database and sites directory:

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  worker run \
  --engine-db /tmp/scraper-http-api/engine.db \
  --sites-dir /tmp/scraper-http-api/sites \
  --max-cycles 16 \
  --poll-interval 5ms

Help topics

Useful embedded help pages:

go run ./cmd/scraper --sites-manifest-dir ./sites help scraper-architecture-overview
go run ./cmd/scraper --sites-manifest-dir ./sites help scraper-runtime-model
go run ./cmd/scraper --sites-manifest-dir ./sites help scraper-bootstrap-config-and-site-manifest-loading
go run ./cmd/scraper --sites-manifest-dir ./sites help scraper-new-developer-onboarding
go run ./cmd/scraper --sites-manifest-dir ./sites help scraper-adding-a-declarative-site

Current default site set

The repo currently ships a small default set under sites/, ordered from simplest to most complex:

js-demo — pure JS workflow path
hackernews — JS + HTTP + JS
slashdot — alternate HTML shape and pagination
nereval — more complex fan-out and normalized projections

Notes

--sites-dir is the runtime directory for per-site SQLite databases.
--sites-manifest-dir is the bootstrap directory for site definitions.
If site <name> run <verb> is missing, scraper probably did not load the right manifest directories during bootstrap.


Directories

Path	Synopsis
cmd
scraper command
gen
proto/scraper/runtime/sessionstream/v1
proto/scraper/runtime/v1
pkg
api/handlers
api/server
api/types
cmd
doc
engine/config
engine/model
engine/runner
engine/scheduler
engine/store
engine/store/sqlite
js/runtime
metrics
runtimeevents
runtimeevents/sessionstream
services/catalog
services/engineview
services/submission
sites/defaults
sites/manifest
sites/migrate
sites/registry
sites/submitverbs
testfixtures
workflow
