ScrapeGoat 🕷️

Next-generation, enterprise-grade web scraping and crawling toolkit written in Go.

ScrapeGoat combines the best of Scrapy, Colly, and modern crawler services into a single, high-performance platform.



Features

| Feature | Details |
|---|---|
| High-Performance Crawling | Concurrent workers, per-domain throttling |
| CSS, XPath & Regex Extraction | goquery CSS selectors, XPath, named regex groups |
| Smart URL Management | Priority queue, deduplication, domain filters |
| robots.txt Compliance | Automatic parsing and crawl-delay support |
| Multi-Format Export | JSON, JSONL, CSV with streaming writes |
| Search Engine Mode | Full-text index with headings, meta, link graph |
| AI-Powered Crawling | Summarize, NER, sentiment via Ollama/OpenAI |
| Prometheus Metrics | Built-in /metrics & /health endpoints |
| Proxy Rotation | Round-robin / random with health checking |
| Checkpoint Persistence | Pause/resume crawls with atomic state |
| Graceful Shutdown | SIGINT/SIGTERM handling with state preservation |
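Round-robin proxy rotation is one of the smaller mechanisms listed above; a minimal sketch of the idea follows. The names here are hypothetical illustrations, not the internal implementation, which also health-checks proxies and supports a random mode.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rotator hands out proxies round-robin. The atomic counter makes
// pick safe to call from many crawler workers at once.
type rotator struct {
	proxies []string
	next    atomic.Uint64
}

func (r *rotator) pick() string {
	n := r.next.Add(1) - 1
	return r.proxies[n%uint64(len(r.proxies))]
}

func main() {
	r := &rotator{proxies: []string{"http://p1:8080", "http://p2:8080"}}
	for i := 0; i < 4; i++ {
		fmt.Println(r.pick()) // alternates p1, p2, p1, p2
	}
}
```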

Quick Start

Install
```shell
git clone https://github.com/IshaanNene/ScrapeGoat
cd ScrapeGoat
make build
```
Basic Crawl
```shell
# Crawl a website (depth 2, JSON output)
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2

# Limit pages (great for testing)
./bin/scrapegoat crawl https://news.ycombinator.com --depth 1 --max-requests 30

# Single page (depth 0 = just the seed URL)
./bin/scrapegoat crawl https://en.wikipedia.org/wiki/Web_scraping --depth 0

# Multiple seed URLs
./bin/scrapegoat crawl https://quotes.toscrape.com https://books.toscrape.com --depth 1

# High concurrency, JSONL output
./bin/scrapegoat crawl https://quotes.toscrape.com \
  --depth 2 \
  --concurrency 20 \
  --delay 200ms \
  --format jsonl \
  --output ./results

# CSV output
./bin/scrapegoat crawl https://books.toscrape.com --depth 0 --format csv

# Custom User-Agent
./bin/scrapegoat crawl https://example.com --user-agent "MyBot/1.0 (+https://mybot.example.com)"

# Verbose logging (see every request)
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 1 --verbose

# Show current configuration
./bin/scrapegoat config
```
crawl Flags
| Flag | Short | Default | Description |
|---|---|---|---|
| --depth | -d | 3 | Maximum crawl depth |
| --concurrency | -n | 10 | Concurrent workers |
| --delay | | 1s | Politeness delay per domain |
| --format | -f | json | Output format: json, jsonl, csv |
| --output | -o | ./output | Output directory |
| --max-requests | -m | 0 (unlimited) | Maximum total requests |
| --max-retries | | 3 | Retries per failed request (0 = fail fast) |
| --allowed-domains | | (all) | Comma-separated domains to stay within |
| --user-agent | | (built-in) | Custom User-Agent string |
| --config | -c | | YAML config file path |
| --verbose | -v | | Enable debug logging |

> [!TIP]
> Use `--allowed-domains` when crawling Wikipedia, GitHub, or any site that links to many external pages. Without it, the crawler follows every outbound link.

```shell
# Stay locked to English Wikipedia only
./bin/scrapegoat crawl https://en.wikipedia.org/wiki/Web_scraping \
  --depth 1 --max-requests 30 --allowed-domains en.wikipedia.org --delay 300ms
```
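The effect of a domain allow-list like `--allowed-domains` can be sketched in a few lines of Go. This is an illustration of the general idea, not ScrapeGoat's internal filter; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// allowed reports whether a link's host is on the allow-list,
// mirroring what a domain filter typically does before enqueuing a URL.
func allowed(link string, domains []string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false // unparseable URLs are never followed
	}
	for _, d := range domains {
		if u.Hostname() == d {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"en.wikipedia.org"}
	fmt.Println(allowed("https://en.wikipedia.org/wiki/Web_scraping", allow)) // in-scope
	fmt.Println(allowed("https://example.com/page", allow))                   // outbound, skipped
}
```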

Search Engine Mode

Index a website with full-text, headings, meta, and link graph:

```shell
# Index a website (JSONL output, one document per page)
./bin/scrapegoat search https://go.dev

# Custom depth and page limit
./bin/scrapegoat search https://en.wikipedia.org/wiki/Artificial_intelligence \
  --depth 2 \
  --max-pages 100 \
  --output ./wiki_index

# Index multiple domains
./bin/scrapegoat search https://docs.python.org https://docs.rust-lang.org
```

Each indexed document contains: url, title, description, keywords, canonical, language, h1/h2/h3, body_text, word_count, outbound_links, images, content_hash, indexed_at.

search Flags: --depth (-d, default 3), --concurrency (-n, default 10), --delay (default 200ms), --max-pages (default 500), --allowed-domains (comma-separated, optional), --output (-o, default ./output/search_index).


AI-Powered Crawling

Crawl + summarize + entity extraction + sentiment analysis (requires an LLM):

```shell
# With Ollama (local, no API key needed)
ollama serve &
ollama pull llama3.2
./bin/scrapegoat ai-crawl https://news.ycombinator.com

# With OpenAI
OPENAI_API_KEY=sk-... ./bin/scrapegoat ai-crawl https://techcrunch.com --llm openai --model gpt-4o-mini

# Custom endpoint (any OpenAI-compatible API)
./bin/scrapegoat ai-crawl https://example.com \
  --llm custom \
  --llm-endpoint http://localhost:8080 \
  --model mistral
```

Each item gains three AI-generated fields: summary (about 200 words), entities (people, organizations, locations), and sentiment (positive/negative/neutral/mixed).

ai-crawl Flags: --depth (-d, default 2), --concurrency (-n, default 5), --delay (default 500ms), --max-pages (default 50), --llm (ollama/openai/custom), --model, --llm-endpoint, --output (-o).


Library (Go SDK)

Embed ScrapeGoat directly in your Go application:

```go
package main

import (
    "fmt"
    "strings"
    "time"

    scrapegoat "github.com/IshaanNene/ScrapeGoat/pkg/scrapegoat"
)

func main() {
    crawler := scrapegoat.NewCrawler(
        scrapegoat.WithConcurrency(5),
        scrapegoat.WithMaxDepth(2),
        scrapegoat.WithDelay(500 * time.Millisecond),
        scrapegoat.WithOutput("json", "./output"),
        scrapegoat.WithAllowedDomains("quotes.toscrape.com"),
        scrapegoat.WithMaxRequests(100),
    )

    // Follow pagination links
    crawler.OnHTML("li.next a[href]", func(e *scrapegoat.Element) {
        e.Follow(e.Attr("href"))
    })

    // Extract quotes
    crawler.OnHTML(".quote", func(e *scrapegoat.Element) {
        text := strings.TrimSpace(e.Selection.Find(".text").Text())
        author := strings.TrimSpace(e.Selection.Find(".author").Text())
        if text != "" {
            e.Item.Set("quote", text)
            e.Item.Set("author", author)
        }
    })

    if err := crawler.Start("https://quotes.toscrape.com"); err != nil {
        panic(err)
    }
    crawler.Wait()

    fmt.Println("Stats:", crawler.Stats())
}
```
SDK Options
```go
scrapegoat.WithConcurrency(10)               // parallel workers
scrapegoat.WithMaxDepth(3)                   // crawl depth
scrapegoat.WithDelay(500 * time.Millisecond) // politeness delay
scrapegoat.WithOutput("jsonl", "./out")      // format + path
scrapegoat.WithAllowedDomains("example.com") // domain filter
scrapegoat.WithProxy("http://p1:8080", "http://p2:8080") // proxy rotation
scrapegoat.WithRobotsRespect(true)           // obey robots.txt
scrapegoat.WithMaxRequests(500)              // request cap
scrapegoat.WithUserAgent("MyBot/1.0")        // custom UA
scrapegoat.WithVerbose()                     // debug logging
```

Ready-Made Examples

Run these directly; no configuration needed:

```shell
# Hacker News top stories (rank, title, URL, points, author, comments)
go run ./examples/hackernews/

# E-commerce product scraper (books.toscrape.com → title, price, rating, stock)
go run ./examples/ecommerce/

# GitHub trending repos (name, description, language, stars, forks)
go run ./examples/github/

# Wikipedia knowledge extractor (title, summary, categories, references)
go run ./examples/wikipedia/

# Multi-site news aggregator (HN + Techmeme + dev.to → headlines, previews)
go run ./examples/news/

# Search engine indexer (provide your own URLs)
go run ./examples/search/ https://go.dev
go run ./examples/search/ https://en.wikipedia.org/wiki/Web_scraping

# AI crawler (requires Ollama or OpenAI)
go run ./examples/aicrawl/ https://news.ycombinator.com
go run ./examples/aicrawl/ https://en.wikipedia.org/wiki/Artificial_intelligence
```

YAML Configuration

```yaml
# configs/default.yaml
engine:
  concurrency: 10
  max_depth: 5
  request_timeout: 30s
  politeness_delay: 1s
  respect_robots_txt: true
  max_retries: 3

fetcher:
  type: http
  follow_redirects: true
  max_body_size: 10485760  # 10MB

storage:
  type: json            # json | jsonl | csv
  output_path: ./output

proxy:
  enabled: false
  rotation: round_robin # round_robin | random
  urls: []

metrics:
  enabled: false
  port: 9090
  path: /metrics

parser:
  rules:
    - name: title
      selector: "h1"
      type: css
    - name: emails
      pattern: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
      type: regex
```
Use with: ./bin/scrapegoat crawl https://example.com --config configs/default.yaml
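The `emails` rule above is a plain regular expression, so its effect is easy to demonstrate with Go's standard `regexp` package. This sketch applies the same pattern to a fragment of HTML; it illustrates what a regex parser rule extracts, not the tool's internal parser code.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is the pattern from the YAML rule above, compiled once at startup.
var emailRe = regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)

func main() {
	html := `<p>Contact: sales@example.com or support@example.org</p>`
	// -1 means return every non-overlapping match, not just the first.
	fmt.Println(emailRe.FindAllString(html, -1))
}
```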


Testing

```shell
# Unit tests (fast, no network)
bash scripts/test.sh unit

# All benchmarks
bash scripts/test.sh bench

# Live integration tests (requires network)
bash scripts/test.sh integration

# Live crawl examples
bash scripts/test.sh crawl

# Full suite
bash scripts/test.sh all
```
Test Results

| Suite | Tests | Status |
|---|---|---|
| Engine (frontier, dedup, stats) | 7 | ✅ PASS |
| Parser (CSS, XPath, regex, structured) | 8 | ✅ PASS |
| Pipeline (trim, sanitize, PII, dates, dedup) | 9 | ✅ PASS |
| Total | 23 | ✅ 23/23 PASS |

Benchmarks (Apple M4 Pro)

| Component | Ops/sec | Latency |
|---|---|---|
| Frontier Push/Pop | 7.2M | 543 ns |
| Dedup Lookup | 13.9M | 255 ns |
| Pipeline (3 stages) | 2.6M | 1.4 µs |
| CSS Parse | 1.4M | 2.5 µs |
| Composite Parser | 54K | 66 µs |
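The dedup lookup benchmarked above is, at its core, a hashed set membership test. A minimal sketch of that kind of structure (hypothetical names, not the internal implementation, which would also normalize URLs and guard the map for concurrent use) might look like this:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// dedup remembers URLs by their 64-bit FNV-1a digest, so the set stores
// 8 bytes per entry instead of the full URL string.
type dedup struct {
	seen map[uint64]struct{}
}

func newDedup() *dedup { return &dedup{seen: make(map[uint64]struct{})} }

// Seen reports whether url was added before, recording it either way.
func (d *dedup) Seen(url string) bool {
	h := fnv.New64a()
	h.Write([]byte(url))
	k := h.Sum64()
	if _, ok := d.seen[k]; ok {
		return true
	}
	d.seen[k] = struct{}{}
	return false
}

func main() {
	d := newDedup()
	fmt.Println(d.Seen("https://example.com/a")) // first visit: false
	fmt.Println(d.Seen("https://example.com/a")) // duplicate: true
}
```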

Prometheus Metrics

Enable via config or env:

```shell
# With metrics server
SCRAPEGOAT_METRICS_ENABLED=true ./bin/scrapegoat crawl https://example.com

# Query while crawling
curl http://localhost:9090/health
curl http://localhost:9090/metrics
```

Docker

```shell
# Build image
make docker-build

# Start with Docker Compose (for Redis/monitoring)
make docker-up
```

Project Structure

```text
scrapegoat/
├── bin/scrapegoat           # Compiled binary
├── cmd/scrapegoat/          # CLI (crawl, search, ai-crawl, version, config)
├── pkg/scrapegoat/          # Public SDK for embedding
├── internal/
│   ├── engine/              # Scheduler, frontier, dedup, checkpoint, robots
│   ├── fetcher/             # HTTP fetcher, proxy rotation
│   ├── parser/              # CSS / XPath / regex / structured data (JSON-LD, OG)
│   ├── pipeline/            # Middleware: trim, sanitize, PII redact, date/currency normalize
│   ├── storage/             # JSON, JSONL, CSV writers
│   ├── ai/                  # LLM client, summarizer, NER, sentiment
│   ├── config/              # YAML + env config loading
│   └── observability/       # Prometheus metrics
├── examples/
│   ├── hackernews/          # HN top stories
│   ├── ecommerce/           # books.toscrape.com products
│   ├── github/              # GitHub trending repos
│   ├── wikipedia/           # Wikipedia deep crawl
│   ├── news/                # Multi-site news aggregator
│   ├── search/              # Search engine indexer
│   └── aicrawl/             # AI-powered crawler
├── tests/                   # Integration tests
├── configs/                 # Default YAML config
└── scripts/test.sh          # Test runner
```

Development

```shell
make build      # Build binary
make test       # Run all tests
make lint       # Run golangci-lint
make clean      # Clean build artifacts
make deps       # Download and tidy modules
make docker-up  # Start dev services
```

License

MIT

Directories

| Path | Synopsis |
|---|---|
| cmd/scrapegoat | Command-line entry point |
| examples/aicrawl | AI web crawler: crawls sites and uses an LLM to summarize, extract entities, and analyze sentiment. Requires Ollama running locally: `ollama serve && ollama pull llama3.2` |
| examples/ecommerce | E-commerce product scraper for books.toscrape.com. Extracts title, price, rating, availability, description, image |
| examples/github | GitHub trending scraper for today's trending repos. Extracts repo name, description, language, stars, forks, stars gained today |
| examples/hackernews | Hacker News scraper for front-page stories. Extracts rank, title, URL, points, author |
| examples/news | Multi-site news aggregator crawling multiple news sources. Extracts structured data (JSON-LD, OpenGraph, meta) from any news site |
| examples/search | Search engine crawler indexing pages with full text, metadata, and link graph. Produces a search-engine-style index: title, URL, body text, meta, headings, links |
| examples/wikipedia | Wikipedia knowledge extractor: deep crawl with a 1000-article limit. Extracts title, summary, categories, reference count, external links |
| internal/ai, internal/api, internal/seo | Internal packages |
| pkg/scrapegoat | Package scrapegoat provides a public SDK for embedding ScrapeGoat as a library |
