crawlobserver module v0.7.1
Published: Mar 4, 2026 · License: AGPL-3.0
CrawlObserver

Free, open-source SEO crawler built by SEObserver.
Extract 45+ SEO signals per page. Query millions of pages in milliseconds.

Quick Start · Web UI · CLI · Config · API · Contributing


[Screenshot: CrawlObserver Web UI]


Why CrawlObserver?

At SEObserver, we crawl billions of pages. We built CrawlObserver because every SEO deserves a proper crawler — not a spreadsheet with 10,000 rows, not a SaaS with monthly limits. A real tool that runs on your machine, stores data in a columnar database, and lets you query millions of pages in milliseconds.

We're giving it to the community for free. Use it, break it, improve it.

What it does

  • Crawls websites following internal links from seed URLs
  • Extracts 45+ SEO signals per page (title, canonical, meta tags, headings, hreflang, Open Graph, schema.org, images, links, indexability...)
  • Respects robots.txt and per-host crawl delays
  • Tracks redirect chains, response times, and body sizes
  • Stores everything in a columnar database for instant analytical queries
  • Computes PageRank and crawl depth per session
  • Comes with a web UI, a REST API, and a native desktop app

Quick Start

Prerequisites: Go 1.25+

# 1. Clone & build
git clone https://github.com/SEObserver/crawlobserver.git
cd crawlobserver
make build

# 2. Crawl a site
./crawlobserver crawl --seed https://example.com --max-pages 1000

# 3. Browse results
./crawlobserver serve
# Open http://127.0.0.1:8899

That's it. CrawlObserver automatically downloads and manages its own database on first run (macOS & Linux). No Docker, no manual setup.

Advanced: You can also point CrawlObserver at an existing database instance (Docker, remote server...). See the Configuration section for clickhouse.* settings.
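For illustration, a minimal config.yaml pointing at an external instance might look like the sketch below. Only the three clickhouse.* keys listed in the Configuration table are used; the port value is the standard ClickHouse native-protocol port for a stock install, not this project's managed default — check config.example.yaml for the authoritative key names.

```yaml
# Point CrawlObserver at an already-running ClickHouse instead of the
# managed embedded binary.
clickhouse:
  mode: external    # skip the managed subprocess
  host: 127.0.0.1   # e.g. a Docker container or remote server
  port: 9000        # stock ClickHouse native protocol port (managed mode defaults to 19000)
```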


Web UI

Start the web interface with ./crawlobserver serve and open http://127.0.0.1:8899.

The UI gives you:

  • Session management — start, stop, resume, delete crawl sessions
  • Page explorer — filter and browse crawled pages by status code, title, depth, word count...
  • Tabs — overview, titles, meta, headings, images, indexability, response codes, internal links, external links
  • PageRank — distribution histogram, treemap by path, top-N pages
  • robots.txt tester — view robots.txt per host and test URL access
  • Sitemap viewer — discover and browse sitemap trees
  • Real-time progress — live crawl stats via Server-Sent Events
  • Theming — custom accent color, logo, dark mode
  • API key management — project-scoped keys for programmatic access

The UI is a single Go binary — no Node.js runtime needed in production.


CLI Reference

crawlobserver [command]
Command Description
crawl Start a crawl session
serve Start the web UI
gui Start the native desktop app (macOS)
migrate Create or update database tables
sessions List all crawl sessions
report external-links Export external links (table or CSV)
update Check for updates and self-update
install-clickhouse Download database binary for offline use
version Print version

Crawl examples

# Single seed URL
crawlobserver crawl --seed https://example.com

# Multiple seeds from file (one URL per line)
crawlobserver crawl --seeds-file urls.txt

# Fine-tune the crawl
crawlobserver crawl --seed https://example.com \
  --workers 20 \
  --delay 500ms \
  --max-pages 50000 \
  --max-depth 10 \
  --store-html

Reports

# External links as a table
crawlobserver report external-links --format table

# Export to CSV
crawlobserver report external-links --format csv > external-links.csv

# Filter by session
crawlobserver report external-links --session <session-id> --format csv

Configuration

Copy config.example.yaml to config.yaml:

cp config.example.yaml config.yaml

All settings can be overridden via environment variables with the CRAWLOBSERVER_ prefix (e.g. CRAWLOBSERVER_CRAWLER_WORKERS=20) or via CLI flags.
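The naming rule is mechanical: upper-case the config key and swap dots for underscores, then add the prefix. A small sketch of that mapping (this mirrors the documented convention, not CrawlObserver's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// envName maps a config key like "crawler.workers" to its environment
// variable override: CRAWLOBSERVER_ prefix, upper-cased, dots replaced
// by underscores.
func envName(key string) string {
	return "CRAWLOBSERVER_" + strings.ToUpper(strings.ReplaceAll(key, ".", "_"))
}

func main() {
	fmt.Println(envName("crawler.workers")) // CRAWLOBSERVER_CRAWLER_WORKERS
	fmt.Println(envName("server.port"))     // CRAWLOBSERVER_SERVER_PORT
}
```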

Key settings

Setting Default Description
crawler.workers 10 Concurrent fetch workers
crawler.delay 1s Per-host request delay
crawler.max_pages 0 Max pages to crawl (0 = unlimited)
crawler.max_depth 0 Max crawl depth (0 = unlimited)
crawler.timeout 30s HTTP request timeout
crawler.user_agent CrawlObserver/1.0 User-Agent string
crawler.respect_robots true Obey robots.txt
crawler.store_html false Store raw HTML (ZSTD compressed)
crawler.crawl_scope host host, domain (eTLD+1), or subdirectory
clickhouse.host localhost Database host
clickhouse.port 19000 Database native protocol port
clickhouse.mode (auto) managed, external, or auto-detect
server.port 8899 Web UI port
server.username admin Basic auth username
server.password (generated) Basic auth password (random if not set)
resources.max_memory_mb 0 Memory soft limit (0 = auto)
resources.max_cpu 0 CPU limit / GOMAXPROCS (0 = all)

See config.example.yaml for the full reference.


Architecture

Seed URLs
    |
    v
Frontier  (priority queue, per-host delay, dedup)
    |
    v
Fetch Workers  (N goroutines, robots.txt cache, redirect tracking)
    |
    v
Parser  (goquery: 45+ SEO signals extracted)
    |
    v
Storage Buffer  (batch insert, configurable flush)
    |
    v
Columnar DB  (partitioned by crawl session, managed automatically)
    |
    |---> Web UI  (Svelte 5, embedded in binary)
    |---> REST API  (40+ endpoints)
    |---> CLI reports

Why a columnar database?

A crawl is a link graph, so why not a graph database? Because a crawler is an analytics pipeline, not a graph explorer. The questions you ask are analytical — "show me all pages with a missing H1 and a 301 canonical", "give me PageRank percentiles by subdirectory" — and columnar databases answer these instantly, even over millions of rows.

When we need graph algorithms (PageRank, crawl depth), we compute them in-memory in Go and write the results back. A million-page link graph fits in ~200MB of RAM and computes in seconds — no need for a graph database.
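The in-memory approach can be sketched in a few lines of Go: a toy power-iteration PageRank over an adjacency list. This is an illustration of the technique, not CrawlObserver's actual implementation.

```go
package main

import "fmt"

// pagerank runs iters steps of power iteration over an adjacency list
// (node -> outgoing links) with damping factor d. Dangling nodes spread
// their rank uniformly across all pages.
func pagerank(links map[int][]int, n int, d float64, iters int) []float64 {
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n)
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := range next {
			next[i] = (1 - d) / float64(n) // teleport term
		}
		for src := 0; src < n; src++ {
			outs := links[src]
			if len(outs) == 0 {
				// Dangling node: distribute its rank to everyone.
				for i := range next {
					next[i] += d * rank[src] / float64(n)
				}
				continue
			}
			share := d * rank[src] / float64(len(outs))
			for _, dst := range outs {
				next[dst] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	// 0 -> {1,2}, 1 -> {2}, 2 -> {0}: page 2 collects the most link equity.
	links := map[int][]int{0: {1, 2}, 1: {2}, 2: {0}}
	ranks := pagerank(links, 3, 0.85, 50)
	fmt.Printf("page0=%.3f page1=%.3f page2=%.3f\n", ranks[0], ranks[1], ranks[2])
}
```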

Under the hood, CrawlObserver uses ClickHouse in managed mode: it downloads a static binary and runs it as a subprocess. You interact with a single program, and in return get concurrent read/write access, columnar compression (~10:1), and instant session deletion.

Tech stack

Layer Technology
Crawler engine Go, net/http, goroutine pool, HTTP/2 (via utls ALPN negotiation)
TLS fingerprinting refraction-networking/utls (Chrome/Firefox/Edge profiles)
HTML parsing goquery (CSS selectors)
URL normalization purell + custom rules
robots.txt temoto/robotstxt
Storage ClickHouse (via clickhouse-go/v2)
API keys / sessions SQLite (modernc.org/sqlite)
Web UI Svelte 5, Vite (zero runtime dependencies)
Desktop app webview (macOS)
CLI Cobra + Viper

API

The REST API is available when running crawlobserver serve. All endpoints are under /api/.

Sessions

Method Endpoint Description
GET /api/sessions List all sessions
POST /api/crawl Start a new crawl
POST /api/sessions/:id/stop Stop a running crawl
POST /api/sessions/:id/resume Resume a stopped crawl
DELETE /api/sessions/:id Delete a session and its data
Pages & Links

Method Endpoint Description
GET /api/sessions/:id/pages Crawled pages (paginated, filterable)
GET /api/sessions/:id/links External links
GET /api/sessions/:id/internal-links Internal links
GET /api/sessions/:id/page-detail?url= Full detail for one URL
GET /api/sessions/:id/page-html?url= Raw HTML body

Analytics

Method Endpoint Description
GET /api/sessions/:id/stats Session statistics
GET /api/sessions/:id/events Live progress (SSE)
POST /api/sessions/:id/compute-pagerank Compute internal PageRank
POST /api/sessions/:id/recompute-depths Recompute crawl depths
GET /api/sessions/:id/pagerank-top Top pages by PageRank
GET /api/sessions/:id/pagerank-distribution PageRank histogram

robots.txt & Sitemaps

Method Endpoint Description
GET /api/sessions/:id/robots-hosts Hosts with robots.txt
GET /api/sessions/:id/robots-content robots.txt content
POST /api/sessions/:id/robots-test Test URLs against robots.txt
GET /api/sessions/:id/sitemaps Discovered sitemaps

Authentication: Basic Auth or API key (X-API-Key header).


Contributing

We welcome contributions. Please read CONTRIBUTING.md before submitting anything.

TL;DR:

  • Open an issue before starting significant work
  • One PR = one thing (don't mix features and refactors)
  • Write tests for new code
  • Run make test && make lint before pushing
  • Follow existing code style — don't reorganize what you didn't change

Acknowledgments

Thanks to the people who helped shape CrawlObserver with their feedback, testing, and ideas.


License

AGPL-3.0 — see LICENSE.

Built by SEObserver.
