crawlobserver (module)

v0.5.0 (latest) · Published: Mar 3, 2026 · License: AGPL-3.0

Repository

github.com/SEObserver/crawlobserver

CrawlObserver

Free, open-source SEO crawler built by SEObserver.
Extract 45+ SEO signals per page. Query millions of pages in milliseconds.

Quick Start · Web UI · CLI · Config · API · Contributing

Why CrawlObserver?

At SEObserver, we crawl billions of pages. We built CrawlObserver because every SEO deserves a proper crawler — not a spreadsheet with 10,000 rows, not a SaaS with monthly limits. A real tool that runs on your machine, stores data in a columnar database, and lets you query millions of pages in milliseconds.

We're giving it to the community for free. Use it, break it, improve it.

What it does

Crawls websites following internal links from seed URLs
Extracts 45+ SEO signals per page (title, canonical, meta tags, headings, hreflang, Open Graph, schema.org, images, links, indexability...)
Respects robots.txt and per-host crawl delays
Tracks redirect chains, response times, and body sizes
Stores everything in a columnar database for instant analytical queries
Computes PageRank and crawl depth per session
Comes with a web UI, a REST API, and a native desktop app
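
Redirect-chain tracking, one of the items above, can be sketched with the standard library's `http.Client.CheckRedirect` hook. This is a minimal illustration, not CrawlObserver's fetcher: `fetchWithChain`, `newDemoServer`, the 10-redirect cap, and the `/a`, `/b`, `/c` paths are all hypothetical names and values chosen for the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fetchWithChain fetches target and returns every URL visited, in order,
// ending with the final URL, plus the final status code.
// Hypothetical helper: it only illustrates the idea of recording a chain.
func fetchWithChain(target string) ([]string, int, error) {
	var chain []string
	client := &http.Client{
		// CheckRedirect runs before each redirect is followed; via holds the
		// requests made so far, oldest first.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 10 { // assumed cap, not a documented setting
				return fmt.Errorf("stopped after %d redirects", len(via))
			}
			chain = append(chain, via[len(via)-1].URL.String())
			return nil
		},
	}
	resp, err := client.Get(target)
	if err != nil {
		return chain, 0, err
	}
	defer resp.Body.Close()
	// resp.Request is the last request sent, i.e. the final hop.
	chain = append(chain, resp.Request.URL.String())
	return chain, resp.StatusCode, nil
}

// newDemoServer starts a local test server where /a redirects to /b,
// /b redirects to /c, and /c answers 200.
func newDemoServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/a", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/b", http.StatusMovedPermanently)
	})
	mux.HandleFunc("/b", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/c", http.StatusFound)
	})
	mux.HandleFunc("/c", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	chain, status, err := fetchWithChain(srv.URL + "/a")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("final status:", status)
	for i, hop := range chain {
		fmt.Printf("hop %d: %s\n", i, hop)
	}
}
```

The demo runs entirely against a local test server, so it needs no network access.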

Quick Start

Prerequisites: Go 1.25+

# 1. Clone & build
git clone https://github.com/SEObserver/crawlobserver.git
cd crawlobserver
make build

# 2. Crawl a site
./crawlobserver crawl --seed https://example.com --max-pages 1000

# 3. Browse results
./crawlobserver serve
# Open http://127.0.0.1:8899

That's it. CrawlObserver automatically downloads and manages its own database on first run (macOS & Linux). No Docker, no manual setup.

Advanced: You can also point CrawlObserver at an existing database instance (Docker, remote server...). See the Configuration section for clickhouse.* settings.

Web UI

Start the web interface with ./crawlobserver serve and open http://127.0.0.1:8899.

The UI gives you:

Session management — start, stop, resume, delete crawl sessions
Page explorer — filter and browse crawled pages by status code, title, depth, word count...
Tabs — overview, titles, meta, headings, images, indexability, response codes, internal links, external links
PageRank — distribution histogram, treemap by path, top-N pages
robots.txt tester — view robots.txt per host and test URL access
Sitemap viewer — discover and browse sitemap trees
Real-time progress — live crawl stats via Server-Sent Events
Theming — custom accent color, logo, dark mode
API key management — project-scoped keys for programmatic access

The UI is embedded in the single Go binary, so no Node.js runtime is needed in production.

CLI Reference

crawlobserver [command]

Command	Description
`crawl`	Start a crawl session
`serve`	Start the web UI
`gui`	Start the native desktop app (macOS)
`migrate`	Create or update database tables
`sessions`	List all crawl sessions
`report external-links`	Export external links (table or CSV)
`update`	Check for updates and self-update
`install-clickhouse`	Download database binary for offline use
`version`	Print version

Crawl examples

# Single seed URL
crawlobserver crawl --seed https://example.com

# Multiple seeds from file (one URL per line)
crawlobserver crawl --seeds-file urls.txt

# Fine-tune the crawl
crawlobserver crawl --seed https://example.com \
  --workers 20 \
  --delay 500ms \
  --max-pages 50000 \
  --max-depth 10 \
  --store-html

Reports

# External links as a table
crawlobserver report external-links --format table

# Export to CSV
crawlobserver report external-links --format csv > external-links.csv

# Filter by session
crawlobserver report external-links --session <session-id> --format csv

Configuration

Copy config.example.yaml to config.yaml:

cp config.example.yaml config.yaml

All settings can be overridden via environment variables with the CRAWLOBSERVER_ prefix (e.g. CRAWLOBSERVER_CRAWLER_WORKERS=20) or via CLI flags.
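
The naming pattern in that example can be sketched as a small Go helper. This assumes every dotted setting key maps the same way as `crawler.workers` does (dots become underscores, everything upper-cased); the README only confirms the prefix and that one example, and `envVarName` is a hypothetical function, not part of CrawlObserver.

```go
package main

import (
	"fmt"
	"strings"
)

// envVarName maps a dotted setting key such as "crawler.workers" to an
// environment variable name. Assumption: all keys follow the pattern of the
// single documented example, CRAWLOBSERVER_CRAWLER_WORKERS.
func envVarName(key string) string {
	return "CRAWLOBSERVER_" + strings.ToUpper(strings.ReplaceAll(key, ".", "_"))
}

func main() {
	fmt.Println(envVarName("crawler.workers")) // CRAWLOBSERVER_CRAWLER_WORKERS
}
```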

Key settings

Setting	Default	Description
`crawler.workers`	`10`	Concurrent fetch workers
`crawler.delay`	`1s`	Per-host request delay
`crawler.max_pages`	`0`	Max pages to crawl (0 = unlimited)
`crawler.max_depth`	`0`	Max crawl depth (0 = unlimited)
`crawler.timeout`	`30s`	HTTP request timeout
`crawler.user_agent`	`CrawlObserver/1.0`	User-Agent string
`crawler.respect_robots`	`true`	Obey robots.txt
`crawler.store_html`	`false`	Store raw HTML (ZSTD compressed)
`crawler.crawl_scope`	`host`	`host`, `domain` (eTLD+1), or `subdirectory`
`clickhouse.host`	`localhost`	Database host
`clickhouse.port`	`19000`	Database native protocol port
`clickhouse.mode`	(auto)	`managed`, `external`, or auto-detect
`server.port`	`8899`	Web UI port
`server.username`	`admin`	Basic auth username
`server.password`	(generated)	Basic auth password (random if not set)
`resources.max_memory_mb`	`0`	Memory soft limit (0 = auto)
`resources.max_cpu`	`0`	CPU limit / GOMAXPROCS (0 = all)

See config.example.yaml for the full reference.

Architecture

Seed URLs
    |
    v
Frontier  (priority queue, per-host delay, dedup)
    |
    v
Fetch Workers  (N goroutines, robots.txt cache, redirect tracking)
    |
    v
Parser  (goquery: 45+ SEO signals extracted)
    |
    v
Storage Buffer  (batch insert, configurable flush)
    |
    v
Columnar DB  (partitioned by crawl session, managed automatically)
    |
    |---> Web UI  (Svelte 5, embedded in binary)
    |---> REST API  (40+ endpoints)
    |---> CLI reports

Why a columnar database?

A crawl is a link graph, so why not a graph database? Because a crawler is an analytics pipeline, not a graph explorer. The questions you ask are analytical — "show me all pages with a missing H1 and a 301 canonical", "give me PageRank percentiles by subdirectory" — and columnar databases answer these instantly, even over millions of rows.

When we need graph algorithms (PageRank, crawl depth), we compute them in-memory in Go and write the results back. A million-page link graph fits in ~200MB of RAM and computes in seconds — no need for a graph database.
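
To make the in-memory approach concrete, here is a minimal PageRank sketch using textbook power iteration over an adjacency list. It is an assumption-laden illustration rather than CrawlObserver's implementation: the `pageRank` function, the 0.85 damping factor, the fixed iteration count, and the even redistribution of rank from pages without outlinks are all choices made for this example.

```go
package main

import "fmt"

// pageRank runs power iteration. links[i] lists the indices of the pages
// that page i links to. Rank held by pages with no outlinks is spread
// evenly over all pages, so the ranks always sum to 1.
func pageRank(links [][]int, damping float64, iterations int) []float64 {
	n := len(links)
	if n == 0 {
		return nil
	}
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1 / float64(n)
	}
	for it := 0; it < iterations; it++ {
		next := make([]float64, n)
		dangling := 0.0
		for src, outs := range links {
			if len(outs) == 0 {
				dangling += rank[src]
				continue
			}
			share := rank[src] / float64(len(outs))
			for _, dst := range outs {
				next[dst] += share
			}
		}
		// Teleport term plus the evenly spread rank of dangling pages.
		base := (1-damping)/float64(n) + damping*dangling/float64(n)
		for i := range next {
			next[i] = base + damping*next[i]
		}
		rank = next
	}
	return rank
}

func main() {
	// Page 0 links to 1, page 1 links to 0, page 2 links to 0.
	links := [][]int{{1}, {0}, {0}}
	for i, r := range pageRank(links, 0.85, 50) {
		fmt.Printf("page %d: %.4f\n", i, r)
	}
}
```

In this toy graph page 0 receives the most links and ends up with the highest rank, while page 2, which nothing links to, keeps only the teleport share.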

Under the hood, CrawlObserver uses ClickHouse in managed mode: it downloads a static binary and runs it as a subprocess. You see one program, while the crawler gets concurrent read/write access, columnar compression (~10:1), and instant session deletion.

Tech stack

Layer	Technology
Crawler engine	Go, `net/http`, goroutine pool, HTTP/2 (via `utls` ALPN negotiation)
TLS fingerprinting	`refraction-networking/utls` (Chrome/Firefox/Edge profiles)
HTML parsing	`goquery` (CSS selectors)
URL normalization	`purell` + custom rules
robots.txt	`temoto/robotstxt`
Storage	ClickHouse (via `clickhouse-go/v2`)
API keys / sessions	SQLite (`modernc.org/sqlite`)
Web UI	Svelte 5, Vite (zero runtime dependencies)
Desktop app	webview (macOS)
CLI	Cobra + Viper

API

The REST API is available when running crawlobserver serve. All endpoints are under /api/.

Sessions

Method	Endpoint	Description
`GET`	`/api/sessions`	List all sessions
`POST`	`/api/crawl`	Start a new crawl
`POST`	`/api/sessions/:id/stop`	Stop a running crawl
`POST`	`/api/sessions/:id/resume`	Resume a stopped crawl
`DELETE`	`/api/sessions/:id`	Delete a session and its data

Pages & Links

Method	Endpoint	Description
`GET`	`/api/sessions/:id/pages`	Crawled pages (paginated, filterable)
`GET`	`/api/sessions/:id/links`	External links
`GET`	`/api/sessions/:id/internal-links`	Internal links
`GET`	`/api/sessions/:id/page-detail?url=`	Full detail for one URL
`GET`	`/api/sessions/:id/page-html?url=`	Raw HTML body
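
Because the `page-detail` and `page-html` endpoints take the page URL as a query parameter, that value has to be percent-encoded. The Go sketch below shows one way to build such a request URL; `pageDetailURL` is a hypothetical helper and the session ID `abc` is a placeholder, not a real identifier format.

```go
package main

import (
	"fmt"
	"net/url"
)

// pageDetailURL builds the page-detail endpoint URL for one crawled page.
// url.Values.Encode percent-encodes pageURL so that its own "?" and "&"
// characters are not mistaken for API query parameters.
func pageDetailURL(baseURL, sessionID, pageURL string) string {
	q := url.Values{}
	q.Set("url", pageURL)
	return baseURL + "/api/sessions/" + url.PathEscape(sessionID) + "/page-detail?" + q.Encode()
}

func main() {
	// "abc" is a placeholder session ID used only for this illustration.
	fmt.Println(pageDetailURL("http://127.0.0.1:8899", "abc", "https://example.com/a?b=1&c=2"))
}
```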

Analytics

Method	Endpoint	Description
`GET`	`/api/sessions/:id/stats`	Session statistics
`GET`	`/api/sessions/:id/events`	Live progress (SSE)
`POST`	`/api/sessions/:id/compute-pagerank`	Compute internal PageRank
`POST`	`/api/sessions/:id/recompute-depths`	Recompute crawl depths
`GET`	`/api/sessions/:id/pagerank-top`	Top pages by PageRank
`GET`	`/api/sessions/:id/pagerank-distribution`	PageRank histogram

robots.txt & Sitemaps

Method	Endpoint	Description
`GET`	`/api/sessions/:id/robots-hosts`	Hosts with robots.txt
`GET`	`/api/sessions/:id/robots-content`	robots.txt content
`POST`	`/api/sessions/:id/robots-test`	Test URLs against robots.txt
`GET`	`/api/sessions/:id/sitemaps`	Discovered sitemaps

Authentication: Basic Auth or API key (X-API-Key header).
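
A minimal Go client using the API key header might look like the sketch below. The host and port are the defaults mentioned earlier, `YOUR_API_KEY` is a placeholder, `newSessionsRequest` is a hypothetical helper, and the response body is printed raw because its schema is not documented in this README.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// newSessionsRequest builds a GET /api/sessions request that authenticates
// with the X-API-Key header.
func newSessionsRequest(baseURL, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/sessions", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	// Placeholder key; create a real one in the web UI's API key management.
	req, err := newSessionsRequest("http://127.0.0.1:8899", "YOUR_API_KEY")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed (is `crawlobserver serve` running?):", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```

For Basic Auth, `req.SetBasicAuth(username, password)` from `net/http` could replace the header line.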

Contributing

We welcome contributions. Please read CONTRIBUTING.md before submitting anything.

TL;DR:

Open an issue before starting significant work
One PR = one thing (don't mix features and refactors)
Write tests for new code
Run make test && make lint before pushing
Follow existing code style — don't reorganize what you didn't change

Acknowledgments

Thanks to the people who helped shape CrawlObserver with their feedback, testing, and ideas:

Fabien Raquidel — referenceur-web.pro · @fabienr34

License

AGPL-3.0 — see LICENSE.

Built by SEObserver.

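Directories

Path	Synopsis
cmd/crawlobserver	command
internal/apikeys
internal/applog
internal/backup
internal/cli
internal/clickhouse
internal/config
internal/crawler
internal/customtests
internal/extraction
internal/fetcher
internal/frontier
internal/gsc
internal/normalizer
internal/parser
internal/providers
internal/renderer
internal/report
internal/seobserver
internal/server
internal/storage
internal/updater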
