web-researcher-mcp

module

v1.0.0 · Published: May 18, 2026 · License: MIT


Repository

github.com/zoharbabin/web-researcher-mcp


README

web-researcher-mcp

A production-grade MCP server that gives AI assistants the power to search the web, extract content, and conduct multi-source research.

Why Web Researcher MCP?

AI assistants are only as good as the information they can access. web-researcher-mcp bridges the gap between LLMs and the live internet through the Model Context Protocol standard:

8 specialized research tools in a single server
4 pluggable search backends (Google, Brave, Serper, SearXNG)
Tiered content extraction -- markdown negotiation, HTML parsing, headless browser, document parsing
Search lenses for domain-focused research (programming, news, legal, medical, and more)
Single static binary (~20MB) with zero runtime dependencies
Enterprise-ready with OAuth 2.1, multi-tenancy, rate limiting, and audit logging

Works with Claude Code, Claude Desktop, Cursor, and any MCP-compatible client.

Tools

Tool	Description
`web_search`	General web search with optional search lenses for domain-focused results
`scrape_page`	Extract content from any URL -- web pages, PDFs, DOCX, PPTX, YouTube transcripts
`search_and_scrape`	Combined search + extraction pipeline with quality scoring and deduplication
`image_search`	Search for images with size, type, color, and file format filters
`news_search`	Search news sources with freshness controls and source filtering
`academic_search`	Search academic papers via Scholar, arXiv, and PubMed
`patent_search`	Search patent databases with CPC classification for prior art and IP research
`sequential_search`	Multi-step research tracking with session state for iterative investigation

Quick Start

One-Line Install (Claude Code)

claude mcp add web-researcher -- go run github.com/zoharbabin/web-researcher-mcp/cmd/web-researcher-mcp@latest

Then set your API keys in ~/.claude/settings.json under the server's env block (see Connect to Your AI Assistant below).

Option 1: Download Binary

Download the latest release for your platform from Releases:

# macOS (Apple Silicon)
curl -L https://github.com/zoharbabin/web-researcher-mcp/releases/latest/download/web-researcher-mcp-darwin-arm64 -o web-researcher-mcp
chmod +x web-researcher-mcp

# macOS (Intel)
curl -L https://github.com/zoharbabin/web-researcher-mcp/releases/latest/download/web-researcher-mcp-darwin-amd64 -o web-researcher-mcp
chmod +x web-researcher-mcp

# Linux (x86_64)
curl -L https://github.com/zoharbabin/web-researcher-mcp/releases/latest/download/web-researcher-mcp-linux-amd64 -o web-researcher-mcp
chmod +x web-researcher-mcp

Option 2: Docker

docker run -e GOOGLE_CUSTOM_SEARCH_API_KEY=YOUR_KEY \
           -e GOOGLE_CUSTOM_SEARCH_ID=YOUR_CX \
           zoharbabin/web-researcher-mcp

Option 3: Build from Source

git clone https://github.com/zoharbabin/web-researcher-mcp.git
cd web-researcher-mcp
go build -o web-researcher-mcp ./cmd/web-researcher-mcp

Or install directly:

go install github.com/zoharbabin/web-researcher-mcp/cmd/web-researcher-mcp@latest

Connect to Your AI Assistant

Add this to your MCP client configuration (example for Claude Code ~/.claude/settings.json):

{
  "mcpServers": {
    "web-researcher": {
      "command": "/path/to/web-researcher-mcp",
      "env": {
        "GOOGLE_CUSTOM_SEARCH_API_KEY": "YOUR_GOOGLE_API_KEY",
        "GOOGLE_CUSTOM_SEARCH_ID": "YOUR_SEARCH_ENGINE_ID"
      }
    }
  }
}

Done. Your AI assistant now has access to all 8 research tools.

Configuration

Required

Variable	Description	How to Get
`GOOGLE_CUSTOM_SEARCH_API_KEY`	Google API key	Google Cloud Console
`GOOGLE_CUSTOM_SEARCH_ID`	Programmable Search Engine ID	PSE Console

Search Provider

Variable	Description	Default
`SEARCH_PROVIDER`	Backend: `google`, `brave`, `serper`, or `searxng`	`google`
`BRAVE_API_KEY`	Brave Search API key
`SERPER_API_KEY`	Serper.dev API key
`SEARXNG_URL`	SearXNG instance URL

HTTP Transport (Optional)

Variable	Description	Default
`PORT`	Enable HTTP/SSE mode	STDIO only
`OAUTH_ISSUER_URL`	JWT issuer URL for token validation
`OAUTH_AUDIENCE`	Expected JWT audience claim

All Environment Variables

Variable	Description	Default
`GOOGLE_CUSTOM_SEARCH_API_KEY`	Google API key	(required)
`GOOGLE_CUSTOM_SEARCH_ID`	Search engine ID	(required)
`SEARCH_PROVIDER`	Search backend	`google`
`BRAVE_API_KEY`	Brave Search API key
`SERPER_API_KEY`	Serper.dev API key
`SEARXNG_URL`	SearXNG instance URL
`PORT`	HTTP port (enables HTTP/SSE mode)	STDIO
`OAUTH_ISSUER_URL`	JWT issuer for auth
`OAUTH_AUDIENCE`	Expected JWT audience
`REDIS_URL`	Redis URL for shared cache/sessions	in-memory
`CACHE_TTL`	Cache time-to-live	`1h`
`CACHE_ENCRYPTION_KEY`	Encryption key for cached content (64 hex chars)
`RATE_LIMIT_RPM`	Requests per minute per client	`60`
`LOG_LEVEL`	Logging level: debug, info, warn, error	`info`
`LOG_FORMAT`	Log format: json, text	`json`
`METRICS_PORT`	Prometheus metrics endpoint port	disabled
`MAX_CONCURRENT_SCRAPES`	Concurrent scrape limit	`5`
`SCRAPE_TIMEOUT`	Per-scrape timeout	`30s`

Architecture

web-researcher-mcp/
├── cmd/web-researcher-mcp/     # Entry point (wiring only, ~50 lines)
├── internal/
│   ├── config/                 # Env-based strongly-typed configuration
│   ├── server/                 # MCP server lifecycle + signal handling
│   ├── tools/                  # Tool handlers (one file per tool)
│   ├── search/                 # Pluggable search providers + lens routing
│   ├── scraper/                # Tiered scraping pipeline
│   ├── documents/              # PDF, DOCX, PPTX parsing
│   ├── cache/                  # Hybrid cache (ristretto + disk + optional Redis)
│   ├── auth/                   # OAuth 2.1 middleware + JWKS
│   ├── session/                # Per-tenant session management
│   ├── content/                # Sanitize, dedup, truncate, quality score
│   ├── metrics/                # Prometheus metrics + per-tool stats
│   ├── ratelimit/              # Three-tier rate limiting
│   ├── circuit/                # Circuit breaker for external APIs
│   └── resources/              # MCP Resources + Prompts
├── lenses/                     # Search lens JSON files
└── docs/                       # Extended documentation

High-Level Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         MCP Protocol Layer                        │
│  ┌──────────────────┐              ┌─────────────────────────┐  │
│  │  STDIO Transport │              │  HTTP/SSE Transport     │  │
│  │  (zero-config)   │              │  (OAuth 2.1 + CORS)     │  │
│  └────────┬─────────┘              └──────────┬──────────────┘  │
│           └────────────────┬───────────────────┘                 │
│                    ┌───────▼───────┐                             │
│                    │  MCP Server   │                             │
│                    │  (go-sdk)     │                             │
│                    └───────┬───────┘                             │
└────────────────────────────┼─────────────────────────────────────┘
                             │
┌────────────────────────────┼─────────────────────────────────────┐
│                    Tool Dispatch Layer                             │
│  ┌─────────┐ ┌────────┐ ┌┴───────┐ ┌────────┐ ┌─────────────┐  │
│  │ Search  │ │ Scrape │ │Combined│ │Academic│ │ Sequential  │  │
│  │ Tools   │ │ Tool   │ │  Tool  │ │& Patent│ │  Research   │  │
│  └────┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └──────┬──────┘  │
└───────┼──────────┼───────────┼──────────┼─────────────┼──────────┘
        │          │           │          │             │
┌───────┼──────────┼───────────┼──────────┼─────────────┼──────────┐
│       │     Service Layer    │          │             │           │
│  ┌────▼────┐ ┌───▼────┐ ┌───▼───┐ ┌───▼────┐ ┌─────▼─────┐   │
│  │ Search  │ │Scraper │ │Quality│ │Citation│ │  Session   │   │
│  │Provider │ │Pipeline│ │Scorer │ │Extract │ │  Manager   │   │
│  └────┬────┘ └───┬────┘ └───────┘ └────────┘ └────────────┘   │
│       │          │                                               │
│  ┌────▼─────┐ ┌─▼──────────────────────────────────┐           │
│  │ Brave    │ │  Scraper Tiers                      │           │
│  │ Google   │ │  markdown > HTML > browser > docs   │           │
│  │ Serper   │ │  + YouTube transcripts              │           │
│  │ SearXNG  │ └─────────────────────────────────────┘           │
│  └──────────┘                                                    │
└──────────────────────────────────────────────────────────────────┘
        │          │
┌───────┼──────────┼──────────────────────────────────────────────┐
│       │   Infrastructure Layer                                    │
│  ┌────▼────┐ ┌───▼────┐ ┌─────────┐ ┌────────┐ ┌───────────┐  │
│  │  Cache  │ │  SSRF  │ │  Rate   │ │Metrics │ │   Audit   │  │
│  │(hybrid) │ │Protect │ │ Limiter │ │(Prom.) │ │   Logger  │  │
│  └─────────┘ └────────┘ └─────────┘ └────────┘ └───────────┘  │
│  ┌──────────────────┐  ┌──────────────────────────────────────┐ │
│  │  Circuit Breaker  │  │  Content Pipeline (sanitize, dedup,  │ │
│  │                   │  │  truncate, quality score)             │ │
│  └───────────────────┘  └──────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Design Principles

Zero global state -- all dependencies injected via constructors
Interface-driven -- every external dependency behind an interface for testing and swapping
Bounded concurrency -- explicit semaphores for external API calls
Defense in depth -- SSRF protection, rate limiting, content sanitization at every layer
Fail loud -- errors returned, never swallowed; validation at boundaries
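
The "bounded concurrency" principle can be sketched with a buffered channel used as a counting semaphore. This is a generic Go pattern and not the project's actual code: the `scrapeAll` function and its `fetch` parameter are hypothetical, and the returned peak value is only there to make the bound observable.

```go
package main

import (
	"fmt"
	"sync"
)

// scrapeAll calls fetch for every URL with at most limit calls in flight.
// It returns results in input order plus the peak concurrency it observed.
func scrapeAll(urls []string, limit int, fetch func(string) string) ([]string, int) {
	if limit < 1 {
		limit = 1
	}
	results := make([]string, len(urls))
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			results[i] = fetch(u) // each goroutine writes its own index

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(i, u)
	}
	wg.Wait()
	return results, peak
}

func main() {
	urls := []string{"https://a.example", "https://b.example", "https://c.example"}
	// The fetch function is a stub; a real one would perform the scrape.
	results, peak := scrapeAll(urls, 2, func(u string) string { return "content of " + u })
	fmt.Println(len(results), "results, peak concurrency", peak)
}
```

Because the function receives `fetch` as an argument instead of reaching for a package-level client, the same sketch also reflects the constructor-injection and interface-driven principles in the list above.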

Search Providers

The server supports four search backends. Lens-restricted and site-restricted queries always go through Google PSE, which is free and works indefinitely in site-restricted mode. Unrestricted whole-web searches go to the configured provider.

Provider	Cost per 1K Queries	Whole-Web	Images	News	Notes
Google PSE	Free (100/day) to $5	Until 2027	Yes	Yes	Default; always used for lenses
Brave Search	$5 (free tier available)	Yes	Yes	Yes	Recommended for whole-web
Serper.dev	$0.30-$1	Yes	Yes	Yes	Google-identical results
SearXNG	Free (self-hosted)	Yes	Yes	Yes	Privacy-first, air-gapped deployments

Routing Logic

Request arrives
  |-- lens specified?     --> Google PSE (site-restricted, free forever)
  |-- site: param set?    --> Google PSE (site-restricted)
  `-- unrestricted?       --> Configured SEARCH_PROVIDER
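
The routing decision above is small enough to express as a single function. The sketch below is an assumption about its shape: the `Request` type and `routeProvider` name are hypothetical, and only the three branches come from the diagram.

```go
package main

import "fmt"

// Request carries the fields that influence routing.
// The type is illustrative, not the server's actual request schema.
type Request struct {
	Query string
	Lens  string // name of a search lens, empty if none
	Site  string // site restriction, empty if none
}

// routeProvider returns the backend that should serve the request.
// Lens- and site-restricted queries go to Google PSE; everything else
// goes to the backend named by SEARCH_PROVIDER.
func routeProvider(req Request, configured string) string {
	if req.Lens != "" || req.Site != "" {
		return "google"
	}
	return configured
}

func main() {
	fmt.Println(routeProvider(Request{Query: "golang context", Lens: "programming"}, "brave"))
	fmt.Println(routeProvider(Request{Query: "golang context"}, "brave"))
}
```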

Provider Setup Examples

Brave Search (recommended for whole-web):

export SEARCH_PROVIDER=brave
export BRAVE_API_KEY=BSAxxxxxxxxxx
export GOOGLE_CUSTOM_SEARCH_API_KEY=AIza...  # still needed for lenses
export GOOGLE_CUSTOM_SEARCH_ID=017...

SearXNG (self-hosted, privacy-first):

export SEARCH_PROVIDER=searxng
export SEARXNG_URL=http://localhost:8080
export GOOGLE_CUSTOM_SEARCH_API_KEY=AIza...
export GOOGLE_CUSTOM_SEARCH_ID=017...

Google PSE only (simplest setup):

export GOOGLE_CUSTOM_SEARCH_API_KEY=AIza...
export GOOGLE_CUSTOM_SEARCH_ID=017...
# SEARCH_PROVIDER defaults to "google"

Search Lenses

Search lenses are curated domain lists that focus search results on high-quality sources for specific topics. They route through Google PSE in site-restricted mode, which is free and works indefinitely.

Built-in Lenses

Lens	Focus	Example Domains
`programming`	Code docs, tutorials, Q&A	stackoverflow.com, github.com, developer.mozilla.org
`news`	Current events, journalism	reuters.com, apnews.com, bbc.com, nytimes.com
`tech`	Technology industry	arstechnica.com, techcrunch.com, theverge.com
`legal`	Law, cases, statutes	law.cornell.edu, courtlistener.com, justia.com
`medical`	Health, medicine	nih.gov, mayoclinic.org, who.int, pubmed.ncbi.nlm.nih.gov
`finance`	Markets, filings	sec.gov, bloomberg.com, investopedia.com
`science`	Research, papers	nature.com, science.org, nasa.gov
`government`	Policy, regulations	*.gov, europa.eu, gov.uk, un.org

Usage Example

{
  "tool": "web_search",
  "arguments": {
    "query": "golang context best practices",
    "lens": "programming"
  }
}

This searches only stackoverflow.com, github.com, go.dev, developer.mozilla.org, and other curated programming sites.

Creating Custom Lenses

Add a JSON file to the lenses/ directory:

{
  "name": "my-custom-lens",
  "description": "Description of what this lens covers",
  "domains": [
    "example.com",
    "docs.example.org",
    "*.trusted-source.io"
  ],
  "cx": ""
}

Fields:

domains -- Up to 5,000 URL patterns per lens (Google PSE limit)
cx -- Optional dedicated PSE engine ID. If empty, site: operators are injected at query time (limited to ~10 domains per query)
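
To make the two modes concrete, the following sketch parses a lens file and, when `cx` is empty, appends `site:` operators to the query. The `Lens` struct follows the JSON fields shown above. The `parseLens` and `buildQuery` helpers, the `OR` grouping, and the stripping of a leading `*.` from wildcard patterns are assumptions made for this example and may differ from what the server does.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Lens matches the JSON layout of a lens file.
type Lens struct {
	Name        string   `json:"name"`
	Description string   `json:"description"`
	Domains     []string `json:"domains"`
	CX          string   `json:"cx"`
}

// maxInlineDomains reflects the approximate per-query limit noted above.
const maxInlineDomains = 10

// parseLens decodes a lens definition and rejects incomplete ones.
func parseLens(data []byte) (Lens, error) {
	var l Lens
	if err := json.Unmarshal(data, &l); err != nil {
		return Lens{}, fmt.Errorf("parse lens: %w", err)
	}
	if l.Name == "" || len(l.Domains) == 0 {
		return Lens{}, errors.New("lens needs a name and at least one domain")
	}
	return l, nil
}

// buildQuery returns the query text to send for a lens. With a dedicated
// engine (cx set) the query is unchanged; otherwise site: operators are
// appended for at most maxInlineDomains domains.
func buildQuery(query string, l Lens) string {
	if l.CX != "" {
		return query
	}
	domains := l.Domains
	if len(domains) > maxInlineDomains {
		domains = domains[:maxInlineDomains]
	}
	parts := make([]string, 0, len(domains))
	for _, d := range domains {
		// Assumption: a "*.example.io" pattern is sent as "site:example.io".
		parts = append(parts, "site:"+strings.TrimPrefix(d, "*."))
	}
	return query + " (" + strings.Join(parts, " OR ") + ")"
}

func main() {
	raw := []byte(`{"name":"my-custom-lens","domains":["example.com","*.trusted-source.io"],"cx":""}`)
	lens, err := parseLens(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(buildQuery("rate limiting", lens))
}
```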

Security

SSRF Protection

The server implements a custom DialContext that validates all resolved IPs before connecting:

Blocks all private/reserved IP ranges (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, fc00::/7)
Blocks cloud metadata endpoints (169.254.169.254)
Guards against DNS rebinding by connecting only to the first resolved IP
Re-validates redirect targets at each hop
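
The checks above can be approximated with the standard library alone. The sketch below is not the server's implementation: it uses the `Control` hook of `net.Dialer`, which runs after DNS resolution and just before each connection attempt, instead of a hand-written DialContext. The helper names are hypothetical, and the loopback `::1/128` and link-local `fe80::/10` and `169.254.0.0/16` entries are additions assumed for this example (the last one covers the metadata address `169.254.169.254`).

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

// blockedNets lists ranges that outbound requests must never reach.
var blockedNets = mustParseCIDRs(
	"127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
	"169.254.0.0/16", // link-local, includes the cloud metadata address
	"::1/128", "fc00::/7", "fe80::/10",
)

// mustParseCIDRs panics on a malformed entry; the list is static.
func mustParseCIDRs(cidrs ...string) []*net.IPNet {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

// isBlockedIP reports whether s is unparseable or falls in a blocked range.
// Unparseable input is treated as blocked (fail closed).
func isBlockedIP(s string) bool {
	ip := net.ParseIP(s)
	if ip == nil {
		return true
	}
	for _, n := range blockedNets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// checkAddress is a net.Dialer Control hook. It receives the resolved
// ip:port that is about to be dialed and rejects blocked targets.
func checkAddress(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	if isBlockedIP(host) {
		return fmt.Errorf("ssrf: refusing to connect to %s", host)
	}
	return nil
}

// newSafeClient returns an HTTP client whose connections all pass checkAddress.
func newSafeClient() *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Control: checkAddress}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   30 * time.Second,
	}
}

func main() {
	_ = newSafeClient() // no request is sent in this demo
	for _, ip := range []string{"127.0.0.1", "169.254.169.254", "8.8.8.8"} {
		fmt.Printf("%-16s blocked=%v\n", ip, isBlockedIP(ip))
	}
}
```

Because the hook sees the address actually being dialed, a hostname that resolves to a private IP is rejected at connect time, and every new connection opened while following a redirect passes through the same check.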

Authentication and Authorization

In HTTP mode, the server supports OAuth 2.1 with:

JWKS-based token validation with automatic key rotation
Per-tenant session isolation
Audience and issuer validation
Configurable claim extraction for multi-tenancy

Rate Limiting

Three-tier rate limiting protects both the server and upstream APIs:

Per-client -- token bucket per authenticated session
Per-provider -- prevents exceeding upstream API quotas
Global -- server-wide backpressure valve
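
A minimal sketch of how the three tiers could compose, assuming each tier is a token bucket. This is illustrative only: the `bucket` and `limiter` types are hypothetical, time is passed in as seconds so the logic is easy to test, and a request rejected by a later tier still consumes a token from the earlier ones, a simplification a real implementation may want to avoid.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a token bucket that holds up to rpm tokens and refills
// at rpm tokens per minute.
type bucket struct {
	mu     sync.Mutex
	rpm    float64
	tokens float64
	last   float64 // time of the last refill, in seconds
}

func newBucket(rpm int, nowSec float64) *bucket {
	return &bucket{rpm: float64(rpm), tokens: float64(rpm), last: nowSec}
}

// allowAt refills the bucket for the elapsed time, then takes one token
// if one is available.
func (b *bucket) allowAt(nowSec float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := nowSec - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rpm / 60
		if b.tokens > b.rpm {
			b.tokens = b.rpm
		}
		b.last = nowSec
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// limiter checks a per-client, a per-provider, and a global bucket.
type limiter struct {
	mu                     sync.Mutex
	clientRPM, providerRPM int
	clients, providers     map[string]*bucket
	global                 *bucket
}

func newLimiter(clientRPM, providerRPM, globalRPM int, nowSec float64) *limiter {
	return &limiter{
		clientRPM:   clientRPM,
		providerRPM: providerRPM,
		clients:     make(map[string]*bucket),
		providers:   make(map[string]*bucket),
		global:      newBucket(globalRPM, nowSec),
	}
}

// get returns the bucket for key, creating a full one on first use.
func (l *limiter) get(m map[string]*bucket, key string, rpm int, nowSec float64) *bucket {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := m[key]
	if !ok {
		b = newBucket(rpm, nowSec)
		m[key] = b
	}
	return b
}

// allowAt admits the request only if all three tiers have capacity.
func (l *limiter) allowAt(client, provider string, nowSec float64) bool {
	return l.get(l.clients, client, l.clientRPM, nowSec).allowAt(nowSec) &&
		l.get(l.providers, provider, l.providerRPM, nowSec).allowAt(nowSec) &&
		l.global.allowAt(nowSec)
}

func main() {
	now := float64(time.Now().UnixNano()) / 1e9
	l := newLimiter(60, 600, 6000, now) // 60 matches the RATE_LIMIT_RPM default
	fmt.Println("allowed:", l.allowAt("client-a", "brave", now))
}
```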

Content Safety

HTML sanitization via whitelist-based policy (bluemonday)
Paragraph-level deduplication across scraped results
Smart truncation at natural content breakpoints
Quality scoring to filter low-value results before returning to the LLM
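
Paragraph-level deduplication, for instance, could look roughly like the sketch below. It is an assumption about the approach rather than the project's code: the `dedupParagraphs` and `normalize` helpers are hypothetical, paragraphs are taken to be separated by blank lines, and two paragraphs count as duplicates when they match after lowercasing and whitespace collapsing.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases a paragraph and collapses runs of whitespace so
// that trivially different copies compare equal.
func normalize(p string) string {
	return strings.ToLower(strings.Join(strings.Fields(p), " "))
}

// dedupParagraphs removes paragraphs that already appeared earlier,
// either in the same document or in a previous one. The first
// occurrence is kept with its original wording.
func dedupParagraphs(docs []string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0, len(docs))
	for _, doc := range docs {
		var kept []string
		for _, p := range strings.Split(doc, "\n\n") {
			key := normalize(p)
			if key == "" || seen[key] {
				continue // skip blank and repeated paragraphs
			}
			seen[key] = true
			kept = append(kept, strings.TrimSpace(p))
		}
		out = append(out, strings.Join(kept, "\n\n"))
	}
	return out
}

func main() {
	docs := []string{
		"Page one intro.\n\nSubscribe to our newsletter.",
		"Page two intro.\n\nsubscribe to our   newsletter.",
	}
	for i, d := range dedupParagraphs(docs) {
		fmt.Printf("doc %d: %q\n", i, d)
	}
}
```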

For the full threat model and security architecture, see docs/SECURITY.md.

MCP Client Integration

Claude Code

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "web-researcher": {
      "command": "/path/to/web-researcher-mcp",
      "env": {
        "GOOGLE_CUSTOM_SEARCH_API_KEY": "AIza...",
        "GOOGLE_CUSTOM_SEARCH_ID": "017...",
        "SEARCH_PROVIDER": "brave",
        "BRAVE_API_KEY": "BSA..."
      }
    }
  }
}

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "web-researcher": {
      "command": "/path/to/web-researcher-mcp",
      "env": {
        "GOOGLE_CUSTOM_SEARCH_API_KEY": "AIza...",
        "GOOGLE_CUSTOM_SEARCH_ID": "017..."
      }
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root:

{
  "mcpServers": {
    "web-researcher": {
      "command": "/path/to/web-researcher-mcp",
      "env": {
        "GOOGLE_CUSTOM_SEARCH_API_KEY": "AIza...",
        "GOOGLE_CUSTOM_SEARCH_ID": "017..."
      }
    }
  }
}

HTTP/SSE Mode (Multi-Client, Teams)

For shared deployments serving multiple clients or web applications:

PORT=3000 \
OAUTH_ISSUER_URL=https://auth.example.com \
OAUTH_AUDIENCE=https://api.example.com \
./web-researcher-mcp

Connect any MCP client to http://localhost:3000/sse.

Docker Compose Example

version: "3.8"
services:
  web-researcher:
    image: zoharbabin/web-researcher-mcp
    ports:
      - "3000:3000"
    environment:
      PORT: "3000"
      GOOGLE_CUSTOM_SEARCH_API_KEY: ${GOOGLE_CUSTOM_SEARCH_API_KEY}
      GOOGLE_CUSTOM_SEARCH_ID: ${GOOGLE_CUSTOM_SEARCH_ID}
      SEARCH_PROVIDER: brave
      BRAVE_API_KEY: ${BRAVE_API_KEY}
      REDIS_URL: redis://redis:6379
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

Performance

Operation	Expected Latency	Notes
Search (cache hit)	< 1ms	Direct return from in-memory cache
Search (API call)	200-500ms	Circuit-breaker protected
Scrape (markdown)	100-300ms	Fastest tier via content negotiation
Scrape (HTML)	500-2000ms	goquery-based extraction
Scrape (browser)	2-10s	Headless Chrome, at most 3 concurrent browser scrapes
search_and_scrape	2-15s	Parallel scrape with semaphore (max 5)
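
The latency figures above suggest trying the cheapest scrape tier first and escalating only when it fails. A minimal sketch of that fallback loop follows, under the assumption that each tier can be modeled as a function from URL to content. The `tier` type and `scrape` function are hypothetical, and the extractors in `main` are stubs rather than real markdown, HTML, or browser scrapers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tier is one extraction strategy, ordered from cheapest to most expensive.
type tier struct {
	name    string
	extract func(url string) (string, error)
}

// scrape tries each tier in order and returns the first non-empty result
// together with the name of the tier that produced it. If every tier
// fails, the individual errors are joined into one.
func scrape(url string, tiers []tier) (content, used string, err error) {
	if len(tiers) == 0 {
		return "", "", errors.New("no scrape tiers configured")
	}
	var errs []error
	for _, t := range tiers {
		c, e := t.extract(url)
		if e == nil && strings.TrimSpace(c) != "" {
			return c, t.name, nil
		}
		if e == nil {
			e = errors.New("empty content")
		}
		errs = append(errs, fmt.Errorf("%s: %w", t.name, e))
	}
	return "", "", errors.Join(errs...)
}

func main() {
	tiers := []tier{
		{"markdown", func(string) (string, error) { return "", errors.New("server did not return markdown") }},
		{"html", func(string) (string, error) { return "<extracted article text>", nil }},
		{"browser", func(string) (string, error) { return "<rendered page text>", nil }},
	}
	content, used, err := scrape("https://example.com/post", tiers)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Printf("tier=%s content=%q\n", used, content)
}
```

Joining the per-tier errors keeps the failure reason of every attempt visible to the caller instead of reporting only the last one.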

Development

# Build
go build -o web-researcher-mcp ./cmd/web-researcher-mcp

# Run all tests
go test ./...

# Tests with race detector
go test -race ./...

# E2E tests
go test -v ./tests/e2e/...

# Benchmarks
go test -bench=. ./tests/benchmark/

# Lint
golangci-lint run

# Security audit
govulncheck ./...

# Production build (static, stripped)
CGO_ENABLED=0 go build -ldflags="-s -w" -o web-researcher-mcp ./cmd/web-researcher-mcp

Contributing

Contributions are welcome. Please see docs/CONTRIBUTING.md for code style guidelines, development workflow, and how to submit pull requests.

Documentation

Document	Description
ARCHITECTURE.md	Full architecture, design decisions, technology stack
docs/TOOLS.md	Detailed tool specifications and parameter schemas
docs/SECURITY.md	Threat model, SSRF, authentication, content safety
docs/SEARCH_PROVIDERS.md	Provider system, lenses, routing, migration plan
docs/DEPLOYMENT.md	Build, Docker, Kubernetes, scaling
docs/TESTING.md	Test strategy and patterns
docs/COMPLIANCE.md	SOC2, GDPR, FedRAMP compliance
docs/GO_MODULE.md	Every dependency with rationale

License

MIT

Built with Go and the Model Context Protocol

If this project helps your workflow, consider giving it a star.

Directories

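Path	Synopsis
cmd/web-researcher-mcp	command
internal/audit
internal/auth
internal/cache
internal/circuit
internal/config
internal/content
internal/documents
internal/metrics
internal/ratelimit
internal/resources
internal/scraper
internal/search
internal/server
internal/session
internal/tools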
