metawebsearch

package module
v0.0.0-...-2b51795 Latest
Published: Apr 1, 2026 License: MIT Imports: 22 Imported by: 0

README

metawebsearch

Multi-engine web search library for Go. Queries DuckDuckGo, Brave, Mojeek, Yahoo, Yandex, Wikipedia, and Grokipedia behind a common interface, with concurrent multi-engine dispatch, URL deduplication, and browser-grade TLS fingerprinting.

Inspired by deedy5/ddgs.

Installation

go get github.com/jcalvert/metawebsearch

Requires Go 1.24+.

Quick Start

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    mws "github.com/jcalvert/metawebsearch"
)

func main() {
    client, err := mws.NewClient(mws.ClientOpts{})
    if err != nil {
        log.Fatal(err)
    }

    ms := mws.MultiSearch{
        Client:  client,
        Engines: mws.AllEngines(),
    }

    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()

    sr, err := ms.Search(ctx, "golang concurrency", mws.SearchOpts{MaxResults: 10})
    if err != nil {
        log.Fatal(err)
    }

    for _, r := range sr.Results {
        fmt.Printf("[%s] %s\n  %s\n\n", r.Engine, r.Title, r.URL)
    }

    // Check for per-engine errors (partial failure is normal)
    for name, err := range sr.Errors {
        fmt.Printf("  %s: %v\n", name, err)
    }
}

Single Engine

Use Execute to query one engine directly:

results, err := mws.Execute(
    ctx,
    client,
    mws.DuckDuckGo,
    "search query",
    mws.SearchOpts{MaxResults: 10},
)

Engines

AllEngines() returns the default set:

Engine      Name          MinDelay  Notes
DuckDuckGo  "duckduckgo"  2s        POST to html.duckduckgo.com
Brave       "brave"       2s        Cookie-based region/safesearch
Mojeek      "mojeek"      2s        Independent search index
Yahoo       "yahoo"       2s        Unwraps /RU= redirect URLs
Yandex      "yandex"      2s        Uses /search/site/ endpoint
Wikipedia   "wikipedia"   1s        OpenSearch JSON API
Grokipedia  "grokipedia"  1s        Typeahead JSON API

Google is available via EngineByName("google") but excluded from the default set. As of early 2026, Google requires JavaScript execution to serve search results, which breaks all HTTP-based scrapers (including the reference Python implementation). The Google engine code is maintained in the repo for when a workaround is found.

Picking Specific Engines

// Use only DuckDuckGo and Brave
ms := mws.MultiSearch{
    Client:  client,
    Engines: []mws.EngineConfig{mws.DuckDuckGo, mws.Brave},
}

Runtime Lookup

eng, ok := mws.EngineByName("brave")
if ok {
    results, err := mws.Execute(ctx, client, eng, "query", mws.SearchOpts{})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%d results\n", len(results))
}

Rate Limiting and Retries

Each engine has built-in rate limiting and retry behavior. These are configured via fields on EngineConfig.

How Rate Limiting Works

The Execute pipeline enforces a per-engine minimum delay between requests. If you call Execute for the same engine twice in quick succession, the second call blocks until MinDelay has elapsed since the last request to that engine.

Rate limiting is global per engine name (not per client), so multiple goroutines sharing the same engine will respect the same rate limit.
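The behavior described above can be sketched with a mutex-guarded map keyed by engine name. This is a minimal illustration, not the package's actual implementation: the rateLimiter type and its reserve and wait methods are hypothetical names, and making wait context-aware is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// rateLimiter tracks the next free request slot per engine name.
// Hypothetical type for illustration; not part of the package's API.
type rateLimiter struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newRateLimiter() *rateLimiter {
	return &rateLimiter{last: make(map[string]time.Time)}
}

// reserve claims the next slot for the named engine and returns how long
// the caller must wait before sending its request.
func (l *rateLimiter) reserve(name string, minDelay time.Duration) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	next := l.last[name].Add(minDelay)
	if next.Before(now) {
		next = now
	}
	// Record the slot before sleeping so concurrent callers queue up.
	l.last[name] = next
	return next.Sub(now)
}

// wait blocks until the reserved slot arrives or ctx is done.
func (l *rateLimiter) wait(ctx context.Context, name string, minDelay time.Duration) error {
	timer := time.NewTimer(l.reserve(name, minDelay))
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	l := newRateLimiter()
	start := time.Now()
	for i := 0; i < 3; i++ {
		if err := l.wait(context.Background(), "brave", 50*time.Millisecond); err != nil {
			fmt.Println("wait:", err)
			return
		}
		fmt.Printf("request %d after ~%dms\n", i, time.Since(start).Milliseconds())
	}
}
```

Because the map is keyed by name rather than by client, every goroutine that calls wait with "brave" shares one queue, while a call for "mojeek" is not delayed by it.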

How Retries Work

When an engine returns a retryable HTTP status (default: 202, 429, 503), Execute retries with exponential backoff:

  • First retry waits at least 5 seconds (or MinDelay, whichever is greater)
  • Each subsequent retry doubles the wait: 5s, 10s, 20s...
  • After MaxRetries attempts (default: 3), it returns an error
  • Retries respect the context — a cancelled context aborts immediately
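
The schedule in the list above can be written as a small function. This is a sketch under stated assumptions: backoffDelay and minBackoff are hypothetical names, the doubling is assumed to start from the larger of 5 seconds and MinDelay, and the special case for engines with a sub-second MinDelay is not modeled.

```go
package main

import (
	"fmt"
	"time"
)

// minBackoff is the 5-second floor described in the text.
const minBackoff = 5 * time.Second

// backoffDelay returns the wait before the given retry (0-based).
// Assumption: doubling starts from max(minBackoff, minDelay).
func backoffDelay(retry int, minDelay time.Duration) time.Duration {
	base := minBackoff
	if minDelay > base {
		base = minDelay
	}
	return base << uint(retry)
}

func main() {
	// With MinDelay = 2s the floor applies: 5s, 10s, 20s.
	for retry := 0; retry < 3; retry++ {
		fmt.Println(backoffDelay(retry, 2*time.Second))
	}
}
```

With the default MaxRetries of 3 and a 2-second MinDelay, this schedule adds up to 35 seconds of waiting before the final error, which is worth keeping in mind when choosing a context timeout.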

Customizing an Engine

Every field on EngineConfig is public. Copy a built-in engine and adjust:

// Slow down Brave to avoid rate limiting
myBrave := mws.Brave
myBrave.MinDelay = 5 * time.Second
myBrave.MaxRetries = 5

// Make DuckDuckGo retry on 403
myDDG := mws.DuckDuckGo
myDDG.RetryableStatus = func(code int) bool {
    return code == 202 || code == 403 || code == 429 || code == 503
}

ms := mws.MultiSearch{
    Client:  client,
    Engines: []mws.EngineConfig{myBrave, myDDG, mws.Mojeek},
}

Engine Defaults

Field            Default (if zero)
MinDelay         2s (used for rate limiting between calls)
MaxRetries       3
RetryableStatus  202, 429, 503

The minimum retry backoff is 5 seconds for engines with MinDelay >= 1s. This prevents hammering search engines during retry loops.

Search Options

opts := mws.SearchOpts{
    MaxResults: 10,          // max results to request (engine-dependent)
    Region:     "us-en",     // country-language code
    SafeSearch: "moderate",  // "on", "moderate", "off"
    TimeLimit:  "w",         // "d" (day), "w" (week), "m" (month), "y" (year)
    Page:       1,           // 1-based page number
}

Not all engines support all options. Unsupported options are silently ignored.

Option      DuckDuckGo  Brave  Mojeek  Yahoo  Yandex  Wikipedia  Grokipedia
MaxResults  -           -      -       -      -       yes        yes
Region      -           yes    yes     -      -       yes        -
SafeSearch  -           yes    yes     -      -       -          -
TimeLimit   -           yes    -       yes    -       -          -
Page        -           -      -       yes    yes     -          -

MultiSearch Behavior

MultiSearch.Search dispatches all engines concurrently and returns a *SearchResult:

type SearchResult struct {
    Results []Result            // deduplicated by URL, ordered by engine
    Errors  map[string]error    // per-engine errors (partial failure)
}
  • Partial failure: If some engines fail and others succeed, you get results from the successful ones plus error details for the failed ones. Search itself never returns an error.
  • Deduplication: Results are deduplicated by URL. When the same URL appears from multiple engines, the first engine (by order in the Engines slice) wins.
  • Ordering: Results are grouped by engine in the order engines appear in the Engines slice.
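
The deduplication and ordering rules above can be sketched as a single merge pass. This is an illustration only: mergeResults is a hypothetical helper, the Result type is a local copy of the documented fields, and URLs are compared as exact strings, which may differ from how the package normalizes them.

```go
package main

import "fmt"

// Result is a local copy of the documented Result fields.
type Result struct {
	Title, URL, Snippet, Engine string
}

// mergeResults walks engines in slice order and keeps only the first
// occurrence of each URL, so earlier engines win ties.
func mergeResults(order []string, byEngine map[string][]Result) []Result {
	seen := make(map[string]bool)
	var merged []Result
	for _, name := range order {
		for _, r := range byEngine[name] {
			if seen[r.URL] {
				continue
			}
			seen[r.URL] = true
			merged = append(merged, r)
		}
	}
	return merged
}

func main() {
	byEngine := map[string][]Result{
		"duckduckgo": {{Title: "A", URL: "https://a.example/", Engine: "duckduckgo"}},
		"brave": {
			{Title: "A again", URL: "https://a.example/", Engine: "brave"},
			{Title: "B", URL: "https://b.example/", Engine: "brave"},
		},
	}
	for _, r := range mergeResults([]string{"duckduckgo", "brave"}, byEngine) {
		fmt.Printf("[%s] %s\n", r.Engine, r.URL)
	}
}
```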

TLS Client

All HTTP requests go through bogdanfinn/tls-client, which impersonates real browser TLS fingerprints (JA3, HTTP/2 SETTINGS, header order). This is critical for avoiding bot detection.

// Default Chrome profile
client, err := mws.NewClient(mws.ClientOpts{})

// Or pick a specific browser profile
client, err = mws.NewClient(mws.ClientOpts{
    BrowserProfile: "chrome_131",
})

See tls-client profiles for available profile names.

Browser-like headers (Accept, Accept-Language, Sec-Ch-Ua, Sec-Fetch-*, etc.) are automatically added to every request by the Execute pipeline. Engine-specific headers set in BuildRequest take precedence.
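
The precedence rule above (engine-specific headers win over defaults) can be sketched as follows. The header values, the defaultHeaders map, and the applyDefaults and newExampleRequest helpers are assumptions made for illustration; the package's real header set is not shown in this document.

```go
package main

import (
	"fmt"
	"net/http"
)

// defaultHeaders is an illustrative subset; the values are assumptions,
// not the package's actual header set.
var defaultHeaders = map[string]string{
	"Accept":          "text/html,application/xhtml+xml",
	"Accept-Language": "en-US,en;q=0.9",
}

// applyDefaults fills in browser-like headers without overwriting any
// header the engine's request builder already set.
func applyDefaults(req *http.Request) {
	for k, v := range defaultHeaders {
		if req.Header.Get(k) == "" {
			req.Header.Set(k, v)
		}
	}
}

// newExampleRequest stands in for an engine's BuildRequest that sets its
// own Accept header.
func newExampleRequest(accept string) *http.Request {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/search?q=go", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept", accept)
	return req
}

func main() {
	req := newExampleRequest("application/json")
	applyDefaults(req)
	fmt.Println(req.Header.Get("Accept"))          // engine value kept
	fmt.Println(req.Header.Get("Accept-Language")) // default filled in
}
```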

Result Type

type Result struct {
    Title   string `json:"title"`
    URL     string `json:"url"`
    Snippet string `json:"snippet"`
    Engine  string `json:"engine"`
}

Results include JSON tags for easy serialization.
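
As a quick illustration of those tags, the sketch below marshals a Result with encoding/json. The Result type is a local copy of the struct shown above, and encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Result is a local copy of the documented struct, including its tags.
type Result struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Snippet string `json:"snippet"`
	Engine  string `json:"engine"`
}

// encode returns the JSON form of a single result.
func encode(r Result) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(Result{
		Title:   "Go",
		URL:     "https://go.dev/",
		Snippet: "The Go programming language",
		Engine:  "duckduckgo",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```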

Test Binary

A simple CLI is included for manual testing:

go run ./cmd/search/ "your query here"

Outputs JSON with results grouped by engine and any errors.

Testing

Unit tests (no network, fast):

go test ./...

With race detector:

go test -race ./...

Integration tests (live HTTP, hits real search engines):

go test -tags=integration ./... -v

License

MIT. See LICENSE.

Documentation

Overview

brave.go

Package metawebsearch provides a multi-engine web search library for Go.

It scrapes Google, DuckDuckGo, Brave, Mojeek, Yahoo, Yandex, Wikipedia, and Grokipedia behind a common EngineConfig type. Engines can be used individually via Execute or concurrently via MultiSearch.

Browser impersonation (TLS + HTTP/2 fingerprinting) is handled by tls-client, wrapped behind the HTTPClient interface for testability.

duckduckgo.go

engine.go

extract.go

google.go

grokipedia.go

mojeek.go

multi.go

registry.go

result.go

wikipedia.go

yahoo.go

yandex.go

Index

Constants

This section is empty.

Variables

var Brave = EngineConfig{
	Name:            "brave",
	MinDelay:        2 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    braveBuildRequest,
	ParseResponse:   braveParseResponse,
}

Brave is the EngineConfig for Brave web search. Ported from reference/ddgs/engines/brave.py.

var DuckDuckGo = EngineConfig{
	Name:            "duckduckgo",
	MinDelay:        2 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    ddgBuildRequest,
	ParseResponse:   ddgParseResponse,
	PostProcess:     ddgPostProcess,
}

DuckDuckGo is the EngineConfig for DuckDuckGo web search. Ported from reference/ddgs/engines/duckduckgo.py.

var Google = EngineConfig{
	Name:            "google",
	ClientProfile:   "safari_ios_26_0",
	MinDelay:        3 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    googleBuildRequest,
	ParseResponse:   googleParseResponse,
	PostProcess:     googlePostProcess,
}

Google is the EngineConfig for Google web search.

var Grokipedia = EngineConfig{
	Name:            "grokipedia",
	MinDelay:        1 * time.Second,
	MaxRetries:      2,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    grokipediaBuildRequest,
	ParseResponse:   grokipediaParseResponse,
}

Grokipedia is the EngineConfig for the Grokipedia typeahead API. Ported from reference/ddgs/engines/grokipedia.py.

JSON API: GET https://grokipedia.com/api/typeahead?query=<query>&limit=<limit>
Returns: {"results": [{"title": "...", "snippet": "...", "slug": "..."}]}
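
A request URL and decoder for the endpoint and response shape documented above might look like the sketch below. The typeaheadResponse type and both helper functions are hypothetical names, and how a slug maps to a page URL is not stated here, so the sketch leaves the slug untouched.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/url"
	"strconv"
)

// typeaheadResponse mirrors the documented response shape.
type typeaheadResponse struct {
	Results []struct {
		Title   string `json:"title"`
		Snippet string `json:"snippet"`
		Slug    string `json:"slug"`
	} `json:"results"`
}

// typeaheadURL builds the documented GET URL with escaped parameters.
func typeaheadURL(query string, limit int) string {
	v := url.Values{}
	v.Set("query", query)
	v.Set("limit", strconv.Itoa(limit))
	return "https://grokipedia.com/api/typeahead?" + v.Encode()
}

// decodeTypeahead parses a response body into typeaheadResponse.
func decodeTypeahead(body []byte) (typeaheadResponse, error) {
	var resp typeaheadResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	fmt.Println(typeaheadURL("golang concurrency", 5))

	body := []byte(`{"results":[{"title":"Go","snippet":"A language","slug":"Go_(programming_language)"}]}`)
	resp, err := decodeTypeahead(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range resp.Results {
		fmt.Printf("%s (%s): %s\n", r.Title, r.Slug, r.Snippet)
	}
}
```

Note that url.Values.Encode sorts parameters by key, so limit appears before query in the generated URL; the order of query parameters should not matter to the server.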

var Mojeek = EngineConfig{
	Name:            "mojeek",
	MinDelay:        2 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    mojeekBuildRequest,
	ParseResponse:   mojeekParseResponse,
}

Mojeek is the EngineConfig for Mojeek web search. Ported from reference/ddgs/engines/mojeek.py.

var Wikipedia = EngineConfig{
	Name:            "wikipedia",
	MinDelay:        1 * time.Second,
	MaxRetries:      2,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    wikipediaBuildRequest,
	ParseResponse:   wikipediaParseResponse,
}

Wikipedia is the EngineConfig for Wikipedia OpenSearch API. Ported from reference/ddgs/engines/wikipedia.py.

Unlike other engines, Wikipedia returns JSON (OpenSearch format), not HTML. The response is a JSON array: ["query", ["titles..."], ["descriptions..."], ["urls..."]]
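
Because the OpenSearch array mixes a string with three string arrays, it cannot be decoded into a single typed slice. One way to handle it, sketched below, is to decode into json.RawMessage elements first. parseOpenSearch is a hypothetical helper, not the package's wikipediaParseResponse, and the Result type is a local copy.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// Result is a local copy of the documented Result fields.
type Result struct {
	Title, URL, Snippet, Engine string
}

// parseOpenSearch decodes ["query", [titles], [descriptions], [urls]].
func parseOpenSearch(body []byte) ([]Result, error) {
	var raw []json.RawMessage
	if err := json.Unmarshal(body, &raw); err != nil {
		return nil, err
	}
	if len(raw) < 4 {
		return nil, errors.New("opensearch: expected 4 top-level elements")
	}
	var titles, descs, urls []string
	for i, dst := range []*[]string{&titles, &descs, &urls} {
		// raw[0] is the echoed query, so the arrays start at index 1.
		if err := json.Unmarshal(raw[i+1], dst); err != nil {
			return nil, err
		}
	}
	results := make([]Result, 0, len(titles))
	for i, title := range titles {
		if i >= len(urls) {
			break // ignore titles without a matching URL
		}
		r := Result{Title: title, URL: urls[i], Engine: "wikipedia"}
		if i < len(descs) {
			r.Snippet = descs[i]
		}
		results = append(results, r)
	}
	return results, nil
}

func main() {
	body := []byte(`["go",["Go","Go (game)"],["lang","board game"],` +
		`["https://en.wikipedia.org/wiki/Go","https://en.wikipedia.org/wiki/Go_(game)"]]`)
	results, err := parseOpenSearch(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range results {
		fmt.Printf("%s: %s\n", r.Title, r.URL)
	}
}
```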

var Yahoo = EngineConfig{
	Name:            "yahoo",
	MinDelay:        2 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    yahooBuildRequest,
	ParseResponse:   yahooParseResponse,
	PostProcess:     yahooPostProcess,
}

Yahoo is the EngineConfig for Yahoo web search. Ported from reference/ddgs/engines/yahoo.py.

var Yandex = EngineConfig{
	Name:            "yandex",
	MinDelay:        2 * time.Second,
	MaxRetries:      3,
	RetryableStatus: defaultRetryableStatus,
	BuildRequest:    yandexBuildRequest,
	ParseResponse:   yandexParseResponse,
}

Yandex is the EngineConfig for Yandex web search. Ported from reference/ddgs/engines/yandex.py.

Functions

func CleanText

func CleanText(s string) string

CleanText trims whitespace and collapses internal whitespace/newlines.
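
An equivalent of the described behavior can be written with strings.Fields. This is a sketch of the same idea, not necessarily how CleanText is implemented; cleanText is a local stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanText trims leading and trailing whitespace and collapses every
// internal run of whitespace (including newlines) to a single space.
func cleanText(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", cleanText("  Go is\n\tan open source   language \n"))
}
```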

func UnwrapRedirect

func UnwrapRedirect(href string, pattern RedirectPattern) string

UnwrapRedirect extracts the real URL from a search engine redirect wrapper.

func XPathExtract

func XPathExtract(doc *html.Node, itemsXPath string, fields map[string]string) ([]map[string]string, error)

XPathExtract finds containers via itemsXPath, then extracts fields from each container using the fields map (field name -> XPath expression). Mirrors ddgs's BaseSearchEngine.extract_results().

Types

type ClientOpts

type ClientOpts struct {
	BrowserProfile string // key into profiles.MappedTLSClients; empty = default
}

ClientOpts configures the TLS-impersonating HTTP client.

type EngineConfig

type EngineConfig struct {
	Name          string
	BuildRequest  func(query string, opts SearchOpts) (*http.Request, error)
	ParseResponse func(resp *http.Response) ([]Result, error)
	PostProcess   func(results []Result) []Result

	// ClientProfile overrides the TLS client profile for this engine.
	// If set, Execute creates a dedicated client with this profile.
	// This is needed when the engine's User-Agent requires a matching
	// TLS fingerprint (e.g. Google's GSA UA needs Safari iOS profile).
	ClientProfile string

	MinDelay        time.Duration
	MaxRetries      int
	RetryableStatus func(statusCode int) bool
}

EngineConfig defines a search engine's scraping pipeline.

func AllEngines

func AllEngines() []EngineConfig

AllEngines returns the default set of built-in engines. Google is excluded from this set; use EngineByName("google") to obtain it.

func EngineByName

func EngineByName(name string) (EngineConfig, bool)

EngineByName looks up any engine by name, including engines not in AllEngines() (e.g. Google). Returns false if not found.

type HTTPClient

type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

HTTPClient is the interface the pipeline calls. Tests substitute a fake.
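
A test double for this interface can be quite small. In the sketch below, fakeClient and fetch are hypothetical names: fakeClient returns a canned response, and fetch stands in for any code that depends only on HTTPClient.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// HTTPClient is a local copy of the documented interface.
type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

// fakeClient returns a canned response and counts calls.
type fakeClient struct {
	status int
	body   string
	calls  int
}

func (f *fakeClient) Do(req *http.Request) (*http.Response, error) {
	f.calls++
	return &http.Response{
		StatusCode: f.status,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(f.body)),
		Request:    req,
	}, nil
}

// fetch issues a GET through any HTTPClient and returns status and body.
func fetch(c HTTPClient, rawURL string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return 0, "", err
	}
	resp, err := c.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(b), err
}

func main() {
	fc := &fakeClient{status: http.StatusOK, body: "<html>ok</html>"}
	status, body, err := fetch(fc, "https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(status, body, fc.calls)
}
```

Since no network is involved, a fake like this lets parsing and retry logic be exercised in unit tests, for example by returning a 429 status on the first call.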

func NewClient

func NewClient(opts ClientOpts) (HTTPClient, error)

NewClient creates an HTTPClient backed by bogdanfinn/tls-client. This is the only place in the codebase that imports tls-client directly; everything else uses the HTTPClient interface from result.go.

type MultiSearch

type MultiSearch struct {
	Client  HTTPClient
	Engines []EngineConfig

	// EngineTimeout is the maximum time to wait for any single engine.
	// If an engine exceeds this deadline (e.g. due to rate-limit retries),
	// its context is canceled and results from faster engines are returned.
	// Zero means 10 seconds.
	EngineTimeout time.Duration
}

MultiSearch dispatches a query to multiple engines concurrently.
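
The concurrent dispatch and per-engine timeout can be sketched with a WaitGroup and context.WithTimeout. Everything named here (engineFunc, effectiveTimeout, dispatch, demo) is hypothetical; the real Search method may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// engineFunc stands in for running one engine's pipeline.
type engineFunc func(ctx context.Context) ([]string, error)

// effectiveTimeout applies the documented default: zero means 10 seconds.
func effectiveTimeout(d time.Duration) time.Duration {
	if d == 0 {
		return 10 * time.Second
	}
	return d
}

// dispatch runs every engine in its own goroutine, each under its own
// deadline, and collects results and errors by engine name.
func dispatch(ctx context.Context, engines map[string]engineFunc, timeout time.Duration) (map[string][]string, map[string]error) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	results := make(map[string][]string)
	errs := make(map[string]error)
	for name, fn := range engines {
		wg.Add(1)
		go func(name string, fn engineFunc) {
			defer wg.Done()
			ectx, cancel := context.WithTimeout(ctx, effectiveTimeout(timeout))
			defer cancel()
			urls, err := fn(ectx)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs[name] = err
				return
			}
			results[name] = urls
		}(name, fn)
	}
	wg.Wait()
	return results, errs
}

// demo runs one fast engine and one that blocks until its deadline.
func demo(timeout time.Duration) (map[string][]string, map[string]error) {
	engines := map[string]engineFunc{
		"fast": func(ctx context.Context) ([]string, error) {
			return []string{"https://example.com/"}, nil
		},
		"slow": func(ctx context.Context) ([]string, error) {
			<-ctx.Done() // simulate an engine stuck in retries
			return nil, ctx.Err()
		},
	}
	return dispatch(context.Background(), engines, timeout)
}

func main() {
	results, errs := demo(20 * time.Millisecond)
	fmt.Println("results:", results)
	fmt.Println("errors:", errs)
}
```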

func (*MultiSearch) Search

func (m *MultiSearch) Search(ctx context.Context, query string, opts SearchOpts) (*SearchResult, error)

Search runs all engines concurrently, deduplicates results by URL, and collects per-engine errors.

type RedirectPattern

type RedirectPattern int

RedirectPattern identifies a URL redirect scheme.

const (
	RedirectNone   RedirectPattern = iota
	RedirectDDG                    // //duckduckgo.com/l/?uddg=...
	RedirectYahoo                  // .../RU=.../RK=...
	RedirectGoogle                 // /url?q=...
)
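
For the two query-parameter patterns in the comments above (DuckDuckGo's uddg and Google's q), unwrapping can be sketched with net/url. unwrapQueryParam is a hypothetical helper rather than the package's UnwrapRedirect, and the path-based Yahoo pattern is not covered.

```go
package main

import (
	"fmt"
	"net/url"
)

// unwrapQueryParam returns the decoded value of the given query parameter,
// or href unchanged when it cannot be parsed or the parameter is absent.
func unwrapQueryParam(href, param string) string {
	u, err := url.Parse(href)
	if err != nil {
		return href
	}
	if target := u.Query().Get(param); target != "" {
		return target
	}
	return href
}

func main() {
	// DuckDuckGo-style wrapper: the target is percent-encoded in uddg.
	fmt.Println(unwrapQueryParam("//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs&rut=abc", "uddg"))
	// Google-style wrapper: the target is carried in q.
	fmt.Println(unwrapQueryParam("/url?q=https://example.com/page&sa=U", "q"))
}
```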

type Result

type Result struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Snippet string `json:"snippet"`
	Engine  string `json:"engine"`
}

Result is a single search result from any engine.

func Execute

func Execute(ctx context.Context, client HTTPClient, engine EngineConfig, query string, opts SearchOpts) ([]Result, error)

Execute runs the full engine pipeline: BuildRequest -> HTTP -> ParseResponse -> PostProcess. Handles rate limiting and retries with exponential backoff.

type SearchOpts

type SearchOpts struct {
	MaxResults int
	Page       int    // 1-based page number (default: 1)
	Region     string // e.g. "us-en"
	SafeSearch string // "on", "moderate", "off"
	TimeLimit  string // "d" (day), "w" (week), "m" (month), "y" (year)
}

SearchOpts controls a search request.

type SearchResult

type SearchResult struct {
	Results []Result
	Errors  map[string]error
}

SearchResult is what MultiSearch returns.

Directories

Path Synopsis
cmd/search (command)
