knownbots

v1.0.7
Published: Jan 12, 2026 License: MIT Imports: 21 Imported by: 0

README

knownbots

KnownBots is a high-performance Go library for verifying search engine crawlers and identifying legitimate bots. It protects your web services from bot impersonation by validating User-Agent strings and IP addresses through RDNS lookups and IP range verification.

Why KnownBots?

The Problem: Malicious actors can easily spoof User-Agent strings to impersonate legitimate search engine bots (Googlebot, Bingbot, etc.) to bypass rate limits, scrape content, or exploit bot-specific logic.

The Solution: KnownBots verifies each bot claim in three steps:

  1. Matching User-Agent markers (case-sensitive word boundaries)
  2. Verifying IP ownership through reverse DNS lookups or official IP ranges
  3. Caching results to avoid expensive DNS queries on subsequent requests

Key Features

🚀 High Performance
  • Lock-free reads via atomic.Pointer[T] for bot configuration and RDNS cache
  • Zero-allocation hot paths using netip.Prefix for IP matching
  • Byte-level indexing for O(1) bot lookup (150-300ns for 40 bots vs 640ns linear scan)
  • Copy-on-Write caching optimized for read-heavy workloads (1-20 writes/day)
  • Embedded bots - 57 built-in configs compiled into binary (no file I/O at startup)
  • Optional UA classification - Disabled by default for maximum performance
  • Logging control - Disable log output via knownbots.EnableLog = false
🔒 Security First
  • Case-sensitive matching prevents forgery attempts (official bots use fixed casing)
  • Word boundary validation prevents partial matches (e.g., "MyGooglebot" won't match)
  • LRU fail cache for fast rejection of known-bad IPs (1000 entry limit)
  • Browser detection distinguishes legitimate users from suspicious bot-like patterns (opt-in)
📦 Production Ready
  • Persistent RDNS cache survives restarts (file-based storage)
  • Background scheduler automatically refreshes IP ranges from official URLs
  • Graceful degradation (cache persistence failures don't affect runtime)
  • Comprehensive tests with benchmarks for 3-40 bot scenarios
🌍 Extensible
  • YAML-based configuration for easy bot additions (no code changes)
  • Pluggable verification supports both IP ranges and RDNS verification
  • Official source integration automatically downloads and updates IP lists

Installation

go get github.com/cnlangzi/knownbots

Requirements: Go 1.21+

Quick Start

Basic Usage
package main

import (
    "fmt"
    "log"

    "github.com/cnlangzi/knownbots"
)

func main() {
    // Initialize validator (starts background scheduler)
    v, err := knownbots.New()
    if err != nil {
        log.Fatal(err)
    }
    defer v.Close()

    // Verify a bot claim
    result := v.Validate(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "66.249.66.1",
    )

    // Result exposes BotName, BotKind, IsBot, and Status (see the Result type in the API docs).
    fmt.Printf("IsBot: %t\n", result.IsBot)                                 // true
    fmt.Printf("Verified: %t\n", result.Status == knownbots.StatusVerified) // true
    fmt.Printf("Bot Name: %s\n", result.BotName)                            // "googlebot"
}
}
HTTP Middleware Example
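// Requires the context, net, and net/http packages in addition to knownbots.

// botResultKey is an unexported context key type, which avoids collisions
// with string keys set by other packages.
type botResultKey struct{}

func BotVerificationMiddleware(v *knownbots.Validator) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ua := r.Header.Get("User-Agent")

            // RemoteAddr has the form "host:port"; strip the port before validating.
            // In production behind a trusted proxy, extract the IP from X-Forwarded-For.
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }

            result := v.Validate(ua, ip)

            // Block fake bots (claims to be bot but IP not verified)
            if result.IsBot && result.Status != knownbots.StatusVerified {
                http.Error(w, "Forbidden: Bot verification failed", http.StatusForbidden)
                return
            }

            // Add verification metadata to request context
            ctx := context.WithValue(r.Context(), botResultKey{}, result)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}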
Configuration Options
v, err := knownbots.New(
    knownbots.WithRoot("./custom-bots"),    // Custom bot config directory
    knownbots.WithFailLimit(5000),          // Failed lookup cache size
    knownbots.WithClassifyUA(),             // Enable UA classification (disabled by default)
)

// Disable logging to reduce console pollution (e.g., in benchmarks)
knownbots.EnableLog = false

Configuration

Directory Structure
bots/
├── conf.d/              # Bot configurations (YAML)
│   ├── googlebot.yaml
│   ├── bingbot.yaml
│   └── ...
├── googlebot/           # Bot-specific data (auto-created)
│   ├── rdns.txt        # Persistent RDNS cache
│   └── ips.txt         # Downloaded IP ranges
└── ...
Bot Configuration (YAML)
name: googlebot
ua: "Googlebot"                           # EXACT casing required (case-sensitive)
urls:                                      # Official IP list URLs (auto-downloaded)
  - "https://www.gstatic.com/ipranges/google.json"
custom:                                    # Static CIDR ranges (always checked)
  - "66.249.64.0/19"
asn:                                       # ASN numbers for verification (optional)
  - 15169
domains:                                   # Verified RDNS domains
  - "googlebot.com"
  - "google.com"
rdns: true                             # Enable RDNS verification (false = IP-only)

Important:

  • User-Agent markers (ua) are case-sensitive. Official bots use fixed casing (e.g., "Googlebot", never "googlebot"). This prevents forgery attempts where attackers alter casing to bypass detection.
  • Set rdns: false for bots that only need IP range verification (faster, no DNS queries)
  • ASN verification is optional and provides faster IP ownership verification (~35ns) compared to RDNS (~450ns) for bots with official ASN registrations
Parser Selection

Choose the correct parser based on the IP list format:

| Format | JSON Example | Parser |
|---|---|---|
| Google-style | {"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}]} | google |
| OpenAI-style | {"prefixes": [{"prefix": "1.2.3.4/24"}]} | openai |
| Plain text | 1.2.3.4/24 or 172.16.0.5 | txt |
| GitHub-style | {"hooks": ["1.2.3.4/24"], "web": [...]} | github |
| Stripe-style | {"WEBHOOKS": ["3.18.12.63"]} | stripe |
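
To make the Google-style format concrete, here is a minimal sketch of how such a list could be decoded into netip.Prefix values. The names `googleList` and `parseGoogleStyle` are hypothetical and this is not the library's parser; skipping malformed entries and masking host bits are assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// googleList mirrors the Google-style layout shown in the table above.
type googleList struct {
	Prefixes []struct {
		IPv4 string `json:"ipv4Prefix"`
		IPv6 string `json:"ipv6Prefix"`
	} `json:"prefixes"`
}

// parseGoogleStyle decodes a Google-style IP list. Malformed entries are
// skipped, and prefixes are masked so "1.2.3.4/24" becomes 1.2.3.0/24.
func parseGoogleStyle(data []byte) ([]netip.Prefix, error) {
	var list googleList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var out []netip.Prefix
	for _, p := range list.Prefixes {
		for _, s := range []string{p.IPv4, p.IPv6} {
			if s == "" {
				continue
			}
			pfx, err := netip.ParsePrefix(s)
			if err != nil {
				continue // skip entries that are not valid CIDR notation
			}
			out = append(out, pfx.Masked())
		}
	}
	return out, nil
}

func main() {
	data := []byte(`{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}`)
	prefixes, err := parseGoogleStyle(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(prefixes)
}
```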
User-Agent Matching Rules
  1. Case-sensitive: Use exact casing from official documentation

    • ✅ Correct: ua: "Googlebot" or ua: "bingbot"
    • ❌ Wrong: ua: "googlebot" or ua: "BINGBOT"
  2. Match type: Word boundary matching (not substring)

    • ua: "Googlebot" matches: Googlebot/2.1, Mozilla/5.0 (compatible; Googlebot/2.1; ...)
    • ua: "Googlebot" does NOT match: MyGooglebot, GooglebotPro
  3. Special bots: Some bots don't use Mozilla prefix

    • ua: "GPTBot" (OpenAI)
    • ua: "curl" (CLI tool)
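
The matching rules above can be sketched as follows. This is an illustration, not the library's matcher: `matchUA` and `isWordByte` are hypothetical names, and treating only ASCII letters and digits as word characters is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isWordByte reports whether c is an ASCII letter or digit.
func isWordByte(c byte) bool {
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
}

// matchUA reports whether marker occurs in ua as a whole word, case-sensitively.
func matchUA(ua, marker string) bool {
	if marker == "" {
		return false
	}
	for start := 0; ; {
		i := strings.Index(ua[start:], marker)
		if i < 0 {
			return false
		}
		i += start
		end := i + len(marker)
		leftOK := i == 0 || !isWordByte(ua[i-1])
		rightOK := end == len(ua) || !isWordByte(ua[end])
		if leftOK && rightOK {
			return true
		}
		start = i + 1 // keep scanning for a later occurrence
	}
}

func main() {
	for _, ua := range []string{
		"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
		"MyGooglebot",
		"GooglebotPro",
		"googlebot/2.1",
	} {
		fmt.Printf("%-75q %t\n", ua, matchUA(ua, "Googlebot"))
	}
}
```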

How It Works

Verification Flow
┌─────────────────────────────────────────────────────────────┐
│                     Incoming Request                         │
│                  (User-Agent + IP Address)                   │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
         ┌────────────────────┐
         │  UA Matches Bot?   │──No──▶ Classify UA Type
         └────────┬───────────┘        (Browser/Suspicious/Unknown)
                  │ Yes                         │
                  ▼                             ▼
         ┌────────────────────┐        Return: IsBot=false
         │  Check IP Ranges   │        (legitimate browser)
         │  (CIDR matching)   │
         └────────┬───────────┘
                  │
                  ├─ Hit ──▶ Return: verified
                  │
                  ├─ Miss + asn empty ──▶ Check RDNS
                  │
                  ├─ Miss + asn defined ──▶ Check ASN
                  │                              │
                  │                              ├─ Hit ──▶ Return: verified
                  │                              │
                  │                              └─ Miss ──▶ Check RDNS
                  │
                  ▼
         ┌────────────────────┐
         │   Bot.RDNS=true?   │──No──▶ Return: failed
         └────────┬───────────┘        (IP-only bot, no DNS check)
                  │ Yes
                  ▼
         ┌────────────────────┐
         │  Check Fail Cache  │──Hit──▶ Return: failed
         │  (LRU, 1000 IPs)   │        (known fake bot)
         └────────┬───────────┘
                  │ Miss
                  ▼
         ┌────────────────────┐
         │ Check RDNS Cache   │──Hit──▶ Domain match?
         │  (persistent)      │         Yes: verified
         └────────┬───────────┘         No: failed
                  │ Miss
                  ▼
         ┌────────────────────┐
         │ Perform RDNS Lookup│──▶ Domain match?
         │  (50-200ms delay)  │     Yes: verified + cache
         └────────────────────┘     No: failed + fail cache
Background Scheduler (Every 24h)
┌─────────────────────────────────────────────────────────────┐
│                    Background Scheduler                      │
└─────────────────┬───────────────────────────────────────────┘
                  │
        ┌─────────┴─────────┬──────────┐
        │                   │          │
        ▼                   ▼          ▼
  ┌──────────┐      ┌──────────────┐ ┌─────────────┐
  │ Refresh  │      │ Update ASN   │ │ Prune &     │
  │ IP Lists │      │ Data         │ │ Save        │
  │ (HTTP)   │      │ (RIPE API)   │ │ RDNS        │
  └──────────┘      └──────────────┘ │ Cache       │
        │                   │        │ (rdns=true) │
        ▼                   ▼        └─────────────┘
  Update memory      Update cache          │
  Persist to file    Persist to file       ▼
                     (per-bot dir)    Remove invalid
                                        Persist to file

Performance

Benchmarks (40 bots, Intel i5-1038NG7 @ 2.00GHz)
| Operation | Time/op | Allocs/op | Notes |
|---|---|---|---|
| UA matching (hit first) | 165ns | 0 | Byte index + word boundary check |
| UA matching (hit middle) | 300ns | 0 | Worst case: mid-list match |
| UA matching (miss) | 640ns | 0 | Full scan + browser classification |
| Validate (IP range hit) | 227ns | 0 | Radix tree CIDR matching |
| Validate (ASN hit) | 35ns | 1 | O(1) Patricia tree lookup |
| Validate (RDNS hit) | 450ns | 0 | Cache lookup + domain match |
| Validate (cold lookup) | 50-200ms | 1-2 | DNS query (first time only) |

Key Insight: Verification priority is IP ranges → ASN → RDNS. ASN verification (~35ns) is faster than RDNS cache lookup (~450ns) and ideal for bots with official ASN registrations.

Scalability
| Bot Count | Index Benefit | Recommended Index |
|---|---|---|
| < 20 bots | Minimal (2x) | Single byte (current) |
| 20-50 bots | Significant (4-5x) | Single byte (current) |
| > 50 bots | Critical (10x+) | Consider 3-char prefix |

Current implementation is optimized for 3-50 bots (covers 99% of use cases).

API Reference

Types
type Validator struct { /* ... */ }

type Result struct {
    BotName string       // Bot name (e.g., "googlebot")
    BotKind BotKind      // Bot category (e.g., KindSearchEngine)
    IsBot   bool         // True if UA matches any bot or looks bot-like
    Status  ResultStatus // Verification outcome (see constants below)
}

type ResultStatus int
const (
    StatusUnknown  ResultStatus = 0 // Not a bot (normal browser)
    StatusVerified ResultStatus = 1 // IP verified successfully
    StatusPending  ResultStatus = 2 // RDNS network error, can retry
    StatusFailed   ResultStatus = 3 // IP not matched, suspected fake bot
)
Methods
// New creates a validator with background scheduler
func New(opts ...Option) (*Validator, error)

// Validate verifies User-Agent and IP address
func (v *Validator) Validate(ua, ip string) Result

// Close stops background scheduler
func (v *Validator) Close() error
Options
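// WithRoot sets custom bot directory (default: "./bots")
func WithRoot(dir string) Option

// WithFailLimit sets failed lookup cache size (default: 1000)
func WithFailLimit(limit int) Option

// WithClassifyUA enables UA classification for non-bot UAs (disabled by default)
func WithClassifyUA() Option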

Real-World Use Cases

1. Rate Limiting
// Apply different rate limits for verified bots vs browsers
result := validator.Validate(ua, ip)
if result.Status == knownbots.StatusVerified {
    limiter = rateLimits.Bot  // Generous: 10/sec
} else if result.IsBot {
    limiter = rateLimits.FakeBot  // Strict: 1/min
} else {
    limiter = rateLimits.Browser  // Normal: 5/sec
}
2. Analytics Exclusion
// Exclude verified bots from user analytics
result := validator.Validate(ua, ip)
if result.Status != knownbots.StatusVerified {
    analytics.Track(userID, event)
}
3. SEO Testing
// Allow verified Googlebot to bypass feature flags
result := validator.Validate(ua, ip)
if result.BotName == "googlebot" && result.Status == knownbots.StatusVerified {
    features.EnableAll()  // Show production content for indexing
}
4. Content Protection
// Block fake bots from scraping paywalled content
result := validator.Validate(ua, ip)
if result.IsBot && result.Status != knownbots.StatusVerified {
    return http.StatusForbidden  // Suspected scraper
}

Supported Bots (Built-in Configs)

Current built-in configurations:

  • Googlebot (Google Search)
  • Bingbot (Microsoft Bing)
  • facebookexternalhit (Facebook/Meta link previews)
  • GPTBot (OpenAI)
  • Applebot (Apple Search and Siri)
  • GitHub (GitHub webhooks)
  • Stripe (Stripe webhooks)
  • UptimeRobot (Uptime monitoring)

Need more bots? Add YAML configs to bots/conf.d/ - no code changes required!

Common bots to add:

  • Yandex (YandexBot)
  • Baidu (Baiduspider)
  • DuckDuckGo (DuckDuckBot)
  • Twitter (Twitterbot)
  • Slack (Slackbot)

See bots/conf.d/googlebot.yaml for configuration examples.

Testing

# Run all tests
go test ./...

# Run only unit tests (skip integration tests)
go test -short ./...

# Run benchmarks
go test -bench=. -benchmem

# Run specific test
go test -v -run '^TestValidator$'

# Coverage report
go test -cover ./...

Integration Tests: The project includes integration tests that verify parsing of real API responses from:

  • GoogleBot: 307 prefixes
  • Bingbot: 28 prefixes
  • GPTBot: 21 prefixes
  • GitHub: 50 prefixes
  • Stripe: 12 IPs
  • UptimeRobot: 116 prefixes
  • Applebot: 12 prefixes

Architecture Decisions

Why atomic.Pointer[T] instead of RWMutex?

Bot configurations change rarely (on reload/schedule, 1-20x/day) but are read on every request (1000s/sec). atomic.Pointer[T] provides:

  • Lock-free reads - single atomic load, no lock acquisition overhead
  • Readers never block - writes don't wait for readers, readers don't wait for writes (Copy-on-Write)
  • Consistent performance - no priority inversion or cache line contention from lock operations

Together, these properties give consistent sub-microsecond performance for read-heavy workloads.

Why case-sensitive UA matching?

Official bots use fixed casing ("Googlebot", never "googlebot"). Case variations indicate forgery. Case-sensitive matching:

  1. Rejects fakes at first stage (no expensive DNS queries)
  2. 4x faster than case-insensitive (16ns vs 67ns)
  3. Improves both security and performance
Why Copy-on-Write cache?

RDNS cache sees 1-20 new IPs per day but 1000s of reads per second (99.99% read ratio). Copy-on-Write with atomic swap provides:

  • Zero-allocation reads (no locking)
  • Safe concurrent access
  • Simple implementation (vs lock-free data structures)
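
A minimal sketch of the Copy-on-Write pattern with atomic.Pointer[T] follows. The type `cowCache` is hypothetical and is not the library's RDNS cache; the mutex that serialises writers is an assumption of this sketch. Readers perform one atomic load and never take a lock, while each write copies the map and swaps the pointer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a read-optimised string map. Published maps are never mutated.
type cowCache struct {
	mu   sync.Mutex // serialises writers only; readers never take it
	data atomic.Pointer[map[string]string]
}

func newCowCache() *cowCache {
	c := &cowCache{}
	empty := map[string]string{}
	c.data.Store(&empty)
	return c
}

// Get reads from the current snapshot with a single atomic load.
func (c *cowCache) Get(key string) (string, bool) {
	v, ok := (*c.data.Load())[key]
	return v, ok
}

// Set copies the current snapshot, adds the entry, and publishes the copy.
// The O(n) copy is acceptable because writes are rare.
func (c *cowCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.data.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = value
	c.data.Store(&next)
}

func main() {
	cache := newCowCache()
	cache.Set("66.249.66.1", "crawl-66-249-66-1.googlebot.com")
	host, ok := cache.Get("66.249.66.1")
	fmt.Println(host, ok)
}
```

Because a published map is never modified, a reader that loaded an older snapshot keeps a consistent view even while a writer is replacing it.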
Why byte-level index?

Linear bot list scan is fast for 3 bots (52ns) but degrades to 640ns at 40 bots. Single-character index provides 4-5x speedup for 20-50 bots at minimal memory cost (<1KB).
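
One way such an index could look is sketched below: markers are bucketed by their first byte, so at each word start in the User-Agent only the markers beginning with that byte are compared. The names `index`, `buildIndex`, and `find` are hypothetical, and the library's actual layout may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// index buckets UA markers by their first byte (256 possible buckets).
type index [256][]string

func isWordByte(c byte) bool {
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
}

func buildIndex(markers []string) *index {
	ix := new(index)
	for _, m := range markers {
		if m != "" {
			ix[m[0]] = append(ix[m[0]], m)
		}
	}
	return ix
}

// find returns the first marker that appears in ua as a whole word.
func (ix *index) find(ua string) (string, bool) {
	for i := 0; i < len(ua); i++ {
		if i > 0 && isWordByte(ua[i-1]) {
			continue // not at a word start, so no marker can begin here
		}
		for _, m := range ix[ua[i]] {
			end := i + len(m)
			if strings.HasPrefix(ua[i:], m) && (end == len(ua) || !isWordByte(ua[end])) {
				return m, true
			}
		}
	}
	return "", false
}

func main() {
	ix := buildIndex([]string{"Googlebot", "bingbot", "GPTBot"})
	fmt.Println(ix.find("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))
	fmt.Println(ix.find("MyGooglebot"))
}
```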

Why Bot method encapsulation?

IP and ASN lifecycle operations (load, refresh, persist) are shared between initialization and the background scheduler. Encapsulating these as Bot methods:

  • Eliminates duplicate code - initBot and runScheduler both call the same loadCachedIPs, refreshIPs, initializeASN, and refreshASN methods
  • Centralizes state - IPTree and ASN cache pointers live on the Bot struct, making ownership clear
  • Improves testability - Each lifecycle method can be unit tested in isolation
  • Enables future extensions - New verification methods (e.g., BGP feeds) can follow the same pattern

Example Bot methods:

func (b *Bot) loadCachedIPs(path string)   // Load cached prefixes from file
func (b *Bot) refreshIPs(http *http.Client, root string) // Download and persist new prefixes
func (b *Bot) initializeASN(store *asn.Store)  // Load ASN cache with fallback to API
func (b *Bot) refreshASN(store *asn.Store)     // Refresh ASN prefixes from API

Adding New Bots

Adding a new bot requires no code changes - just create a YAML configuration file.

Step 1: Choose Verification Method
| Method | When to Use | Example |
|---|---|---|
| URL + Parser | Bot has official JSON/TXT IP list | Googlebot, Bingbot, GPTBot |
| ASN | Bot has official ASN registration | Cloudflare (AS13335), Google (AS15169) |
| RDNS Only | No official IP list, verify via DNS | Baidu, Yandex |
Step 2: Create Configuration File

Create bots/conf.d/newbot.yaml:

# Case 1: Bot with official JSON IP list (RECOMMENDED)
kind: SearchEngine        # Category: SearchEngine, SocialMedia, Tool, etc.
name: newbot              # Unique identifier (used in results)
parser: google            # Parser: google, openai, txt, github, stripe
ua: "NewBot"              # User-Agent fragment (case-sensitive!)
urls:
  - "https://example.com/bot-ips.json"

# Case 2: Bot with ASN verification (fastest option)
kind: SearchEngine
name: newbot
ua: "NewBot"
asn:
  - 12345                # ASN number (fetched from RIPE API)

# Case 3: Bot with RDNS verification only (no official IP list)
kind: SearchEngine
name: newbot
ua: "NewBot"
domains:
  - "newbot.example.com"
rdns: true
Step 3: Configure Parser

Choose the correct parser based on the IP list format:

Google-style (ipv4Prefix/ipv6Prefix fields):

{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}

Parser: google

OpenAI-style (prefix field):

{"prefixes": [{"prefix": "1.2.3.4/24"}]}

Parser: openai

Plain text (one CIDR or individual IP per line):

1.2.3.4/24
5.6.7.8/24
172.16.0.5

Parser: txt (converts individual IPs to /32 or /128 CIDR notation)

GitHub-style (hooks, web, api string arrays):

{"hooks": ["192.30.252.0/22"], "web": ["192.30.252.0/22"], "api": ["192.30.252.0/22"]}

Parser: github

Stripe-style (WEBHOOKS array with individual IPs):

{"WEBHOOKS": ["3.18.12.63", "3.130.192.231", "13.235.14.237"]}

Parser: stripe (converts individual IPs to /32 or /128 CIDR notation)

Step 4: Restart Validator

To apply new bot configurations, restart your application or recreate the Validator:

// Create a new validator with updated bots
v, err := knownbots.New(knownbots.WithRoot("./bots"))
if err != nil {
    log.Fatal(err)
}
defer v.Close()
Step 5: Verify
result := v.Validate(
    "Mozilla/5.0 (compatible; NewBot/1.0; +https://example.com/bot)",
    "1.2.3.4",
)

fmt.Printf("IsBot: %t\n", result.IsBot)                                 // true
fmt.Printf("Verified: %t\n", result.Status == knownbots.StatusVerified) // true
fmt.Printf("Bot Name: %s\n", result.BotName)                            // "newbot"
Example Configurations

Googlebot (official JSON, fast verification):

kind: SearchEngine
name: googlebot
parser: google
ua: "Googlebot"
urls:
  - "https://www.gstatic.com/ipranges/google.json"

Bingbot (official JSON):

kind: SearchEngine
name: bingbot
parser: google
ua: "bingbot"
urls:
  - "https://www.bing.com/toolbox/bingbot.json"

GPTBot (OpenAI uses Google-style JSON):

kind: AITraining
name: gptbot
parser: google
ua: "GPTBot"
urls:
  - "https://openai.com/gptbot.json"

Applebot (official JSON from developer.apple.com):

kind: SearchEngine
name: applebot
parser: google
ua: "Applebot"
urls:
  - "https://search.developer.apple.com/applebot.json"

GitHub Webhooks:

kind: Tool
name: github
parser: github
ua: "GitHub-Hookshot"
urls:
  - "https://api.github.com/meta"

Stripe Webhooks:

kind: Tool
name: stripe
parser: stripe
ua: "Stripe"
urls:
  - "https://stripe.com/files/ips/ips_webhooks.json"

UptimeRobot (plain text with individual IPs):

kind: Monitor
name: uptimerobot
parser: txt
ua: "UptimeRobot"
urls:
  - "https://uptimerobot.com/inc/files/ips/IPv4.txt"

Baidu (RDNS only, no official IP list):

kind: SearchEngine
name: baiduspider
ua: "Baiduspider"
domains:
  - "baidu.com"
  - "baidu.jp"
rdns: true

Yandex (RDNS only):

kind: SearchEngine
name: yandexbot
ua: "YandexBot"
domains:
  - "yandex.com"
  - "yandex.ru"
rdns: true
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Wrong casing | "googlebot" won't match "Googlebot/2.1" | Use exact casing: "Googlebot" |
| Wrong parser | JSON not parsed correctly | Match parser to JSON structure |
| Missing rdns: true | RDNS verification not performed | Add rdns: true for DNS-based bots |
| Empty custom: [] | Unnecessary configuration | Omit empty fields |
Testing New Bot Config
# Run tests to verify bot parsing
go test -v ./...

# Run specific parser test
go test -v -run TestGoogleParser ./parser/

# Validate IP list format
curl -s https://example.com/bot-ips.json | jq '.prefixes[0]'

Contributing

Contributions are welcome! Whether you want to add new bots, fix bugs, or improve documentation.

Ways to Contribute
  1. Add new bot configurations - Most contributions are just YAML files in bots/conf.d/
  2. Fix parser issues - Handle new or different IP list formats
  3. Improve documentation - Fix typos, clarify instructions, add examples
  4. Report bugs - Open issues with minimal reproduction steps
  5. Suggest features - Open discussions about new functionality
Submitting Pull Requests
  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b add-newbot
  3. Add your bot configuration to bots/conf.d/newbot.yaml
  4. Test your changes:
    go test -short ./...
    go test -v -run TestNewBot ./parser/
    
  5. Commit using Google Git convention:
    git commit -m "feat: add NewBot configuration
    
    - Add NewBot YAML configuration
    - Verify User-Agent matching
    - Test IP parsing with official API
    
    PiperOrigin-RevId: XXXXXXXX
    Change-Id: IXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    
  6. Push and create a Pull Request
Bot Configuration Guidelines

When adding a new bot configuration:

  1. Verify the User-Agent from official documentation

    • Use exact casing (e.g., "Googlebot", not "googlebot")
    • Check for word boundary matching requirements
  2. Find the official IP list URL

    • Most major bots publish JSON/TXT IP lists
    • Prefer official sources over third-party aggregators
  3. Choose the correct parser

    • Match the parser to the actual JSON structure
    • Test with real API response before submitting
  4. Test thoroughly

    • Run go test -short ./... to verify no regressions
    • Check integration tests pass for new bot if applicable
Code Style
  • Follow standard Go conventions
  • Run go fmt ./... before committing
  • Run go vet ./... to catch potential issues
  • Add tests for new functionality

License

MIT License

Author

Dayi Chen - GitHub

Acknowledgments

  • Inspired by Google's official bot verification documentation
  • Performance patterns influenced by Go stdlib's sync/atomic and net/netip designs
  • Special thanks to all contributors and users providing feedback

⭐ Star this project if you find it useful!

📝 Questions? Open an issue or start a discussion!

🐛 Found a bug? Please report it with minimal reproduction steps!

Documentation

Overview

Package knownbots provides bot verification through UserAgent and IP validation.

Constants

const (
	FailLRULimit      = 1000
	SchedulerInterval = 24 * time.Hour
)

Default settings

Variables

var EnableLog = true

EnableLog controls whether knownbots outputs log messages. Set to false to disable logging and reduce console pollution. Default: true (logging enabled)

Functions

func Distinct added in v1.0.7

func Distinct(prefixes []netip.Prefix) []netip.Prefix

Distinct removes duplicate prefixes from a slice.

func FilterInvalidPrefixes added in v1.0.7

func FilterInvalidPrefixes(prefixes []netip.Prefix) []netip.Prefix

FilterInvalidPrefixes removes invalid prefixes from a slice.

Types

type ASN added in v1.0.7

type ASN struct {
	// contains filtered or unexported fields
}

ASN provides fast IP lookup for ASN prefix matching using IPTree. Immutable after creation - use atomic.Pointer[ASN] at call site for thread-safety.

func NewASN added in v1.0.7

func NewASN() *ASN

NewASN creates a new ASN cache with an empty tree.

func (*ASN) ASNs added in v1.0.7

func (c *ASN) ASNs() []int

ASNs returns the list of ASN numbers in this cache.

func (*ASN) Add added in v1.0.7

func (c *ASN) Add(asn int, prefixes []netip.Prefix)

Add adds all prefixes from an ASN to the cache.

func (*ASN) Contains added in v1.0.7

func (c *ASN) Contains(ip netip.Addr) bool

Contains checks if an IP exists in any of the loaded prefixes.

func (*ASN) Count added in v1.0.7

func (c *ASN) Count() int

Count returns the total number of prefixes in the cache.

type Bot

type Bot struct {
	Name   string   `yaml:"name"`
	Kind   BotKind  `yaml:"kind"`
	Parser string   `yaml:"parser"`
	UA     string   `yaml:"ua"`
	URLs   []string `yaml:"urls"`

	ASN     []int    `yaml:"asn"`
	Domains []string `yaml:"domains"`
	RDNS    bool     `yaml:"rdns"`
	// contains filtered or unexported fields
}

func Load

func Load(dir string) ([]*Bot, error)

func (*Bot) ContainsIP

func (b *Bot) ContainsIP(ipStr string) bool

func (*Bot) VerifyRDNS

func (b *Bot) VerifyRDNS(ipStr string) ResultStatus

type BotKind

type BotKind string
const (
	KindSearchEngine BotKind = "SearchEngine"
	KindSocialMedia  BotKind = "SocialMedia"
	KindAITraining   BotKind = "AITraining"
	KindAIAssist     BotKind = "AIAssist"
	KindAIMixed      BotKind = "AIMixed"
	KindSEO          BotKind = "SEO"
	KindMonitor      BotKind = "Monitor"
	KindSecurity     BotKind = "Security"
	KindScraper      BotKind = "Scraper"
	KindUnknown      BotKind = "Unknown"
)

type BrowserKind

type BrowserKind string

BrowserKind represents the classification of a UserAgent structure.

const (
	Browser    BrowserKind = "browser"
	Suspicious BrowserKind = "suspicious"
	Unknown    BrowserKind = "unknown"
)

type Config

type Config struct {
	Root       string
	FailLimit  int
	ClassifyUA bool
}

Config holds the options for creating a Validator.

type IPPrefix

type IPPrefix = netip.Prefix

type IPTree added in v1.0.7

type IPTree struct {
	// contains filtered or unexported fields
}

func NewIPTree added in v1.0.7

func NewIPTree() *IPTree

func (*IPTree) Add added in v1.0.7

func (t *IPTree) Add(prefix netip.Prefix)

func (*IPTree) Contains added in v1.0.7

func (t *IPTree) Contains(ip netip.Addr) bool

func (*IPTree) Count added in v1.0.7

func (t *IPTree) Count() int

type LRU

type LRU struct {
	// contains filtered or unexported fields
}

LRU is an LRU cache for failed RDNS lookups. It provides fast rejection of IP addresses that have already been verified as invalid bot sources.

func NewLRU

func NewLRU(limit int) *LRU

NewLRU creates a new LRU cache with the specified limit.

func (*LRU) Add

func (l *LRU) Add(key string)

Add inserts a key into the cache. If the key already exists, it moves the entry to the front. If the cache is full, the least recently used entry is evicted.

func (*LRU) Contains

func (l *LRU) Contains(key string) bool

Contains returns true if the key exists in the cache. Uses read lock for better concurrency.
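
A bounded cache with the behaviour described above could be sketched with container/list and a map. The type `failLRU` is hypothetical and not the package's implementation. Because Contains takes only a read lock in this sketch, it does not refresh an entry's recency; only Add does.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// failLRU remembers up to limit keys, evicting the least recently added.
type failLRU struct {
	mu    sync.RWMutex
	limit int
	order *list.List               // front = most recent
	items map[string]*list.Element // key -> node in order
}

func newFailLRU(limit int) *failLRU {
	return &failLRU{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// Add inserts key, or moves it to the front if already present.
func (l *failLRU) Add(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if el, ok := l.items[key]; ok {
		l.order.MoveToFront(el)
		return
	}
	l.items[key] = l.order.PushFront(key)
	if l.order.Len() > l.limit {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(string))
	}
}

// Contains reports whether key is cached, under a read lock.
func (l *failLRU) Contains(key string) bool {
	l.mu.RLock()
	defer l.mu.RUnlock()
	_, ok := l.items[key]
	return ok
}

func main() {
	cache := newFailLRU(2)
	cache.Add("203.0.113.1")
	cache.Add("203.0.113.2")
	cache.Add("203.0.113.3") // evicts 203.0.113.1
	fmt.Println(cache.Contains("203.0.113.1"), cache.Contains("203.0.113.3"))
}
```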

type Option

type Option func(*Config)

Option is a functional option for configuring a Validator.

func WithClassifyUA added in v1.0.2

func WithClassifyUA() Option

WithClassifyUA enables UA classification for non-bot UAs. By default, classifyUA is disabled for performance. Enable it to distinguish legitimate browsers from suspicious UAs.

func WithFailLimit

func WithFailLimit(limit int) Option

WithFailLimit sets the limit for failed lookup cache.

func WithRoot

func WithRoot(dir string) Option

WithRoot sets the bots root directory (containing conf.d and data subdirs).

type RDNS added in v1.0.7

type RDNS struct {
	// contains filtered or unexported fields
}

RDNS provides thread-safe reverse DNS lookup caching. It stores successfully verified IP→hostname mappings (persistent).

func NewRDNS added in v1.0.7

func NewRDNS(filePath string) (*RDNS, error)

NewRDNS creates a new RDNS cache instance.

func (*RDNS) Close added in v1.0.7

func (c *RDNS) Close() error

Close is a no-op. Cache persistence is handled by Persist().

func (*RDNS) Get added in v1.0.7

func (c *RDNS) Get(key string) (string, bool)

Get retrieves a value from the cache.

func (*RDNS) Persist added in v1.0.7

func (c *RDNS) Persist() error

Persist writes all entries to the persistent file.

func (*RDNS) Prune added in v1.0.7

func (c *RDNS) Prune(domains []string)

Prune removes entries from the cache that are no longer valid.
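
What "no longer valid" means is not spelled out here. One plausible reading, given the domains parameter, is that a cached hostname is dropped when it no longer falls under any configured domain. The sketch below illustrates that reading only; `hostInDomains` and `prune` are hypothetical, and the dot-boundary suffix check is an assumption chosen so that "evilgooglebot.com" does not pass for "googlebot.com".

```go
package main

import (
	"fmt"
	"strings"
)

// hostInDomains reports whether host equals a domain or is a subdomain of one.
// The leading dot in the suffix check enforces a label boundary.
func hostInDomains(host string, domains []string) bool {
	host = strings.ToLower(strings.TrimSuffix(host, ".")) // PTR names often end with "."
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

// prune returns a copy of cache (IP -> hostname) without entries whose
// hostname is outside the allowed domains.
func prune(cache map[string]string, domains []string) map[string]string {
	kept := make(map[string]string, len(cache))
	for ip, host := range cache {
		if hostInDomains(host, domains) {
			kept[ip] = host
		}
	}
	return kept
}

func main() {
	cache := map[string]string{
		"66.249.66.1": "crawl-66-249-66-1.googlebot.com.",
		"203.0.113.9": "evilgooglebot.com",
	}
	fmt.Println(prune(cache, []string{"googlebot.com", "google.com"}))
}
```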

func (*RDNS) Set added in v1.0.7

func (c *RDNS) Set(key, value string)

Set stores a successful lookup result in the cache.

func (*RDNS) Size added in v1.0.7

func (c *RDNS) Size() int

Size returns the number of entries in the cache.

type Result

type Result struct {
	BotName string       `json:"bot_name"`
	BotKind BotKind      `json:"bot_kind"`
	IsBot   bool         `json:"is_bot"`
	Status  ResultStatus `json:"status"`
}

Result represents the verification result.

type ResultStatus

type ResultStatus int

ResultStatus represents the verification result status.

const (
	StatusVerified ResultStatus = 1 // IP verified successfully
	StatusPending  ResultStatus = 2 // RDNS network error, can retry
	StatusFailed   ResultStatus = 3 // IP not matched, suspected fake bot
	StatusUnknown  ResultStatus = 0 // Not a bot (normal browser)
)
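
Since ResultStatus has four values, callers may want to treat StatusPending differently from StatusFailed. The sketch below shows one possible policy using a local mirror of the constants so that it runs without the package; `resultStatus`, `decide`, and the action names are all hypothetical, and the policy itself is only an example.

```go
package main

import "fmt"

// resultStatus mirrors the documented ResultStatus values for illustration.
type resultStatus int

const (
	statusUnknown  resultStatus = 0 // not a bot
	statusVerified resultStatus = 1 // IP verified
	statusPending  resultStatus = 2 // RDNS network error, can retry
	statusFailed   resultStatus = 3 // suspected fake bot
)

// decide maps a validation outcome to an action under one example policy:
// block confirmed fakes, throttle when verification could not complete,
// and challenge bot-like UAs that matched no verified bot.
func decide(isBot bool, s resultStatus) string {
	switch {
	case s == statusVerified:
		return "allow"
	case s == statusFailed:
		return "block"
	case s == statusPending:
		return "throttle"
	case isBot:
		return "challenge"
	default:
		return "allow"
	}
}

func main() {
	fmt.Println(decide(true, statusVerified))
	fmt.Println(decide(true, statusPending))
	fmt.Println(decide(true, statusFailed))
	fmt.Println(decide(false, statusUnknown))
}
```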

type Validator

type Validator struct {
	// contains filtered or unexported fields
}

Validator is the core bot verification engine.

func New

func New(opts ...Option) (*Validator, error)

New creates a new Validator instance with background scheduler.

func (*Validator) Close

func (v *Validator) Close() error

Close stops the scheduler.

func (*Validator) Validate

func (v *Validator) Validate(ua, ip string) Result

Validate verifies if the given UserAgent and IP belong to a known bot. By default (classifyUA disabled), unknown UAs return IsBot=false for performance. When WithClassifyUA() is enabled:

  • IsBot: true if UA matches a known bot or is suspicious, false if it's a legitimate browser
  • Verified: Status equals StatusVerified if the IP is verified for the bot
  • Status: StatusVerified (bot confirmed), StatusFailed (bot suspected, IP not verified), StatusPending (RDNS network error, can retry), or StatusUnknown

Directories

Path Synopsis
Package asn provides ASN (Autonomous System Number) verification for bots.
