knownbots

package module

Version: v1.0.7 (latest) | Published: Jan 12, 2026 | License: MIT | Imports: 21 | Imported by: 0


Repository

github.com/cnlangzi/knownbots


README ¶

knownbots

KnownBots is a high-performance Go library for verifying search engine crawlers and identifying legitimate bots. It protects your web services from bot impersonation by validating User-Agent strings and IP addresses through RDNS lookups and IP range verification.

Why KnownBots?

The Problem: Malicious actors can easily spoof User-Agent strings to impersonate legitimate search engine bots (Googlebot, Bingbot, etc.) to bypass rate limits, scrape content, or exploit bot-specific logic.

The Solution: KnownBots never trusts the User-Agent alone. It verifies each bot claim by:

Matching User-Agent markers (case-sensitive word boundaries)
Verifying IP ownership through reverse DNS lookups or official IP ranges
Caching results to avoid expensive DNS queries on subsequent requests

Key Features

🚀 High Performance

Lock-free reads via atomic.Pointer[T] for bot configuration and RDNS cache
Zero-allocation hot paths using netip.Prefix for IP matching
Byte-level indexing for O(1) bot lookup (150-300ns for 40 bots vs 640ns linear scan)
Copy-on-Write caching optimized for read-heavy workloads (1-20 writes/day)
Embedded bots - 57 built-in configs compiled into binary (no file I/O at startup)
Optional UA classification - Disabled by default for maximum performance
Logging control - Disable log output via knownbots.EnableLog = false

🔒 Security First

Case-sensitive matching prevents forgery attempts (official bots use fixed casing)
Word boundary validation prevents partial matches (e.g., "MyGooglebot" won't match)
LRU fail cache for fast rejection of known-bad IPs (1000 entry limit)
Browser detection distinguishes legitimate users from suspicious bot-like patterns (opt-in)

📦 Production Ready

Persistent RDNS cache survives restarts (file-based storage)
Background scheduler automatically refreshes IP ranges from official URLs
Graceful degradation (cache persistence failures don't affect runtime)
Comprehensive tests with benchmarks for 3-40 bot scenarios

🌍 Extensible

YAML-based configuration for easy bot additions (no code changes)
Pluggable verification supports both IP ranges and RDNS verification
Official source integration automatically downloads and updates IP lists

Installation

go get github.com/cnlangzi/knownbots

Requirements: Go 1.21+

Quick Start

Basic Usage

package main

import (
    "fmt"
    "log"

    "github.com/cnlangzi/knownbots"
)

func main() {
    // Initialize validator (starts background scheduler)
    v, err := knownbots.New()
    if err != nil {
        log.Fatal(err)
    }
    defer v.Close()

    // Verify a bot claim
    result := v.Validate(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "66.249.66.1",
    )

    fmt.Printf("Verified: %t\n", result.Status == knownbots.StatusVerified) // true
    fmt.Printf("IsBot: %t\n", result.IsBot)                                 // true
    fmt.Printf("Bot Name: %s\n", result.BotName)                            // "googlebot"
}

HTTP Middleware Example

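import (
    "context"
    "net"
    "net/http"

    "github.com/cnlangzi/knownbots"
)

// botResultKey is an unexported context key type; it avoids collisions with other packages.
type botResultKey struct{}

func BotVerificationMiddleware(v *knownbots.Validator) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ua := r.Header.Get("User-Agent")

            // RemoteAddr has the form "host:port", so strip the port first.
            // In production behind a proxy, extract the client IP from X-Forwarded-For.
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }

            result := v.Validate(ua, ip)

            // Block fake bots (claims to be a bot but the IP is not verified)
            if result.IsBot && result.Status != knownbots.StatusVerified {
                http.Error(w, "Forbidden: Bot verification failed", http.StatusForbidden)
                return
            }

            // Add verification metadata to the request context
            ctx := context.WithValue(r.Context(), botResultKey{}, result)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}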

Configuration Options

v, err := knownbots.New(
    knownbots.WithRoot("./custom-bots"),    // Custom bot config directory
    knownbots.WithFailLimit(5000),          // Failed lookup cache size
    knownbots.WithClassifyUA(),             // Enable UA classification (disabled by default)
)

// Disable logging to reduce console pollution (e.g., in benchmarks)
knownbots.EnableLog = false

Configuration

Directory Structure

bots/
├── conf.d/              # Bot configurations (YAML)
│   ├── googlebot.yaml
│   ├── bingbot.yaml
│   └── ...
├── googlebot/           # Bot-specific data (auto-created)
│   ├── rdns.txt        # Persistent RDNS cache
│   └── ips.txt         # Downloaded IP ranges
└── ...

Bot Configuration (YAML)

name: googlebot
ua: "Googlebot"                           # EXACT casing required (case-sensitive)
urls:                                      # Official IP list URLs (auto-downloaded)
  - "https://www.gstatic.com/ipranges/google.json"
custom:                                    # Static CIDR ranges (always checked)
  - "66.249.64.0/19"
asn:                                       # ASN numbers for verification (optional)
  - 15169
domains:                                   # Verified RDNS domains
  - "googlebot.com"
  - "google.com"
rdns: true                             # Enable RDNS verification (false = IP-only)

Important:

User-Agent markers (ua) are case-sensitive. Official bots use fixed casing (e.g., "Googlebot", never "googlebot"). This prevents forgery attempts where attackers alter casing to bypass detection.
Set rdns: false for bots that only need IP range verification (faster, no DNS queries)
ASN verification is optional and provides faster IP ownership verification (~35ns) compared to RDNS (~450ns) for bots with official ASN registrations
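
The domains list names the domains a reverse-DNS hostname must fall under. A minimal sketch of such a check follows, assuming the match is done on DNS label boundaries. hostMatchesDomain is a hypothetical helper, and the library's own matching may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// hostMatchesDomain reports whether a PTR hostname equals one of the allowed
// domains or is a subdomain of one. Requiring a "." before the domain means
// "evilgooglebot.com" does not pass as "googlebot.com".
func hostMatchesDomain(host string, domains []string) bool {
	// PTR answers usually carry a trailing dot; DNS names are case-insensitive.
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	domains := []string{"googlebot.com", "google.com"}
	fmt.Println(hostMatchesDomain("crawl-66-249-66-1.googlebot.com.", domains)) // true
	fmt.Println(hostMatchesDomain("crawl.evilgooglebot.com", domains))          // false
}
```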

Parser Selection

Choose the correct parser based on the IP list format:

Format	JSON Example	Parser
Google-style	`{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}]}`	`google`
OpenAI-style	`{"prefixes": [{"prefix": "1.2.3.4/24"}]}`	`openai`
Plain text	`1.2.3.4/24` or `172.16.0.5`	`txt`
GitHub-style	`{"hooks": ["1.2.3.4/24"], "web": [...]}`	`github`
Stripe-style	`{"WEBHOOKS": ["3.18.12.63"]}`	`stripe`
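
To make the parser distinction concrete, here is a minimal sketch of decoding the Google-style format with the standard library. parseGoogleStyle and googleList are hypothetical names, and the package's actual google parser may handle errors differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// googleList mirrors only the fields shown in the Google-style row.
type googleList struct {
	Prefixes []struct {
		IPv4 string `json:"ipv4Prefix"`
		IPv6 string `json:"ipv6Prefix"`
	} `json:"prefixes"`
}

// parseGoogleStyle decodes a Google-style IP list into masked prefixes.
// Malformed entries are skipped rather than failing the whole list, which is
// an assumption of this sketch.
func parseGoogleStyle(data []byte) ([]netip.Prefix, error) {
	var list googleList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var out []netip.Prefix
	for _, e := range list.Prefixes {
		for _, s := range []string{e.IPv4, e.IPv6} {
			if s == "" {
				continue
			}
			p, err := netip.ParsePrefix(s)
			if err != nil {
				continue
			}
			out = append(out, p.Masked()) // 1.2.3.4/24 becomes 1.2.3.0/24
		}
	}
	return out, nil
}

func main() {
	data := []byte(`{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}`)
	prefixes, err := parseGoogleStyle(data)
	fmt.Println(prefixes, err) // [1.2.3.0/24 2001:db8::/32] <nil>
}
```

An OpenAI-style list would need a struct with a single prefix field instead, which is why the parser has to match the JSON structure.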

User-Agent Matching Rules

Case-sensitive: Use exact casing from official documentation
- ✅ Correct: ua: "Googlebot" or ua: "bingbot"
- ❌ Wrong: ua: "googlebot" or ua: "BINGBOT"
Match type: Word boundary matching (not substring)
- ua: "Googlebot" matches: Googlebot/2.1, Mozilla/5.0 (compatible; Googlebot/2.1; ...)
- ua: "Googlebot" does NOT match: MyGooglebot, GooglebotPro
Special bots: Some bots don't use Mozilla prefix
- ua: "GPTBot" (OpenAI)
- ua: "curl" (CLI tool)
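
The matching rules above can be sketched as follows. This is an illustration, not the library's matcher: containsWord is a hypothetical helper, and treating only ASCII letters and digits as word characters is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isWordByte reports whether c is an ASCII letter or digit.
func isWordByte(c byte) bool {
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9'
}

// containsWord reports whether marker occurs in ua with a non-word byte (or
// the end of the string) on both sides. The comparison is case-sensitive.
func containsWord(ua, marker string) bool {
	if marker == "" {
		return false
	}
	for from := 0; ; {
		i := strings.Index(ua[from:], marker)
		if i < 0 {
			return false
		}
		start := from + i
		end := start + len(marker)
		leftOK := start == 0 || !isWordByte(ua[start-1])
		rightOK := end == len(ua) || !isWordByte(ua[end])
		if leftOK && rightOK {
			return true
		}
		from = start + 1 // keep scanning for a later occurrence
	}
}

func main() {
	fmt.Println(containsWord("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "Googlebot")) // true
	fmt.Println(containsWord("MyGooglebot", "Googlebot"))   // false
	fmt.Println(containsWord("GooglebotPro", "Googlebot"))  // false
	fmt.Println(containsWord("googlebot/2.1", "Googlebot")) // false
}
```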

How It Works

Verification Flow

┌─────────────────────────────────────────────────────────────┐
│                     Incoming Request                         │
│                  (User-Agent + IP Address)                   │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
         ┌────────────────────┐
         │  UA Matches Bot?   │──No──▶ Classify UA Type
         └────────┬───────────┘        (Browser/Suspicious/Unknown)
                  │ Yes                         │
                  ▼                             ▼
         ┌────────────────────┐        Return: IsBot=false
         │  Check IP Ranges   │        (legitimate browser)
         │  (CIDR matching)   │
         └────────┬───────────┘
                  │
                  ├─ Hit ──▶ Return: verified
                  │
                  ├─ Miss + asn empty ──▶ Check RDNS
                  │
                  ├─ Miss + asn defined ──▶ Check ASN
                  │                              │
                  │                              ├─ Hit ──▶ Return: verified
                  │                              │
                  │                              └─ Miss ──▶ Check RDNS
                  │
                  ▼
         ┌────────────────────┐
         │   Bot.RDNS=true?   │──No──▶ Return: failed
         └────────┬───────────┘        (IP-only bot, no DNS check)
                  │ Yes
                  ▼
         ┌────────────────────┐
         │  Check Fail Cache  │──Hit──▶ Return: failed
         │  (LRU, 1000 IPs)   │        (known fake bot)
         └────────┬───────────┘
                  │ Miss
                  ▼
         ┌────────────────────┐
         │ Check RDNS Cache   │──Hit──▶ Domain match?
         │  (persistent)      │         Yes: verified
         └────────┬───────────┘         No: failed
                  │ Miss
                  ▼
         ┌────────────────────┐
         │ Perform RDNS Lookup│──▶ Domain match?
         │  (50-200ms delay)  │     Yes: verified + cache
         └────────────────────┘     No: failed + fail cache

Background Scheduler (Every 24h)

┌─────────────────────────────────────────────────────────────┐
│                    Background Scheduler                      │
└─────────────────┬───────────────────────────────────────────┘
                  │
        ┌─────────┴─────────┬──────────┐
        │                   │          │
        ▼                   ▼          ▼
  ┌──────────┐      ┌──────────────┐ ┌─────────────┐
  │ Refresh  │      │ Update ASN   │ │ Prune &     │
  │ IP Lists │      │ Data         │ │ Save        │
  │ (HTTP)   │      │ (RIPE API)   │ │ RDNS        │
  └──────────┘      └──────────────┘ │ Cache       │
        │                   │        │ (rdns=true) │
        ▼                   ▼        └─────────────┘
  Update memory      Update cache          │
  Persist to file    Persist to file       ▼
                     (per-bot dir)    Remove invalid
                                        Persist to file

Performance

Benchmarks (40 bots, Intel i5-1038NG7 @ 2.00GHz)

Operation	Time/op	Allocs/op	Notes
UA matching (hit first)	165ns	0	Byte index + word boundary check
UA matching (hit middle)	300ns	0	Worst case: mid-list match
UA matching (miss)	640ns	0	Full scan + browser classification
Validate (IP range hit)	227ns	0	Radix tree CIDR matching
Validate (ASN hit)	35ns	1	O(1) Patricia tree lookup
Validate (RDNS hit)	450ns	0	Cache lookup + domain match
Validate (cold lookup)	50-200ms	1-2	DNS query (first time only)

Key Insight: Verification priority is IP ranges → ASN → RDNS. ASN verification (~35ns) is faster than RDNS cache lookup (~450ns) and ideal for bots with official ASN registrations.

Scalability

Bot Count	Index Benefit	Recommended Index
< 20 bots	Minimal (2x)	Single byte (current)
20-50 bots	Significant (4-5x)	Single byte (current)
> 50 bots	Critical (10x+)	Consider 3-char prefix

Current implementation is optimized for 3-50 bots (covers 99% of use cases).

API Reference

Types

type Validator struct { /* ... */ }

type Result struct {
    BotName string       // Bot name (e.g., "googlebot")
    BotKind BotKind      // Bot category (e.g., KindSearchEngine)
    IsBot   bool         // True if UA matches any bot or looks bot-like
    Status  ResultStatus // StatusVerified | StatusPending | StatusFailed | StatusUnknown
}

type ResultStatus int
const (
    StatusVerified ResultStatus = 1 // IP verified successfully
    StatusPending  ResultStatus = 2 // RDNS network error, can retry
    StatusFailed   ResultStatus = 3 // IP not matched, suspected fake bot
    StatusUnknown  ResultStatus = 0 // Not a bot (normal browser)
)

Methods

// New creates a validator with background scheduler
func New(opts ...Option) (*Validator, error)

// Validate verifies User-Agent and IP address
func (v *Validator) Validate(ua, ip string) Result

// Close stops background scheduler
func (v *Validator) Close() error

Options

// WithRoot sets custom bot directory (default: "./bots")
func WithRoot(dir string) Option

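// WithFailLimit sets failed lookup cache size (default: 1000)
func WithFailLimit(limit int) Option

// WithClassifyUA enables UA classification for non-bot UAs (default: disabled)
func WithClassifyUA() Option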

Real-World Use Cases

1. Rate Limiting

// Apply different rate limits for verified bots vs browsers
result := validator.Validate(ua, ip)
if result.Status == knownbots.StatusVerified {
    limiter = rateLimits.Bot  // Generous: 10/sec
} else if result.IsBot {
    limiter = rateLimits.FakeBot  // Strict: 1/min
} else {
    limiter = rateLimits.Browser  // Normal: 5/sec
}

2. Analytics Exclusion

// Exclude verified bots from user analytics
result := validator.Validate(ua, ip)
if !result.IsBot || result.Status != knownbots.StatusVerified {
    analytics.Track(userID, event)
}

3. SEO Testing

// Allow verified Googlebot to bypass feature flags
result := validator.Validate(ua, ip)
if result.BotName == "googlebot" && result.Status == knownbots.StatusVerified {
    features.EnableAll()  // Show production content for indexing
}

4. Content Protection

// Block fake bots from scraping paywalled content
result := validator.Validate(ua, ip)
if result.IsBot && result.Status != knownbots.StatusVerified {
    return http.StatusForbidden  // Suspected scraper
}

Supported Bots (Built-in Configs)

Built-in configurations include:

Googlebot (Google Search)
Bingbot (Microsoft Bing)
facebookexternalhit (Facebook/Meta link previews)
GPTBot (OpenAI)
Applebot (Apple Search and Siri)
GitHub (GitHub webhooks)
Stripe (Stripe webhooks)
UptimeRobot (Uptime monitoring)

Need more bots? Add YAML configs to bots/conf.d/ - no code changes required!

Common bots to add:

Yandex (YandexBot)
Baidu (Baiduspider)
DuckDuckGo (DuckDuckBot)
Twitter (Twitterbot)
Slack (Slackbot)

See bots/conf.d/googlebot.yaml for configuration examples.

Testing

# Run all tests
go test ./...

# Run only unit tests (skip integration tests)
go test -short ./...

# Run benchmarks
go test -bench=. -benchmem

# Run specific test
go test -v -run '^TestValidator$'

# Coverage report
go test -cover ./...

Integration Tests: The project includes integration tests that verify parsing of real API responses from:

Googlebot: 307 prefixes
Bingbot: 28 prefixes
GPTBot: 21 prefixes
GitHub: 50 prefixes
Stripe: 12 IPs
UptimeRobot: 116 prefixes
Applebot: 12 prefixes

Architecture Decisions

Why atomic.Pointer[T] instead of RWMutex?

Bot configurations change rarely (on reload/schedule, 1-20x/day) but are read on every request (1000s/sec). atomic.Pointer[T] provides:

Lock-free reads - single atomic load, no lock acquisition overhead
Readers never block - writes don't wait for readers, readers don't wait for writes (Copy-on-Write)
Consistent performance - no priority inversion or cache line contention from lock operations

The result is consistent sub-microsecond performance for read-heavy workloads.

Why case-sensitive UA matching?

Official bots use fixed casing ("Googlebot", never "googlebot"). Case variations indicate forgery. Case-sensitive matching:

Rejects fakes at first stage (no expensive DNS queries)
4x faster than case-insensitive (16ns vs 67ns)
Improves both security and performance

Why Copy-on-Write cache?

RDNS cache sees 1-20 new IPs per day but 1000s of reads per second (99.99% read ratio). Copy-on-Write with atomic swap provides:

Zero-allocation reads (no locking)
Safe concurrent access
Simple implementation (vs lock-free data structures)
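
A minimal sketch of the Copy-on-Write pattern with atomic.Pointer[T] follows. cowCache is a hypothetical type, not the library's RDNS cache: readers do a single atomic load, while each write copies the map and swaps the pointer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a read-optimized string map (hypothetical illustration).
type cowCache struct {
	mu   sync.Mutex // serializes writers only; readers never take it
	data atomic.Pointer[map[string]string]
}

func newCowCache() *cowCache {
	c := &cowCache{}
	m := map[string]string{}
	c.data.Store(&m)
	return c
}

// Get reads from the current snapshot without locking.
func (c *cowCache) Get(key string) (string, bool) {
	v, ok := (*c.data.Load())[key]
	return v, ok
}

// Set copies the current snapshot, adds the entry, and publishes the copy.
// Readers holding the old snapshot keep using it safely, because a published
// map is never mutated.
func (c *cowCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.data.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = value
	c.data.Store(&next)
}

func main() {
	c := newCowCache()
	c.Set("66.249.66.1", "crawl-66-249-66-1.googlebot.com")
	fmt.Println(c.Get("66.249.66.1"))
	fmt.Println(c.Get("203.0.113.7"))
}
```

Each write costs a full copy, so the pattern only pays off when writes are rare, as with the 1-20 writes per day cited above.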

Why byte-level index?

Linear bot list scan is fast for 3 bots (52ns) but degrades to 640ns at 40 bots. Single-character index provides 4-5x speedup for 20-50 bots at minimal memory cost (<1KB).
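
One way a first-byte index can work is sketched below, as an assumption about the general technique rather than a description of the library's code. Markers are grouped by their first byte, so at each position of the User-Agent only the markers starting with that byte are compared. The word boundary check is omitted to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// index groups UA markers by their first byte (hypothetical type).
type index struct {
	byFirst [256][]string
}

func buildIndex(markers []string) *index {
	idx := &index{}
	for _, m := range markers {
		if m != "" {
			idx.byFirst[m[0]] = append(idx.byFirst[m[0]], m)
		}
	}
	return idx
}

// find returns the first marker found in ua, or "" if none occurs.
// Positions whose byte starts no marker cost a single array lookup.
func (idx *index) find(ua string) string {
	for i := 0; i < len(ua); i++ {
		for _, m := range idx.byFirst[ua[i]] {
			if strings.HasPrefix(ua[i:], m) {
				return m
			}
		}
	}
	return ""
}

func main() {
	idx := buildIndex([]string{"Googlebot", "bingbot", "GPTBot"})
	fmt.Println(idx.find("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)")) // bingbot
	fmt.Println(idx.find("Mozilla/5.0 Firefox") == "")                                               // true
}
```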

Why Bot method encapsulation?

IP and ASN lifecycle operations (load, refresh, persist) are shared between initialization and the background scheduler. Encapsulating these as Bot methods:

Eliminates duplicate code - initBot and runScheduler both call the same loadCachedIPs, refreshIPs, initializeASN, and refreshASN methods
Centralizes state - IPTree and ASN cache pointers live on the Bot struct, making ownership clear
Improves testability - Each lifecycle method can be unit tested in isolation
Enables future extensions - New verification methods (e.g., BGP feeds) can follow the same pattern

Example Bot methods:

func (b *Bot) loadCachedIPs(path string)   // Load cached prefixes from file
func (b *Bot) refreshIPs(http *http.Client, root string) // Download and persist new prefixes
func (b *Bot) initializeASN(store *asn.Store)  // Load ASN cache with fallback to API
func (b *Bot) refreshASN(store *asn.Store)     // Refresh ASN prefixes from API

Adding New Bots

Adding a new bot requires no code changes - just create a YAML configuration file.

Step 1: Choose Verification Method

Method	When to Use	Example
URL + Parser	Bot has official JSON/TXT IP list	Googlebot, Bingbot, GPTBot
ASN	Bot has official ASN registration	Cloudflare (AS13335), Google (AS15169)
RDNS Only	No official IP list, verify via DNS	Baidu, Yandex

Step 2: Create Configuration File

Create bots/conf.d/newbot.yaml:

# Case 1: Bot with official JSON IP list (RECOMMENDED)
kind: SearchEngine        # Category: SearchEngine, SocialMedia, Tool, etc.
name: newbot              # Unique identifier (used in results)
parser: google            # Parser: google, openai, txt, github, stripe
ua: "NewBot"              # User-Agent fragment (case-sensitive!)
urls:
  - "https://example.com/bot-ips.json"

# Case 2: Bot with ASN verification (fastest option)
kind: SearchEngine
name: newbot
ua: "NewBot"
asn:
  - 12345                # ASN number (fetched from RIPE API)

# Case 3: Bot with RDNS verification only (no official IP list)
kind: SearchEngine
name: newbot
ua: "NewBot"
domains:
  - "newbot.example.com"
rdns: true

Step 3: Configure Parser

Choose the correct parser based on the IP list format:

Google-style (ipv4Prefix/ipv6Prefix fields):

{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}

Parser: google

OpenAI-style (prefix field):

{"prefixes": [{"prefix": "1.2.3.4/24"}]}

Parser: openai

Plain text (one CIDR or individual IP per line):

1.2.3.4/24
5.6.7.8/24
172.16.0.5

Parser: txt (converts individual IPs to /32 or /128 CIDR notation)

GitHub-style (hooks, web, api string arrays):

{"hooks": ["192.30.252.0/22"], "web": ["192.30.252.0/22"], "api": ["192.30.252.0/22"]}

Parser: github

Stripe-style (WEBHOOKS array with individual IPs):

{"WEBHOOKS": ["3.18.12.63", "3.130.192.231", "13.235.14.237"]}

Parser: stripe (converts individual IPs to /32 or /128 CIDR notation)

Step 4: Restart Validator

To apply new bot configurations, restart your application or recreate the Validator:

// Create a new validator with updated bots
v, err := knownbots.New(knownbots.WithRoot("./bots"))
if err != nil {
    log.Fatal(err)
}
defer v.Close()

Step 5: Verify

result := v.Validate(
    "Mozilla/5.0 (compatible; NewBot/1.0; +https://example.com/bot)",
    "1.2.3.4",
)

fmt.Printf("Verified: %t\n", result.Status == knownbots.StatusVerified) // true
fmt.Printf("IsBot: %t\n", result.IsBot)                                 // true

Example Configurations

Googlebot (official JSON, fast verification):

kind: SearchEngine
name: googlebot
parser: google
ua: "Googlebot"
urls:
  - "https://www.gstatic.com/ipranges/google.json"

Bingbot (official JSON):

kind: SearchEngine
name: bingbot
parser: google
ua: "bingbot"
urls:
  - "https://www.bing.com/toolbox/bingbot.json"

GPTBot (OpenAI uses Google-style JSON):

kind: AITraining
name: gptbot
parser: google
ua: "GPTBot"
urls:
  - "https://openai.com/gptbot.json"

Applebot (official JSON from developer.apple.com):

kind: SearchEngine
name: applebot
parser: google
ua: "Applebot"
urls:
  - "https://search.developer.apple.com/applebot.json"

GitHub Webhooks:

kind: Tool
name: github
parser: github
ua: "GitHub-Hookshot"
urls:
  - "https://api.github.com/meta"

Stripe Webhooks:

kind: Tool
name: stripe
parser: stripe
ua: "Stripe"
urls:
  - "https://stripe.com/files/ips/ips_webhooks.json"

UptimeRobot (plain text with individual IPs):

kind: Monitor
name: uptimerobot
parser: txt
ua: "UptimeRobot"
urls:
  - "https://uptimerobot.com/inc/files/ips/IPv4.txt"

Baidu (RDNS only, no official IP list):

kind: SearchEngine
name: baiduspider
ua: "Baiduspider"
domains:
  - "baidu.com"
  - "baidu.jp"
rdns: true

Yandex (RDNS only):

kind: SearchEngine
name: yandexbot
ua: "YandexBot"
domains:
  - "yandex.com"
  - "yandex.ru"
rdns: true

Common Mistakes

Mistake	Problem	Solution
Wrong casing	"googlebot" won't match "Googlebot/2.1"	Use exact casing: "Googlebot"
Wrong parser	JSON not parsed correctly	Match parser to JSON structure
Missing `rdns: true`	RDNS verification not performed	Add `rdns: true` for DNS-based bots
Empty `custom: []`	Unnecessary configuration	Omit empty fields

Testing New Bot Config

# Run tests to verify bot parsing
go test -v ./...

# Run specific parser test
go test -v -run TestGoogleParser ./parser/

# Validate IP list format
curl -s https://example.com/bot-ips.json | jq '.prefixes[0]'

Contributing

Contributions are welcome! Whether you want to add new bots, fix bugs, or improve documentation.

Ways to Contribute

Add new bot configurations - Most contributions are just YAML files in bots/conf.d/
Fix parser issues - Handle new or different IP list formats
Improve documentation - Fix typos, clarify instructions, add examples
Report bugs - Open issues with minimal reproduction steps
Suggest features - Open discussions about new functionality

Submitting Pull Requests

Fork the repository on GitHub
Create a feature branch: git checkout -b add-newbot
Add your bot configuration to bots/conf.d/newbot.yaml

Test your changes:

go test -short ./...
go test -v -run TestNewBot ./parser/

Commit using Google Git convention:

git commit -m "feat: add NewBot configuration

- Add NewBot YAML configuration
- Verify User-Agent matching
- Test IP parsing with official API

PiperOrigin-RevId: XXXXXXXX
Change-Id: IXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Push and create a Pull Request

Bot Configuration Guidelines

When adding a new bot configuration:

Verify the User-Agent from official documentation
- Use exact casing (e.g., "Googlebot", not "googlebot")
- Check for word boundary matching requirements
Find the official IP list URL
- Most major bots publish JSON/TXT IP lists
- Prefer official sources over third-party aggregators
Choose the correct parser
- Match the parser to the actual JSON structure
- Test with real API response before submitting
Test thoroughly
- Run go test -short ./... to verify no regressions
- Check integration tests pass for new bot if applicable

Code Style

Follow standard Go conventions
Run go fmt ./... before committing
Run go vet ./... to catch potential issues
Add tests for new functionality

License

MIT License

Author

Dayi Chen - GitHub

Acknowledgments

Inspired by Google's official bot verification documentation
Performance patterns influenced by Go stdlib's sync/atomic and net/netip designs
Special thanks to all contributors and users providing feedback

⭐ Star this project if you find it useful!

📝 Questions? Open an issue or start a discussion!

🐛 Found a bug? Please report it with minimal reproduction steps!

Documentation ¶

Overview ¶

Package knownbots provides bot verification through UserAgent and IP validation.

Index ¶

Constants
Variables
func Distinct(prefixes []netip.Prefix) []netip.Prefix
func FilterInvalidPrefixes(prefixes []netip.Prefix) []netip.Prefix
type ASN
- func NewASN() *ASN
- func (c *ASN) ASNs() []int
- func (c *ASN) Add(asn int, prefixes []netip.Prefix)
- func (c *ASN) Contains(ip netip.Addr) bool
- func (c *ASN) Count() int
type Bot
- func Load(dir string) ([]*Bot, error)
- func (b *Bot) ContainsIP(ipStr string) bool
- func (b *Bot) VerifyRDNS(ipStr string) ResultStatus
type BotKind
type BrowserKind
type Config
type IPPrefix
type IPTree
- func NewIPTree() *IPTree
- func (t *IPTree) Add(prefix netip.Prefix)
- func (t *IPTree) Contains(ip netip.Addr) bool
- func (t *IPTree) Count() int
type LRU
- func NewLRU(limit int) *LRU
- func (l *LRU) Add(key string)
- func (l *LRU) Contains(key string) bool
type Option
- func WithClassifyUA() Option
- func WithFailLimit(limit int) Option
- func WithRoot(dir string) Option
type RDNS
- func NewRDNS(filePath string) (*RDNS, error)
- func (c *RDNS) Close() error
- func (c *RDNS) Get(key string) (string, bool)
- func (c *RDNS) Persist() error
- func (c *RDNS) Prune(domains []string)
- func (c *RDNS) Set(key, value string)
- func (c *RDNS) Size() int
type Result
type ResultStatus
type Validator
- func New(opts ...Option) (*Validator, error)
- func (v *Validator) Close() error
- func (v *Validator) Validate(ua, ip string) Result

Constants ¶


const (
	FailLRULimit      = 1000
	SchedulerInterval = 24 * time.Hour
)

Default settings

Variables ¶


var EnableLog = true

EnableLog controls whether knownbots outputs log messages. Set to false to disable logging and reduce console pollution. Default: true (logging enabled)

Functions ¶

func Distinct ¶ added in v1.0.7

func Distinct(prefixes []netip.Prefix) []netip.Prefix

Distinct removes duplicate prefixes from a slice.

func FilterInvalidPrefixes ¶ added in v1.0.7

func FilterInvalidPrefixes(prefixes []netip.Prefix) []netip.Prefix

FilterInvalidPrefixes removes invalid prefixes from a slice.

Types ¶

type ASN ¶ added in v1.0.7

type ASN struct {
	// contains filtered or unexported fields
}

ASN provides fast IP lookup for ASN prefix matching using IPTree. Immutable after creation - use atomic.Pointer[ASN] at call site for thread-safety.

func NewASN ¶ added in v1.0.7

func NewASN() *ASN

NewASN creates a new ASN cache with an empty tree.

func (*ASN) ASNs ¶ added in v1.0.7

func (c *ASN) ASNs() []int

ASNs returns the list of ASN numbers in this cache.

func (*ASN) Add ¶ added in v1.0.7

func (c *ASN) Add(asn int, prefixes []netip.Prefix)

Add adds all prefixes from an ASN to the cache.

func (*ASN) Contains ¶ added in v1.0.7

func (c *ASN) Contains(ip netip.Addr) bool

Contains checks if an IP exists in any of the loaded prefixes.

func (*ASN) Count ¶ added in v1.0.7

func (c *ASN) Count() int

Count returns the total number of prefixes in the cache.

type Bot ¶

type Bot struct {
	Name   string   `yaml:"name"`
	Kind   BotKind  `yaml:"kind"`
	Parser string   `yaml:"parser"`
	UA     string   `yaml:"ua"`
	URLs   []string `yaml:"urls"`

	ASN     []int    `yaml:"asn"`
	Domains []string `yaml:"domains"`
	RDNS    bool     `yaml:"rdns"`
	// contains filtered or unexported fields
}

func Load ¶

func Load(dir string) ([]*Bot, error)

func (*Bot) ContainsIP ¶

func (b *Bot) ContainsIP(ipStr string) bool

func (*Bot) VerifyRDNS ¶

func (b *Bot) VerifyRDNS(ipStr string) ResultStatus

type BotKind ¶

type BotKind string

const (
	KindSearchEngine BotKind = "SearchEngine"
	KindSocialMedia  BotKind = "SocialMedia"
	KindAITraining   BotKind = "AITraining"
	KindAIAssist     BotKind = "AIAssist"
	KindAIMixed      BotKind = "AIMixed"
	KindSEO          BotKind = "SEO"
	KindMonitor      BotKind = "Monitor"
	KindSecurity     BotKind = "Security"
	KindScraper      BotKind = "Scraper"
	KindUnknown      BotKind = "Unknown"
)

type BrowserKind ¶

type BrowserKind string

BrowserKind represents the classification of a UserAgent structure.

const (
	Browser    BrowserKind = "browser"
	Suspicious BrowserKind = "suspicious"
	Unknown    BrowserKind = "unknown"
)

type Config ¶

type Config struct {
	Root       string
	FailLimit  int
	ClassifyUA bool
}

Config holds the options for creating a Validator.

type IPPrefix ¶

type IPPrefix = netip.Prefix

type IPTree ¶ added in v1.0.7

type IPTree struct {
	// contains filtered or unexported fields
}

func NewIPTree ¶ added in v1.0.7

func NewIPTree() *IPTree

func (*IPTree) Add ¶ added in v1.0.7

func (t *IPTree) Add(prefix netip.Prefix)

func (*IPTree) Contains ¶ added in v1.0.7

func (t *IPTree) Contains(ip netip.Addr) bool

func (*IPTree) Count ¶ added in v1.0.7

func (t *IPTree) Count() int

type LRU ¶

type LRU struct {
	// contains filtered or unexported fields
}

LRU is an LRU cache for failed RDNS lookups. It provides fast rejection of IP addresses that have already been verified as invalid bot sources.

func NewLRU ¶

func NewLRU(limit int) *LRU

NewLRU creates a new LRU cache with the specified limit.

func (*LRU) Add ¶

func (l *LRU) Add(key string)

Add inserts a key into the cache. If the key already exists, it moves the entry to the front. If the cache is full, the least recently used entry is evicted.

func (*LRU) Contains ¶

func (l *LRU) Contains(key string) bool

Contains returns true if the key exists in the cache. Uses read lock for better concurrency.
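
The behavior described for Add and Contains can be sketched with container/list and a read-write mutex. failLRU is a hypothetical type written from the doc comments above, not the library's source.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// failLRU is a bounded set of keys with least-recently-added eviction.
type failLRU struct {
	mu    sync.RWMutex
	limit int
	order *list.List // front = most recently added
	items map[string]*list.Element
}

func newFailLRU(limit int) *failLRU {
	return &failLRU{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// Add inserts key, or moves it to the front if it is already present.
// When the cache grows past its limit, the entry at the back is evicted.
func (l *failLRU) Add(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if el, ok := l.items[key]; ok {
		l.order.MoveToFront(el)
		return
	}
	l.items[key] = l.order.PushFront(key)
	if l.order.Len() > l.limit {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(string))
	}
}

// Contains only takes the read lock, so it does not refresh recency.
func (l *failLRU) Contains(key string) bool {
	l.mu.RLock()
	defer l.mu.RUnlock()
	_, ok := l.items[key]
	return ok
}

func main() {
	l := newFailLRU(2)
	l.Add("203.0.113.1")
	l.Add("203.0.113.2")
	l.Add("203.0.113.1") // refreshed, so .2 is now the oldest
	l.Add("203.0.113.3") // evicts 203.0.113.2
	fmt.Println(l.Contains("203.0.113.1"), l.Contains("203.0.113.2"), l.Contains("203.0.113.3"))
}
```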

type Option ¶

type Option func(*Config)

Option is a functional option for configuring a Validator.

func WithClassifyUA ¶ added in v1.0.2

func WithClassifyUA() Option

WithClassifyUA enables UA classification for non-bot UAs. By default, classifyUA is disabled for performance. Enable it to distinguish legitimate browsers from suspicious UAs.

func WithFailLimit ¶

func WithFailLimit(limit int) Option

WithFailLimit sets the limit for failed lookup cache.

func WithRoot ¶

func WithRoot(dir string) Option

WithRoot sets the bots root directory (containing conf.d and data subdirs).

type RDNS ¶ added in v1.0.7

type RDNS struct {
	// contains filtered or unexported fields
}

RDNS provides thread-safe reverse DNS lookup caching. It stores successfully verified IP→hostname mappings (persistent).

func NewRDNS ¶ added in v1.0.7

func NewRDNS(filePath string) (*RDNS, error)

NewRDNS creates a new RDNS cache instance.

func (*RDNS) Close ¶ added in v1.0.7

func (c *RDNS) Close() error

Close is a no-op. Cache persistence is handled by Persist().

func (*RDNS) Get ¶ added in v1.0.7

func (c *RDNS) Get(key string) (string, bool)

Get retrieves a value from the cache.

func (*RDNS) Persist ¶ added in v1.0.7

func (c *RDNS) Persist() error

Persist writes all entries to the persistent file.

func (*RDNS) Prune ¶ added in v1.0.7

func (c *RDNS) Prune(domains []string)

Prune removes entries from the cache that are no longer valid.

func (*RDNS) Set ¶ added in v1.0.7

func (c *RDNS) Set(key, value string)

Set stores a successful lookup result in the cache.

func (*RDNS) Size ¶ added in v1.0.7

func (c *RDNS) Size() int

Size returns the number of entries in the cache.

type Result ¶

type Result struct {
	BotName string       `json:"bot_name"`
	BotKind BotKind      `json:"bot_kind"`
	IsBot   bool         `json:"is_bot"`
	Status  ResultStatus `json:"status"`
}

Result represents the verification result.

type ResultStatus ¶

type ResultStatus int

ResultStatus represents the verification result status.

const (
	StatusVerified ResultStatus = 1 // IP verified successfully
	StatusPending  ResultStatus = 2 // RDNS network error, can retry
	StatusFailed   ResultStatus = 3 // IP not matched, suspected fake bot
	StatusUnknown  ResultStatus = 0 // Not a bot (normal browser)
)
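
Because ResultStatus is an integer and the index lists no String method for it, callers may want a label for logs. The sketch below uses a local mirror of the documented constants so that it compiles on its own. The label function and its wording are hypothetical.

```go
package main

import "fmt"

// ResultStatus and its constants are copied from the documentation above
// so that this example is self-contained.
type ResultStatus int

const (
	StatusUnknown  ResultStatus = 0 // Not a bot (normal browser)
	StatusVerified ResultStatus = 1 // IP verified successfully
	StatusPending  ResultStatus = 2 // RDNS network error, can retry
	StatusFailed   ResultStatus = 3 // IP not matched, suspected fake bot
)

// label maps a status to a short log-friendly string (hypothetical helper).
func label(s ResultStatus) string {
	switch s {
	case StatusVerified:
		return "verified"
	case StatusPending:
		return "pending"
	case StatusFailed:
		return "failed"
	default:
		return "unknown"
	}
}

func main() {
	for _, s := range []ResultStatus{StatusVerified, StatusPending, StatusFailed, StatusUnknown} {
		fmt.Printf("%d=%s\n", s, label(s))
	}
}
```

Treating StatusPending separately from StatusFailed lets a caller retry later instead of blocking a request after a transient DNS error.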

type Validator ¶

type Validator struct {
	// contains filtered or unexported fields
}

Validator is the core bot verification engine.

func New ¶

func New(opts ...Option) (*Validator, error)

New creates a new Validator instance with background scheduler.

func (*Validator) Close ¶

func (v *Validator) Close() error

Close stops the scheduler.

func (*Validator) Validate ¶

func (v *Validator) Validate(ua, ip string) Result

Validate verifies if the given UserAgent and IP belong to a known bot. By default (classifyUA disabled), unknown UAs return IsBot=false for performance. When WithClassifyUA() is enabled, the Result fields read as follows:

IsBot: true if UA matches a known bot or is suspicious, false if it's a legitimate browser
Status: StatusVerified (bot confirmed), StatusPending (RDNS network error, can retry), StatusFailed (bot suspected, IP not verified), or StatusUnknown (not a bot)


Directories ¶

Path	Synopsis
asn	Package asn provides ASN (Autonomous System Number) verification for bots.
parser
