knownbots

KnownBots is a high-performance Go library for verifying search engine crawlers and identifying legitimate bots. It protects your web services from bot impersonation by validating User-Agent strings and IP addresses through RDNS lookups and IP range verification.
Why KnownBots?
The Problem: Malicious actors can easily spoof User-Agent strings to impersonate legitimate search engine bots (Googlebot, Bingbot, etc.) to bypass rate limits, scrape content, or exploit bot-specific logic.
The Solution: KnownBots verifies both the claimed identity and the source address by:
- Matching User-Agent markers (case-sensitive word boundaries)
- Verifying IP ownership through reverse DNS lookups or official IP ranges
- Caching results to avoid expensive DNS queries on subsequent requests
Key Features
- Lock-free reads via atomic.Pointer[T] for bot configuration and RDNS cache
- Zero-allocation hot paths using netip.Prefix for IP matching
- Byte-level indexing for O(1) bot lookup (150-300ns for 40 bots vs 640ns linear scan)
- Copy-on-Write caching optimized for read-heavy workloads (1-20 writes/day)
- Embedded bots - 57 built-in configs compiled into binary (no file I/O at startup)
- Optional UA classification - Disabled by default for maximum performance
- Logging control - Disable log output via knownbots.EnableLog = false
🔒 Security First
- Case-sensitive matching prevents forgery attempts (official bots use fixed casing)
- Word boundary validation prevents partial matches (e.g., "MyGooglebot" won't match)
- LRU fail cache for fast rejection of known-bad IPs (1000 entry limit)
- Browser detection distinguishes legitimate users from suspicious bot-like patterns (opt-in)
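One way to picture the fail cache is as a small bounded LRU set of IP strings. The sketch below uses only the standard library; `failCache`, `newFailCache`, `Add`, and `Contains` are hypothetical names chosen for illustration, and the library's own implementation may be structured differently.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// failCache is a hypothetical bounded LRU set of IPs that failed verification.
type failCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // ip -> node in order
}

func newFailCache(limit int) *failCache {
	return &failCache{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// Add records a failed IP and evicts the least recently used entry when full.
func (c *failCache) Add(ip string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[ip]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[ip] = c.order.PushFront(ip)
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
}

// Contains reports whether ip is cached and marks it as recently used.
func (c *failCache) Contains(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[ip]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

func main() {
	c := newFailCache(2)
	c.Add("203.0.113.1")
	c.Add("203.0.113.2")
	c.Contains("203.0.113.1") // touch .1 so .2 becomes the eviction candidate
	c.Add("203.0.113.3")      // evicts .2
	fmt.Println(c.Contains("203.0.113.1"), c.Contains("203.0.113.2"), c.Contains("203.0.113.3"))
	// prints: true false true
}
```

A mutex is acceptable here because the fail cache is only consulted after the UA and IP range checks have already missed.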
📦 Production Ready
- Persistent RDNS cache survives restarts (file-based storage)
- Background scheduler automatically refreshes IP ranges from official URLs
- Graceful degradation (cache persistence failures don't affect runtime)
- Comprehensive tests with benchmarks for 3-40 bot scenarios
🌍 Extensible
- YAML-based configuration for easy bot additions (no code changes)
- Pluggable verification supports both IP ranges and RDNS verification
- Official source integration automatically downloads and updates IP lists
Installation
go get github.com/cnlangzi/knownbots
Requirements: Go 1.21+
Quick Start
Basic Usage
package main

import (
	"fmt"
	"log"

	"github.com/cnlangzi/knownbots"
)

func main() {
	// Initialize validator (starts background scheduler)
	v, err := knownbots.New()
	if err != nil {
		log.Fatal(err)
	}
	defer v.Close()

	// Verify a bot claim
	result := v.Validate(
		"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
		"66.249.66.1",
	)
	fmt.Printf("Status: %s\n", result.Status)         // "verified"
	fmt.Printf("IsBot: %t\n", result.IsBot)           // true
	fmt.Printf("IsVerified: %t\n", result.IsVerified) // true
	fmt.Printf("Bot Name: %s\n", result.Name)         // "googlebot"
}
HTTP Middleware Example
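import (
	"context"
	"net"
	"net/http"

	"github.com/cnlangzi/knownbots"
)

// botResultKey is an unexported context key type, which avoids collisions with other packages.
type botResultKey struct{}

func BotVerificationMiddleware(v *knownbots.Validator) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ua := r.Header.Get("User-Agent")
			// RemoteAddr is "IP:port"; strip the port before validating.
			// Behind a trusted proxy, derive the IP from X-Forwarded-For instead.
			ip, _, err := net.SplitHostPort(r.RemoteAddr)
			if err != nil {
				ip = r.RemoteAddr
			}
			result := v.Validate(ua, ip)
			// Block fake bots (claims to be a bot but the IP is not verified)
			if result.IsBot && !result.IsVerified {
				http.Error(w, "Forbidden: Bot verification failed", http.StatusForbidden)
				return
			}
			// Add verification metadata to the request context
			ctx := context.WithValue(r.Context(), botResultKey{}, result)
			next.ServeHTTP(w, r.WithContext(ctx))
		})
	}
}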
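Extracting the client IP behind a proxy deserves care, because a spoofed X-Forwarded-For value could make a fake bot appear to come from a verified range. The following is a minimal sketch, assuming exactly one trusted reverse proxy that appends the peer address to the header; `pickClientIP` and `clientIP` are hypothetical helpers, not part of KnownBots.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/netip"
	"strings"
)

// pickClientIP is a hypothetical helper. When trustProxy is true it returns the
// right-most valid X-Forwarded-For entry (the one appended by your own proxy);
// otherwise it returns the host part of remoteAddr.
func pickClientIP(remoteAddr, xff string, trustProxy bool) string {
	if trustProxy {
		parts := strings.Split(xff, ",")
		last := strings.TrimSpace(parts[len(parts)-1])
		if addr, err := netip.ParseAddr(last); err == nil {
			return addr.String()
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // already a bare address, or malformed
	}
	return host
}

// clientIP applies pickClientIP to an incoming request.
func clientIP(r *http.Request, trustProxy bool) string {
	return pickClientIP(r.RemoteAddr, r.Header.Get("X-Forwarded-For"), trustProxy)
}

func main() {
	r, _ := http.NewRequest(http.MethodGet, "http://example.com/", nil)
	r.RemoteAddr = "10.0.0.7:51234"
	// The left entry could be client-supplied; the right entry was appended by the proxy.
	r.Header.Set("X-Forwarded-For", "192.0.2.1, 66.249.66.1")
	fmt.Println(clientIP(r, true))  // 66.249.66.1
	fmt.Println(clientIP(r, false)) // 10.0.0.7
}
```

With more than one proxy hop, the number of trusted entries to skip from the right changes, so the rule should match your actual deployment.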
Configuration Options
v, err := knownbots.New(
	knownbots.WithRoot("./custom-bots"), // Custom bot config directory
	knownbots.WithFailLimit(5000),       // Failed lookup cache size
	knownbots.WithClassifyUA(),          // Enable UA classification (disabled by default)
)
// Disable logging to reduce console pollution (e.g., in benchmarks)
knownbots.EnableLog = false
Configuration
Directory Structure
bots/
├── conf.d/ # Bot configurations (YAML)
│ ├── googlebot.yaml
│ ├── bingbot.yaml
│ └── ...
├── googlebot/ # Bot-specific data (auto-created)
│ ├── rdns.txt # Persistent RDNS cache
│ └── ips.txt # Downloaded IP ranges
└── ...
Bot Configuration (YAML)
name: googlebot
ua: "Googlebot" # EXACT casing required (case-sensitive)
urls: # Official IP list URLs (auto-downloaded)
- "https://www.gstatic.com/ipranges/google.json"
custom: # Static CIDR ranges (always checked)
- "66.249.64.0/19"
asn: # ASN numbers for verification (optional)
- 15169
domains: # Verified RDNS domains
- "googlebot.com"
- "google.com"
rdns: true # Enable RDNS verification (false = IP-only)
Important:
- User-Agent markers (ua) are case-sensitive. Official bots use fixed casing (e.g., "Googlebot", never "googlebot"). This prevents forgery attempts where attackers alter casing to bypass detection.
- Set rdns: false for bots that only need IP range verification (faster, no DNS queries)
- ASN verification is optional and provides faster IP ownership verification (~35ns) compared to RDNS (~450ns) for bots with official ASN registrations
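To make the IP range step concrete, here is a minimal sketch of checking an address against static CIDR ranges such as the `custom` entry above, using `net/netip`. It scans the list linearly for clarity, which is not how a tree-based lookup would work; `inRanges` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// inRanges reports whether ip falls inside any of the CIDR ranges.
// Invalid input is treated as "no match" rather than an error.
func inRanges(ip string, cidrs []string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			continue // skip malformed ranges
		}
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	custom := []string{"66.249.64.0/19"}
	fmt.Println(inRanges("66.249.66.1", custom)) // true
	fmt.Println(inRanges("203.0.113.9", custom)) // false
}
```

In a real hot path the prefixes would be parsed once at load time rather than on every call.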
Parser Selection
Choose the correct parser based on the IP list format:
| Format | JSON Example | Parser |
| Google-style | {"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}]} | google |
| OpenAI-style | {"prefixes": [{"prefix": "1.2.3.4/24"}]} | openai |
| Plain text | 1.2.3.4/24 or 172.16.0.5 | txt |
| GitHub-style | {"hooks": ["1.2.3.4/24"], "web": [...]} | github |
| Stripe-style | {"WEBHOOKS": ["3.18.12.63"]} | stripe |
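As an illustration of what a parser has to do, the sketch below decodes the Google-style layout with `encoding/json` and normalizes each entry to a `netip.Prefix`. The names `googleList` and `parseGoogle` are hypothetical, and skipping unparsable entries is an assumption about desirable behavior rather than a description of the built-in parser.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// googleList mirrors the Google-style layout: a "prefixes" array whose
// entries carry either an ipv4Prefix or an ipv6Prefix field.
type googleList struct {
	Prefixes []struct {
		IPv4 string `json:"ipv4Prefix"`
		IPv6 string `json:"ipv6Prefix"`
	} `json:"prefixes"`
}

// parseGoogle is a hypothetical parser; entries that fail to parse are skipped.
func parseGoogle(data []byte) ([]netip.Prefix, error) {
	var doc googleList
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	var out []netip.Prefix
	for _, e := range doc.Prefixes {
		for _, s := range []string{e.IPv4, e.IPv6} {
			if s == "" {
				continue
			}
			if p, err := netip.ParsePrefix(s); err == nil {
				out = append(out, p.Masked()) // zero the host bits
			}
		}
	}
	return out, nil
}

func main() {
	data := []byte(`{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}`)
	prefixes, err := parseGoogle(data)
	if err != nil {
		panic(err)
	}
	for _, p := range prefixes {
		fmt.Println(p)
	}
	// prints:
	// 1.2.3.0/24
	// 2001:db8::/32
}
```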
User-Agent Matching Rules
- Case-sensitive: Use exact casing from official documentation
  - ✅ Correct: ua: "Googlebot" or ua: "bingbot"
  - ❌ Wrong: ua: "googlebot" or ua: "BINGBOT"
- Match type: Word boundary matching (not substring)
  - ua: "Googlebot" matches: Googlebot/2.1, Mozilla/5.0 (compatible; Googlebot/2.1; ...)
  - ua: "Googlebot" does NOT match: MyGooglebot, GooglebotPro
- Special bots: Some bots don't use Mozilla prefix
  - ua: "GPTBot" (OpenAI)
  - ua: "curl" (CLI tool)
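The matching rules above can be sketched in a few lines. This is a simplified illustration, assuming that a "word" consists of ASCII letters and digits; `matchMarker` and `isWordByte` are hypothetical names, and the library's exact boundary definition may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isWordByte reports whether b is an ASCII letter or digit.
func isWordByte(b byte) bool {
	return b >= 'a' && b <= 'z' || b >= 'A' && b <= 'Z' || b >= '0' && b <= '9'
}

// matchMarker reports whether marker occurs in ua as a whole word.
// The comparison is case-sensitive.
func matchMarker(ua, marker string) bool {
	if marker == "" {
		return false
	}
	for from := 0; ; {
		i := strings.Index(ua[from:], marker)
		if i < 0 {
			return false
		}
		start := from + i
		end := start + len(marker)
		leftOK := start == 0 || !isWordByte(ua[start-1])
		rightOK := end == len(ua) || !isWordByte(ua[end])
		if leftOK && rightOK {
			return true
		}
		from = start + 1 // keep searching after a partial match
	}
}

func main() {
	for _, ua := range []string{
		"Googlebot/2.1",
		"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
		"MyGooglebot",
		"GooglebotPro",
		"googlebot/2.1",
	} {
		fmt.Printf("%-5t %s\n", matchMarker(ua, "Googlebot"), ua)
	}
}
```

The first two User-Agents match; the last three are rejected for a left boundary, right boundary, and casing mismatch respectively.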
How It Works
Verification Flow
┌─────────────────────────────────────────────────────────────┐
│ Incoming Request │
│ (User-Agent + IP Address) │
└─────────────────┬───────────────────────────────────────────┘
│
▼
┌────────────────────┐
│ UA Matches Bot? │──No──▶ Classify UA Type
└────────┬───────────┘ (Browser/Suspicious/Unknown)
│ Yes │
▼ ▼
┌────────────────────┐ Return: IsBot=false
│ Check IP Ranges │ (legitimate browser)
│ (CIDR matching) │
└────────┬───────────┘
│
├─ Hit ──▶ Return: verified
│
├─ Miss + asn empty ──▶ Check RDNS
│
├─ Miss + asn defined ──▶ Check ASN
│ │
│ ├─ Hit ──▶ Return: verified
│ │
│ └─ Miss ──▶ Check RDNS
│
▼
┌────────────────────┐
│ Bot.RDNS=true? │──No──▶ Return: failed
└────────┬───────────┘ (IP-only bot, no DNS check)
│ Yes
▼
┌────────────────────┐
│ Check Fail Cache │──Hit──▶ Return: failed
│ (LRU, 1000 IPs) │ (known fake bot)
└────────┬───────────┘
│ Miss
▼
┌────────────────────┐
│ Check RDNS Cache │──Hit──▶ Domain match?
│ (persistent) │ Yes: verified
└────────┬───────────┘ No: failed
│ Miss
▼
┌────────────────────┐
│ Perform RDNS Lookup│──▶ Domain match?
│ (50-200ms delay) │ Yes: verified + cache
└────────────────────┘ No: failed + fail cache
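The final RDNS step of the flow can be sketched as a reverse lookup followed by a domain-suffix check. This is an illustrative sketch, not the library's code: `lookupFunc`, `hostInDomains`, and `verifyRDNS` are hypothetical names, the resolver is injected so the example runs offline, and it does not include the forward-confirmation lookup that Google's verification guidance also recommends.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupFunc resolves an IP to its PTR hostnames; net.LookupAddr has this signature.
type lookupFunc func(ip string) ([]string, error)

// hostInDomains reports whether host equals a domain or is a subdomain of one.
func hostInDomains(host string, domains []string) bool {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

// verifyRDNS performs the reverse lookup and checks each returned name.
func verifyRDNS(lookup lookupFunc, ip string, domains []string) bool {
	names, err := lookup(ip)
	if err != nil {
		return false
	}
	for _, n := range names {
		if hostInDomains(n, domains) {
			return true
		}
	}
	return false
}

func main() {
	// A stub resolver keeps the example offline; pass net.LookupAddr for real DNS.
	stub := func(ip string) ([]string, error) {
		return []string{"crawl-66-249-66-1.googlebot.com."}, nil
	}
	domains := []string{"googlebot.com", "google.com"}
	fmt.Println(verifyRDNS(stub, "66.249.66.1", domains))     // true
	fmt.Println(hostInDomains("fakegooglebot.com.", domains)) // false
}
```

Matching on ".domain" rather than a plain suffix is what keeps look-alike hosts such as fakegooglebot.com from passing.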
Background Scheduler (Every 24h)
┌─────────────────────────────────────────────────────────────┐
│ Background Scheduler │
└─────────────────┬───────────────────────────────────────────┘
│
┌─────────┴─────────┬──────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────┐
│ Refresh │ │ Update ASN │ │ Prune & │
│ IP Lists │ │ Data │ │ Save │
│ (HTTP) │ │ (RIPE API) │ │ RDNS │
└──────────┘ └──────────────┘ │ Cache │
│ │ │ (rdns=true) │
▼ ▼ └──────────┘
Update memory Update cache │
Persist to file Persist to file ▼
(per-bot dir) Remove invalid
Persist to file
Benchmarks (40 bots, Intel i5-1038NG7 @ 2.00GHz)
| Operation | Time/op | Allocs/op | Notes |
| UA matching (hit first) | 165ns | 0 | Byte index + word boundary check |
| UA matching (hit middle) | 300ns | 0 | Worst case: mid-list match |
| UA matching (miss) | 640ns | 0 | Full scan + browser classification |
| Validate (IP range hit) | 227ns | 0 | Radix tree CIDR matching |
| Validate (ASN hit) | 35ns | 1 | O(1) Patricia tree lookup |
| Validate (RDNS hit) | 450ns | 0 | Cache lookup + domain match |
| Validate (cold lookup) | 50-200ms | 1-2 | DNS query (first time only) |
Key Insight: Verification priority is IP ranges → ASN → RDNS. ASN verification (~35ns) is faster than RDNS cache lookup (~450ns) and ideal for bots with official ASN registrations.
Scalability
| Bot Count | Index Benefit | Recommended Index |
| < 20 bots | Minimal (2x) | Single byte (current) |
| 20-50 bots | Significant (4-5x) | Single byte (current) |
| > 50 bots | Critical (10x+) | Consider 3-char prefix |
Current implementation is optimized for 3-50 bots (covers 99% of use cases).
API Reference
Types
type Validator struct { /* ... */ }

type Result struct {
	Name       string       // Bot name (e.g., "googlebot")
	Status     ResultStatus // "verified" | "failed" | "unknown"
	IsBot      bool         // True if UA matches any bot or looks bot-like
	IsVerified bool         // True if IP ownership verified
}

type ResultStatus string

const (
	StatusVerified ResultStatus = "verified" // Bot confirmed (UA + IP match)
	StatusFailed   ResultStatus = "failed"   // Bot suspected but IP invalid
	StatusUnknown  ResultStatus = "unknown"  // Not a known bot
)
Methods
// New creates a validator with background scheduler
func New(opts ...Option) (*Validator, error)
// Validate verifies User-Agent and IP address
func (v *Validator) Validate(ua, ip string) Result
// Close stops background scheduler
func (v *Validator) Close() error
Options
// WithRoot sets custom bot directory (default: "./bots")
func WithRoot(dir string) Option
// WithFailLimit sets failed lookup cache size (default: 1000)
func WithFailLimit(limit int) Option
Real-World Use Cases
1. Rate Limiting
// Apply different rate limits for verified bots vs browsers
result := validator.Validate(ua, ip)
if result.IsVerified {
	limiter = rateLimits.Bot // Generous: 10/sec
} else if result.IsBot {
	limiter = rateLimits.FakeBot // Strict: 1/min
} else {
	limiter = rateLimits.Browser // Normal: 5/sec
}
2. Analytics Exclusion
// Exclude verified bots from user analytics
result := validator.Validate(ua, ip)
if !result.IsBot || !result.IsVerified {
	analytics.Track(userID, event)
}
3. SEO Testing
// Allow verified Googlebot to bypass feature flags
result := validator.Validate(ua, ip)
if result.Name == "googlebot" && result.IsVerified {
	features.EnableAll() // Show production content for indexing
}
4. Content Protection
// Block fake bots from scraping paywalled content
result := validator.Validate(ua, ip)
if result.IsBot && !result.IsVerified {
	return http.StatusForbidden // Suspected scraper
}
Supported Bots (Built-in Configs)
Current built-in configurations:
- Googlebot (Google Search)
- Bingbot (Microsoft Bing)
- facebookexternalhit (Facebook/Meta link previews)
- GPTBot (OpenAI)
- Applebot (Apple Search and Siri)
- GitHub (GitHub webhooks)
- Stripe (Stripe webhooks)
- UptimeRobot (Uptime monitoring)
Need more bots? Add YAML configs to bots/conf.d/ - no code changes required!
Common bots to add:
- Yandex (YandexBot)
- Baidu (Baiduspider)
- DuckDuckGo (DuckDuckBot)
- Twitter (Twitterbot)
- Slack (Slackbot)
See bots/conf.d/googlebot.yaml for configuration examples.
Testing
# Run all tests
go test ./...
# Run only unit tests (skip integration tests)
go test -short ./...
# Run benchmarks
go test -bench=. -benchmem
# Run specific test
go test -v -run ^TestValidator$
# Coverage report
go test -cover ./...
Integration Tests: The project includes integration tests that verify parsing of real API responses from:
- Googlebot: 307 prefixes
- Bingbot: 28 prefixes
- GPTBot: 21 prefixes
- GitHub: 50 prefixes
- Stripe: 12 IPs
- UptimeRobot: 116 prefixes
- Applebot: 12 prefixes
Architecture Decisions
Why atomic.Pointer[T] instead of RWMutex?
Bot configurations change rarely (on reload/schedule, 1-20x/day) but are read on every request (1000s/sec). atomic.Pointer[T] provides:
- Lock-free reads - single atomic load, no lock acquisition overhead
- Readers never block - writes don't wait for readers, readers don't wait for writes (Copy-on-Write)
- Consistent performance - no priority inversion or cache line contention from lock operations
The result is consistent sub-microsecond performance for read-heavy workloads.
Why case-sensitive UA matching?
Official bots use fixed casing ("Googlebot", never "googlebot"). Case variations indicate forgery. Case-sensitive matching:
- Rejects fakes at first stage (no expensive DNS queries)
- 4x faster than case-insensitive (16ns vs 67ns)
- Improves both security and performance
Why Copy-on-Write cache?
RDNS cache sees 1-20 new IPs per day but 1000s of reads per second (99.99% read ratio). Copy-on-Write with atomic swap provides:
- Zero-allocation reads (no locking)
- Safe concurrent access
- Simple implementation (vs lock-free data structures)
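The combination of `atomic.Pointer[T]` and Copy-on-Write can be shown with a small map-backed cache. This is a minimal sketch under the assumptions stated in the comments; `cowCache` and its methods are hypothetical names, not the library's actual types.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a hypothetical copy-on-write map from IP to verified hostname.
type cowCache struct {
	mu   sync.Mutex                        // serializes writers only
	data atomic.Pointer[map[string]string] // readers load this without locking
}

func newCowCache() *cowCache {
	c := &cowCache{}
	empty := map[string]string{}
	c.data.Store(&empty)
	return c
}

// Get performs a single atomic load followed by a plain map read.
// The loaded map is never mutated after publication, so this is safe.
func (c *cowCache) Get(ip string) (string, bool) {
	host, ok := (*c.data.Load())[ip]
	return host, ok
}

// Set copies the current map, adds the entry, and atomically swaps the pointer.
func (c *cowCache) Set(ip, host string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.data.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[ip] = host
	c.data.Store(&next)
}

func main() {
	c := newCowCache()
	c.Set("66.249.66.1", "crawl-66-249-66-1.googlebot.com")
	host, ok := c.Get("66.249.66.1")
	fmt.Println(host, ok) // crawl-66-249-66-1.googlebot.com true
	_, ok = c.Get("203.0.113.9")
	fmt.Println(ok) // false
}
```

Each write copies the whole map, which is only reasonable because writes are rare relative to reads, as described above.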
Why byte-level index?
Linear bot list scan is fast for 3 bots (52ns) but degrades to 640ns at 40 bots. Single-character index provides 4-5x speedup for 20-50 bots at minimal memory cost (<1KB).
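One possible shape for a single-byte index is a map from the first byte of each UA marker to the markers that start with it. The sketch below is an assumption about how such an index could work, not the library's actual data structure; `index`, `buildIndex`, and `find` are hypothetical names, and the word-boundary rule (ASCII letters and digits) is simplified.

```go
package main

import (
	"fmt"
	"strings"
)

// index groups UA markers by their first byte (a hypothetical layout).
type index map[byte][]string

func buildIndex(markers []string) index {
	idx := make(index)
	for _, m := range markers {
		if m != "" {
			idx[m[0]] = append(idx[m[0]], m)
		}
	}
	return idx
}

// isWordByte reports whether b is an ASCII letter or digit.
func isWordByte(b byte) bool {
	return b >= 'a' && b <= 'z' || b >= 'A' && b <= 'Z' || b >= '0' && b <= '9'
}

// find returns the first marker that appears in ua as a whole word, or "".
// Only markers sharing the current byte are compared, not the full list.
func (idx index) find(ua string) string {
	for i := 0; i < len(ua); i++ {
		if i > 0 && isWordByte(ua[i-1]) {
			continue // not at the start of a word
		}
		for _, m := range idx[ua[i]] {
			if !strings.HasPrefix(ua[i:], m) {
				continue
			}
			end := i + len(m)
			if end == len(ua) || !isWordByte(ua[end]) {
				return m
			}
		}
	}
	return ""
}

func main() {
	idx := buildIndex([]string{"Googlebot", "bingbot", "GPTBot"})
	fmt.Printf("%q\n", idx.find("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
	fmt.Printf("%q\n", idx.find("Mozilla/5.0 (compatible; bingbot/2.0)"))
	fmt.Printf("%q\n", idx.find("MyGooglebot/1.0"))
}
```

The three calls print "Googlebot", "bingbot", and "" respectively. A fixed 256-entry array would avoid the map hash at the cost of a larger footprint.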
Why Bot method encapsulation?
IP and ASN lifecycle operations (load, refresh, persist) are shared between initialization and the background scheduler. Encapsulating these as Bot methods:
- Eliminates duplicate code - initBot and runScheduler both call the same loadCachedIPs, refreshIPs, initializeASN, and refreshASN methods
- Centralizes state - IPTree and ASN cache pointers live on the Bot struct, making ownership clear
- Improves testability - Each lifecycle method can be unit tested in isolation
- Enables future extensions - New verification methods (e.g., BGP feeds) can follow the same pattern
Example Bot methods:
func (b *Bot) loadCachedIPs(path string) // Load cached prefixes from file
func (b *Bot) refreshIPs(http *http.Client, root string) // Download and persist new prefixes
func (b *Bot) initializeASN(store *asn.Store) // Load ASN cache with fallback to API
func (b *Bot) refreshASN(store *asn.Store) // Refresh ASN prefixes from API
Adding New Bots
Adding a new bot requires no code changes - just create a YAML configuration file.
Step 1: Choose Verification Method
| Method | When to Use | Example |
| URL + Parser | Bot has official JSON/TXT IP list | Googlebot, Bingbot, GPTBot |
| ASN | Bot has official ASN registration | Cloudflare (AS13335), Google (AS15169) |
| RDNS Only | No official IP list, verify via DNS | Baidu, Yandex |
Step 2: Create Configuration File
Create bots/conf.d/newbot.yaml:
# Case 1: Bot with official JSON IP list (RECOMMENDED)
kind: SearchEngine # Category: SearchEngine, SocialMedia, Tool, etc.
name: newbot # Unique identifier (used in results)
parser: google # Parser: google, openai, txt, github, stripe
ua: "NewBot" # User-Agent fragment (case-sensitive!)
urls:
- "https://example.com/bot-ips.json"
# Case 2: Bot with ASN verification (fastest option)
kind: SearchEngine
name: newbot
ua: "NewBot"
asn:
- 12345 # ASN number (fetched from RIPE API)
# Case 3: Bot with RDNS verification only (no official IP list)
kind: SearchEngine
name: newbot
ua: "NewBot"
domains:
- "newbot.example.com"
rdns: true
Step 3: Choose Parser
Choose the correct parser based on the IP list format:
Google-style (ipv4Prefix/ipv6Prefix fields):
{"prefixes": [{"ipv4Prefix": "1.2.3.4/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
Parser: google
OpenAI-style (prefix field):
{"prefixes": [{"prefix": "1.2.3.4/24"}]}
Parser: openai
Plain text (one CIDR or individual IP per line):
1.2.3.4/24
5.6.7.8/24
172.16.0.5
Parser: txt (converts individual IPs to /32 or /128 CIDR notation)
GitHub-style (hooks, web, api string arrays):
{"hooks": ["192.30.252.0/22"], "web": ["192.30.252.0/22"], "api": ["192.30.252.0/22"]}
Parser: github
Stripe-style (WEBHOOKS array with individual IPs):
{"WEBHOOKS": ["3.18.12.63", "3.130.192.231", "13.235.14.237"]}
Parser: stripe (converts individual IPs to /32 or /128 CIDR notation)
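The conversion of individual IPs to /32 or /128 prefixes mentioned for the txt and stripe parsers can be sketched with `net/netip`. `parseTxt` is a hypothetical function, and silently skipping blank or invalid lines is an assumption for the sake of a short example.

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"strings"
)

// parseTxt is a hypothetical plain-text parser: one CIDR or single IP per line.
// Single IPs become /32 (IPv4) or /128 (IPv6) prefixes; blank and invalid
// lines are skipped.
func parseTxt(text string) []netip.Prefix {
	var out []netip.Prefix
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if p, err := netip.ParsePrefix(line); err == nil {
			out = append(out, p.Masked()) // zero the host bits
			continue
		}
		if a, err := netip.ParseAddr(line); err == nil {
			out = append(out, netip.PrefixFrom(a, a.BitLen()))
		}
	}
	return out
}

func main() {
	for _, p := range parseTxt("1.2.3.4/24\n5.6.7.8/24\n172.16.0.5\n2001:db8::1\n") {
		fmt.Println(p)
	}
	// prints:
	// 1.2.3.0/24
	// 5.6.7.0/24
	// 172.16.0.5/32
	// 2001:db8::1/128
}
```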
Step 4: Restart Validator
To apply new bot configurations, restart your application or recreate the Validator:
// Create a new validator with updated bots
v, err := knownbots.New(knownbots.WithRoot("./bots"))
if err != nil {
	log.Fatal(err)
}
defer v.Close()
Step 5: Verify
result := v.Validate(
	"Mozilla/5.0 (compatible; NewBot/1.0; +https://example.com/bot)",
	"1.2.3.4",
)
fmt.Printf("Status: %s\n", result.Status)         // "verified"
fmt.Printf("IsBot: %t\n", result.IsBot)           // true
fmt.Printf("IsVerified: %t\n", result.IsVerified) // true
Example Configurations
Googlebot (official JSON, fast verification):
kind: SearchEngine
name: googlebot
parser: google
ua: "Googlebot"
urls:
- "https://www.gstatic.com/ipranges/google.json"
Bingbot (official JSON):
kind: SearchEngine
name: bingbot
parser: google
ua: "bingbot"
urls:
- "https://www.bing.com/toolbox/bingbot.json"
GPTBot (OpenAI uses Google-style JSON):
kind: AiTraining
name: gptbot
parser: google
ua: "GPTBot"
urls:
- "https://openai.com/gptbot.json"
Applebot (official JSON from developer.apple.com):
kind: SearchEngine
name: applebot
parser: google
ua: "Applebot"
urls:
- "https://search.developer.apple.com/applebot.json"
GitHub Webhooks:
kind: Tool
name: github
parser: github
ua: "GitHub-Hookshot"
urls:
- "https://api.github.com/meta"
Stripe Webhooks:
kind: Tool
name: stripe
parser: stripe
ua: "Stripe"
urls:
- "https://stripe.com/files/ips/ips_webhooks.json"
UptimeRobot (plain text with individual IPs):
kind: Monitoring
name: uptimerobot
parser: txt
ua: "UptimeRobot"
urls:
- "https://uptimerobot.com/inc/files/ips/IPv4.txt"
Baidu (RDNS only, no official IP list):
kind: SearchEngine
name: baiduspider
ua: "Baiduspider"
domains:
- "baidu.com"
- "baidu.jp"
rdns: true
Yandex (RDNS only):
kind: SearchEngine
name: yandexbot
ua: "YandexBot"
domains:
- "yandex.com"
- "yandex.ru"
rdns: true
Common Mistakes
| Mistake | Problem | Solution |
| Wrong casing | "googlebot" won't match "Googlebot/2.1" | Use exact casing: "Googlebot" |
| Wrong parser | JSON not parsed correctly | Match parser to JSON structure |
| Missing rdns: true | RDNS verification not performed | Add rdns: true for DNS-based bots |
| Empty custom: [] | Unnecessary configuration | Omit empty fields |
Testing New Bot Config
# Run tests to verify bot parsing
go test -v ./...
# Run specific parser test
go test -v -run TestGoogleParser ./parser/
# Validate IP list format
curl -s https://example.com/bot-ips.json | jq '.prefixes[0]'
Contributing
Contributions are welcome! Whether you want to add new bots, fix bugs, or improve documentation.
Ways to Contribute
- Add new bot configurations - Most contributions are just YAML files in bots/conf.d/
- Fix parser issues - Handle new or different IP list formats
- Improve documentation - Fix typos, clarify instructions, add examples
- Report bugs - Open issues with minimal reproduction steps
- Suggest features - Open discussions about new functionality
Submitting Pull Requests
- Fork the repository on GitHub
- Create a feature branch: git checkout -b add-newbot
- Add your bot configuration to bots/conf.d/newbot.yaml
- Test your changes:
  go test -short ./...
  go test -v -run TestNewBot ./parser/
- Commit using Google Git convention:
  git commit -m "feat: add NewBot configuration

  - Add NewBot YAML configuration
  - Verify User-Agent matching
  - Test IP parsing with official API

  PiperOrigin-RevId: XXXXXXXX
  Change-Id: IXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
- Push and create a Pull Request
Bot Configuration Guidelines
When adding a new bot configuration:
- Verify the User-Agent from official documentation
  - Use exact casing (e.g., "Googlebot", not "googlebot")
  - Check for word boundary matching requirements
- Find the official IP list URL
  - Most major bots publish JSON/TXT IP lists
  - Prefer official sources over third-party aggregators
- Choose the correct parser
  - Match the parser to the actual JSON structure
  - Test with real API response before submitting
- Test thoroughly
  - Run go test -short ./... to verify no regressions
  - Check that integration tests pass for the new bot, if applicable
Code Style
- Follow standard Go conventions
- Run go fmt ./... before committing
- Run go vet ./... to catch potential issues
- Add tests for new functionality
License
MIT License
Author
Dayi Chen - GitHub
Acknowledgments
- Inspired by Google's official bot verification documentation
- Performance patterns influenced by Go stdlib's sync/atomic and net/netip designs
- Special thanks to all contributors and users providing feedback
⭐ Star this project if you find it useful!
📝 Questions? Open an issue or start a discussion!
🐛 Found a bug? Please report it with minimal reproduction steps!