doctrove

Module version: v1.0.0
Published: Mar 20, 2026 License: MIT

A local documentation store for AI coding agents.

Mirrors LLM-targeted documentation (llms.txt and companion files) from websites to a local store with full-text search, git change tracking, and an MCP interface for agent access.

Install

make install                   # builds and installs to $GOBIN
make init-workspace            # creates ~/.config/doctrove with default config
doctrove mcp-config            # shows config to add to your agent

Workspace defaults to ~/.config/doctrove. Override with --dir or DOCTROVE_DIR.

Quick Start

# Discover what a site has
doctrove discover https://stripe.com

# Grab it (init + sync in one step)
doctrove grab https://supabase.com

# Search across all mirrored content
doctrove search "authentication"

# Search only API docs
doctrove search --category api-reference "webhooks"

# Refresh to pick up changes (uses ETag caching)
doctrove refresh supabase.com

# See what you have
doctrove catalog
doctrove stats

Commands

Command                   Description
discover <url>            Probe a URL for LLM content without tracking
grab <url>                Discover, track, and sync in one step
init <url>                Add a site to track
sync [site|--all]         Download/update content
refresh [site|--all]      Re-sync tracked sites, skipping unchanged files via ETag caching
search <query>            Full-text search with --site, --type, --category, --full
tag <site> <path> <cat>   Override the category for a mirrored file
catalog                   Show site summaries with topics (from llms.txt structure)
stats                     Disk usage, file counts, sync freshness per site
stale                     Show sites not synced within --threshold (default 7d)
list                      List all tracked sites
status [site]             Show sync status and file counts
check <site>              Dry-run: show available content without downloading
history [site]            Git-based change history with --since
diff [from] [to]          Show content changes between syncs
remove <site>             Stop tracking (with --keep-files option)
mcp                       Start MCP server (stdio transport)

All commands support --json for machine-readable output.

MCP Server

Generate your config snippet:

doctrove mcp-config

Add the mcpServers entry to the appropriate config file:

Agent                         Config File
Claude Code (user scope)      ~/.claude.json
Claude Code (project scope)   .mcp.json (project root)
Cursor                        .cursor/mcp.json (project root)

Example config:

{
  "mcpServers": {
    "doctrove": {
      "command": "/usr/local/bin/doctrove",
      "args": ["mcp", "--dir", "/Users/you/.config/doctrove"]
    }
  }
}

Tools (20)

Tool                Description
trove_discover      Probe a URL for LLM content
trove_scan          Add and sync a site (content_types param to filter; persisted for refresh; re-scannable)
trove_refresh       Re-sync a tracked site, using ETag caching (honours content_types filter)
trove_check         Dry-run: show available content with sizes and content types
trove_search        Full-text search with category, path filters; path-boosted ranking; summaries included
trove_search_full   Search and return full content of best match (large; prefer outline+section read)
trove_outline       Get heading structure with max_depth (default 3) and max_sections (default 100) caps
trove_read          Read a file or specific section by heading match (section param)
trove_summarize     Store an agent-written summary for a file (visible in search results and outlines)
trove_tag           Override category for a file (validated, persists across re-syncs)
trove_list          List tracked sites
trove_list_files    Enumerate files with path, size, content type, and category (paginated, category filter)
trove_catalog       Site summaries with topics
trove_stats         Workspace statistics
trove_status        Sync status, category breakdown, and staleness for a site
trove_history       Git change history
trove_diff          Content changes between refs (stat mode for compact summary)
trove_stale         List sites not synced within a threshold (default 7d)
trove_find          Find files by path pattern (faster than search for path lookups)
trove_remove        Stop tracking a site

Context-Efficient Workflow

The tools are designed for hierarchical drill-down to minimize context usage:

trove_catalog          → which site has docs on my topic?
trove_search           → which files are relevant? (check summaries first)
trove_outline          → what sections does this file have? (+ summary if cached)
trove_read section=X   → read just the section I need
trove_summarize        → cache a summary so the next agent doesn't re-read

trove_tag and trove_summarize persist across re-syncs. If you read a large file, summarize it. If a category is wrong, fix it.
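
To make the outline-then-read step concrete, here is a minimal Go sketch of extracting a capped heading outline and then reading a single section by heading match. It is not doctrove's implementation: the names Heading, outline, and readSection are hypothetical, only "#"-style markdown headings are recognized, and fenced code blocks are not skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// Heading is one outline entry (hypothetical type for this sketch).
type Heading struct {
	Level int    // number of leading '#' characters
	Title string // heading text without the '#' prefix
	Line  int    // zero-based line index of the heading
}

// outline collects headings no deeper than maxDepth and stops once
// maxSections entries are found. A maxSections of 0 or less means no cap.
func outline(doc string, maxDepth, maxSections int) []Heading {
	var out []Heading
	for i, line := range strings.Split(doc, "\n") {
		level := len(line) - len(strings.TrimLeft(line, "#"))
		if level == 0 || level > maxDepth || level >= len(line) || line[level] != ' ' {
			continue
		}
		out = append(out, Heading{Level: level, Title: strings.TrimSpace(line[level:]), Line: i})
		if len(out) == maxSections {
			break
		}
	}
	return out
}

// readSection returns the first section whose title contains match
// (case-insensitive), up to the next heading of the same or a higher level.
func readSection(doc, match string) (string, bool) {
	lines := strings.Split(doc, "\n")
	heads := outline(doc, 6, 0)
	for i, h := range heads {
		if !strings.Contains(strings.ToLower(h.Title), strings.ToLower(match)) {
			continue
		}
		end := len(lines)
		for _, next := range heads[i+1:] {
			if next.Level <= h.Level {
				end = next.Line
				break
			}
		}
		return strings.TrimSpace(strings.Join(lines[h.Line:end], "\n")), true
	}
	return "", false
}

func main() {
	doc := "# Guide\nintro\n## Auth\nuse keys\n### Tokens\nrotate\n## Webhooks\nverify"
	for _, h := range outline(doc, 2, 100) {
		fmt.Printf("%s%s\n", strings.Repeat("  ", h.Level-1), h.Title)
	}
	if body, ok := readSection(doc, "auth"); ok {
		fmt.Println(body)
	}
}
```

The point of the two-step shape is that the outline is small enough to read in full, while the section read returns only the lines under one heading.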

Content Discovery

doctrove probes multiple sources for LLM-targeted content:

  • Well-known paths: /llms.txt, /llms-full.txt, /llms-ctx.txt, /llms-ctx-full.txt, /ai.txt
  • Companion files: URLs referenced in llms.txt (markdown links followed permissively)
  • Sitemap: Checks sitemap.xml for paths containing /llms/ or ending in .md/.txt
  • .well-known: tdmrep.json, agent.json, agents.json
  • Context7: Bare library names (e.g. react, stripe-node) resolved via Context7 API when context7_api_key is configured
  • HTML conversion: Sites serving HTML at content URLs (Next.js, SPAs) are converted to markdown
  • MDX cleanup: Framework artifacts (JSX components, export statements, boilerplate banners) are stripped from mirrored content
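
The sitemap rule in the list above can be sketched in Go as follows. This is an illustration only, assuming a standard sitemap.xml with url/loc entries; the function name llmCandidates and the urlSet type are hypothetical and do not come from doctrove's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/url"
	"strings"
)

// urlSet mirrors the part of a sitemap.xml document this sketch needs.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// llmCandidates returns sitemap URLs whose path contains /llms/ or ends
// in .md or .txt, which is the filter described in the list above.
func llmCandidates(sitemap []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(sitemap, &set); err != nil {
		return nil, err
	}
	var out []string
	for _, u := range set.URLs {
		parsed, err := url.Parse(strings.TrimSpace(u.Loc))
		if err != nil {
			continue // skip malformed entries rather than failing the whole scan
		}
		p := parsed.Path
		if strings.Contains(p, "/llms/") || strings.HasSuffix(p, ".md") || strings.HasSuffix(p, ".txt") {
			out = append(out, parsed.String())
		}
	}
	return out, nil
}

func main() {
	sitemap := []byte(`<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/llms/auth</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/docs/api.md</loc></url>
</urlset>`)
	urls, err := llmCandidates(sitemap)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```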

Page Categories

Every indexed file is assigned a semantic category for task-appropriate filtering:

Category        Examples
api-reference   /api/, /reference/, code-heavy pages
tutorial        /tutorials/, /getting-started/, /quickstart
guide           /guides/, /learn/, /how-to/
spec            /specification/, /schema, /seps/
changelog       /changelog, /release-notes
marketing       /pricing, /use-cases/, /customers, link-heavy pages
legal           /privacy, /legal/, /terms
community       /community/, /contributing
context7        Content fetched via Context7 API
index           llms.txt, llms-full.txt, ai.txt (site index files)
other           Unclassified companions, well-known metadata

Categories are assigned by path pattern, with body analysis as a fallback when no pattern matches. Override a category with trove_tag or doctrove tag.

# Search only API docs
doctrove search --category api-reference "hooks"

# Fix a misclassified page
doctrove tag stripe.com /payments marketing
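
As a rough illustration of path-based assignment, the Go sketch below maps a URL path to a category using the example patterns from the table. The rule order, the substring matching, and the function name categorize are assumptions made for this example; the body-analysis fallback is not modelled, so unmatched paths simply return other.

```go
package main

import (
	"fmt"
	"strings"
)

// rules lists path fragments per category, taken from the examples in the
// category table. The first matching rule wins (an assumed ordering).
var rules = []struct {
	category string
	patterns []string
}{
	{"index", []string{"/llms.txt", "/llms-full.txt", "/ai.txt"}},
	{"api-reference", []string{"/api/", "/reference/"}},
	{"tutorial", []string{"/tutorials/", "/getting-started/", "/quickstart"}},
	{"guide", []string{"/guides/", "/learn/", "/how-to/"}},
	{"spec", []string{"/specification/", "/schema", "/seps/"}},
	{"changelog", []string{"/changelog", "/release-notes"}},
	{"marketing", []string{"/pricing", "/use-cases/", "/customers"}},
	{"legal", []string{"/privacy", "/legal/", "/terms"}},
	{"community", []string{"/community/", "/contributing"}},
}

// categorize returns the first category whose pattern occurs in the path,
// or "other" when nothing matches.
func categorize(path string) string {
	p := strings.ToLower(path)
	for _, r := range rules {
		for _, pat := range r.patterns {
			if strings.Contains(p, pat) {
				return r.category
			}
		}
	}
	return "other"
}

func main() {
	for _, p := range []string{"/docs/api/charges.md", "/pricing", "/llms.txt", "/about"} {
		fmt.Printf("%-24s %s\n", p, categorize(p))
	}
}
```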

Context7 Integration

With a Context7 API key, you can resolve bare library names (e.g. react, stripe-node) to documentation maintained by the Context7 community, in addition to site-sourced llms.txt content.

settings:
  context7_api_key: ctx7sk-...   # get a key at https://context7.com

# Discover and sync Context7 docs for a library
doctrove scan react
doctrove scan stripe-node

Content fetched via Context7 is categorized as context7 and stored under synthetic domains (e.g. context7.com~facebook_react), which keeps it separate from site-sourced content. Context7 content is subject to the Upstash Terms of Service.

ETag Caching

Re-syncs use HTTP conditional requests (If-None-Match, If-Modified-Since) to skip unchanged files. Cache headers are stored per-file in the index. Use refresh to take advantage of this:

doctrove refresh modelcontextprotocol.io   # only downloads changed files
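
For readers unfamiliar with conditional requests, here is a minimal Go sketch of the If-None-Match flow using only the standard library. It is not doctrove's code: fetchIfChanged and demo are hypothetical names, the server is an in-process test server, and If-Modified-Since is omitted for brevity.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchIfChanged issues a GET with If-None-Match when etag is non-empty.
// On 304 Not Modified it reports changed=false and keeps the old ETag.
func fetchIfChanged(client *http.Client, url, etag string) (body []byte, newETag string, changed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, "", false, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, "", false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotModified {
		return nil, etag, false, nil
	}
	if resp.StatusCode != http.StatusOK {
		return nil, "", false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, "", false, err
	}
	return body, resp.Header.Get("ETag"), true, nil
}

// demo fetches the same resource twice from a local test server and reports
// whether each fetch downloaded content.
func demo() (bool, bool, error) {
	const tag = `"v1"`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("If-None-Match") == tag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", tag)
		fmt.Fprint(w, "# llms.txt\n")
	}))
	defer srv.Close()

	_, etag, first, err := fetchIfChanged(srv.Client(), srv.URL, "")
	if err != nil {
		return false, false, err
	}
	_, _, second, err := fetchIfChanged(srv.Client(), srv.URL, etag)
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first fetch downloaded:", first)
	fmt.Println("second fetch downloaded:", second)
}
```

Storing the returned ETag next to each file is what lets a later refresh skip the download when the server answers 304.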

Configuration

doctrove.yaml in the workspace root:

settings:
  rate_limit: 2            # req/sec per host
  rate_burst: 5            # burst capacity
  timeout: 30s             # HTTP timeout
  max_probes: 100          # companion probes per llms.txt
  user_agent: "doctrove/1.0"
  events_url: http://localhost:6060/events    # optional eventrelay integration
  context7_api_key: ctx7sk-...                # optional Context7 API key

sites:
  stripe.com:
    url: https://stripe.com
    include:
      - "/llms*"
      - "/docs/**/*.md"
    exclude:
      - "/internal/**"
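
The include and exclude patterns above can be read with the following Go sketch. The glob semantics are an assumption for illustration (a single * stays within one path segment, ** crosses segments, exclude wins over include, and an empty include list allows everything); doctrove's actual matcher may differ, and globToRegexp, matchAny, and allowed are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegexp translates a glob into an anchored regular expression.
// "**/" matches zero or more directories, "**" matches anything,
// and "*" matches within a single path segment.
func globToRegexp(glob string) (*regexp.Regexp, error) {
	var b strings.Builder
	b.WriteString("^")
	for i := 0; i < len(glob); {
		switch {
		case strings.HasPrefix(glob[i:], "**/"):
			b.WriteString("(?:.*/)?")
			i += 3
		case strings.HasPrefix(glob[i:], "**"):
			b.WriteString(".*")
			i += 2
		case glob[i] == '*':
			b.WriteString("[^/]*")
			i++
		default:
			b.WriteString(regexp.QuoteMeta(glob[i : i+1]))
			i++
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

// matchAny reports whether path matches at least one glob.
func matchAny(path string, globs []string) (bool, error) {
	for _, g := range globs {
		re, err := globToRegexp(g)
		if err != nil {
			return false, err
		}
		if re.MatchString(path) {
			return true, nil
		}
	}
	return false, nil
}

// allowed applies exclude first, then include.
func allowed(path string, include, exclude []string) (bool, error) {
	if hit, err := matchAny(path, exclude); err != nil || hit {
		return false, err
	}
	if len(include) == 0 {
		return true, nil
	}
	return matchAny(path, include)
}

func main() {
	include := []string{"/llms*", "/docs/**/*.md"}
	exclude := []string{"/internal/**"}
	for _, p := range []string{"/llms-full.txt", "/docs/api/auth.md", "/docs/readme.txt", "/internal/notes.md"} {
		ok, err := allowed(p, include, exclude)
		fmt.Println(p, ok, err)
	}
}
```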

Global Flags

--dir string         workspace directory (default ~/.config/doctrove)
--json               output as JSON
--respect-robots     respect robots.txt AI crawler directives (off by default)

Storage

Content is stored as plain files under sites/<domain>/, tracked by git for change history, with a SQLite FTS5 index for search. The workspace is self-contained; share it by cloning.

When a URL path conflicts with a child path (e.g. /deploy exists as a file but /deploy/getting_started needs to be stored), the parent file is promoted to a directory with its content at _index. ReadContent handles this automatically.
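
The promotion rule can be sketched in Go as below. This is an illustrative reading of the paragraph above, not the project's ReadContent or storage code: store, promote, read, and demo are hypothetical helpers, and the temporary ".promote" suffix is an arbitrary choice for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// promote turns an existing regular file into a directory that holds the
// old content at _index. Missing paths and directories are left alone.
func promote(path string) error {
	info, err := os.Stat(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return err
	}
	if info.IsDir() {
		return nil
	}
	tmp := path + ".promote"
	if err := os.Rename(path, tmp); err != nil {
		return err
	}
	if err := os.Mkdir(path, 0o755); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(path, "_index"))
}

// store writes data for urlPath under root, promoting any ancestor that
// currently exists as a file.
func store(root, urlPath string, data []byte) error {
	parts := strings.Split(strings.Trim(urlPath, "/"), "/")
	dir := root
	for _, part := range parts[:len(parts)-1] {
		dir = filepath.Join(dir, part)
		if err := promote(dir); err != nil {
			return err
		}
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	target := filepath.Join(dir, parts[len(parts)-1])
	if info, err := os.Stat(target); err == nil && info.IsDir() {
		target = filepath.Join(target, "_index") // path was already promoted
	}
	return os.WriteFile(target, data, 0o644)
}

// read resolves urlPath, following the _index convention for directories.
func read(root, urlPath string) ([]byte, error) {
	p := filepath.Join(root, filepath.FromSlash(strings.Trim(urlPath, "/")))
	if info, err := os.Stat(p); err == nil && info.IsDir() {
		p = filepath.Join(p, "_index")
	}
	return os.ReadFile(p)
}

// demo stores a parent and then a child path in a temporary directory and
// reads both back.
func demo() (string, string, error) {
	root, err := os.MkdirTemp("", "promote-sketch")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	if err := store(root, "/deploy", []byte("parent")); err != nil {
		return "", "", err
	}
	if err := store(root, "/deploy/getting_started", []byte("child")); err != nil {
		return "", "", err
	}
	parent, err := read(root, "/deploy")
	if err != nil {
		return "", "", err
	}
	child, err := read(root, "/deploy/getting_started")
	return string(parent), string(child), err
}

func main() {
	parent, child, err := demo()
	fmt.Println(parent, child, err)
}
```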

Event Relay Integration

When events_url is configured, doctrove emits structured events to an eventrelay server for real-time observability. Events follow the full eventrelay schema:

{
  "source": "doctrove",
  "channel": "mcp",
  "action": "trove_search",
  "level": "info",
  "agent_id": "myproject:00a3f1",
  "duration_ms": 42,
  "data": {"query": "authentication", "site": "stripe.com"},
  "ts": "2026-03-18T12:00:00Z"
}

Field         Description
source        Always doctrove
channel       mcp for MCP tool calls, sync for engine operations (init, sync, discover, remove)
action        Tool or operation name (e.g. trove_search, sync, init)
level         info normally, error on failure, warn on partial errors
agent_id      Auto-derived from working directory + PID (e.g. myproject:00a3f1)
duration_ms   Operation wall time (top-level, displayed inline in the dashboard)
data          Tool arguments (MCP) or operation details (engine)
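
A consumer or test harness could model this event with a Go struct like the one below. This is a sketch based only on the example payload and field descriptions above; the type name Event and the helpers levelFor, sample, and encode are hypothetical, and how doctrove delivers events to events_url is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event mirrors the fields of the example payload above.
type Event struct {
	Source     string         `json:"source"`
	Channel    string         `json:"channel"`
	Action     string         `json:"action"`
	Level      string         `json:"level"`
	AgentID    string         `json:"agent_id"`
	DurationMS int64          `json:"duration_ms"`
	Data       map[string]any `json:"data"`
	TS         time.Time      `json:"ts"`
}

// levelFor picks the level as described in the field table:
// error on failure, warn on partial errors, info otherwise.
func levelFor(failed, partial bool) string {
	switch {
	case failed:
		return "error"
	case partial:
		return "warn"
	default:
		return "info"
	}
}

// sample rebuilds the example event from the text.
func sample() Event {
	return Event{
		Source:     "doctrove",
		Channel:    "mcp",
		Action:     "trove_search",
		Level:      levelFor(false, false),
		AgentID:    "myproject:00a3f1",
		DurationMS: 42,
		Data:       map[string]any{"query": "authentication", "site": "stripe.com"},
		TS:         time.Date(2026, 3, 18, 12, 0, 0, 0, time.UTC),
	}
}

// encode serializes an event to compact JSON.
func encode(e Event) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	out, err := encode(sample())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```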

Directories

Path Synopsis
cmd
doctrove command
Package content defines interfaces for parsing and processing document content.
internal
lockfile
Package lockfile provides file-based locking for workspace write operations.
