otwarchive-downloader

command module
v0.0.0-...-f3952f6
Published: Apr 30, 2025 License: CC0-1.0 Imports: 18 Imported by: 0

README

OTW Archive Downloader

A high-performance, concurrent tool for downloading and parsing works from any website powered by the OTW-Archive software, including Archive of Our Own (AO3), SquidgeWorld, and other otwarchive instances.

Features

  • Works with Multiple Sites: Compatible with any otwarchive-powered website (AO3, SquidgeWorld, etc.)
  • Mass Download: Efficiently download works using their IDs
  • Concurrent Processing: Download multiple works simultaneously
  • Proxy Support: Use SOCKS5 proxies to avoid rate limiting
  • Metadata Extraction: Extract rich metadata from works (tags, ratings, warnings, etc.)
  • Content Preservation: Save full text content along with metadata
  • Batch Processing: Process works in configurable batches for easier management
  • JSONL Output: Store works in JSON Lines format for easy data analysis
  • Robust Error Handling: Smart retries, proxy rotation, and connection management

Installation

# Clone the repository
git clone https://codeberg.org/nyuuzyou/otwarchive-downloader.git
cd otwarchive-downloader

# Build the executable
go build

Usage

./otwarchive-downloader [options]

Command-line Options

Option         Description                             Default
--start-id     Starting work ID                        1
--end-id       Ending work ID                          100000
--batch-size   Number of IDs per output file           10000
--concurrent   Maximum number of concurrent requests   4
--retries      Number of retry attempts per work       5
--proxy-file   File containing proxies                 proxy.txt
--output       Directory for output files              output
--use-proxies  Use proxies for connections             false
--timeout      Timeout for HTTP requests               60s
--base-url     Base URL for downloads                  https://download.archiveofourown.org/downloads
--user-agent   User agent string for requests          Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36

Using with Different otwarchive Sites

To use a different otwarchive-powered site, set --base-url to that site's downloads endpoint:

# For SquidgeWorld
./otwarchive-downloader --base-url="https://squidgeworld.org/downloads"

Proxy Configuration

To route requests through proxies, pass --use-proxies and list one proxy per line in proxy.txt (or in the file named by --proxy-file), using either of these formats:

host:port:username:password
host:port

Example:

127.0.0.1:9050:user:pass
proxy.example.com:1080
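
For readers who want to reuse the same proxy list in their own Go code, the two line formats above can be turned into SOCKS5 proxy URLs roughly as follows. This is a minimal sketch, not the tool's own implementation: parseProxyLine and newProxyClient are hypothetical helper names, and the 60-second timeout simply mirrors the documented --timeout default.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// parseProxyLine (hypothetical helper) converts "host:port" or
// "host:port:username:password" into a socks5:// URL.
// Splitting into at most 4 fields lets the password contain colons.
func parseProxyLine(line string) (*url.URL, error) {
	parts := strings.SplitN(strings.TrimSpace(line), ":", 4)
	if len(parts) != 2 && len(parts) != 4 {
		return nil, fmt.Errorf("expected host:port or host:port:username:password, got %q", line)
	}
	u := &url.URL{Scheme: "socks5", Host: parts[0] + ":" + parts[1]}
	if len(parts) == 4 {
		u.User = url.UserPassword(parts[2], parts[3])
	}
	return u, nil
}

// newProxyClient (hypothetical helper) builds an HTTP client that sends
// its requests through the given proxy.
func newProxyClient(proxy *url.URL, timeout time.Duration) *http.Client {
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
		Timeout:   timeout,
	}
}

func main() {
	for _, line := range []string{"127.0.0.1:9050:user:pass", "proxy.example.com:1080"} {
		u, err := parseProxyLine(line)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		client := newProxyClient(u, 60*time.Second)
		fmt.Println(u.String(), client.Timeout)
	}
}
```

The colon-separated format cannot express IPv6 literals, so this sketch assumes hostnames or IPv4 addresses only.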

Output Format

The tool outputs JSONL files (one JSON object per line) with the following structure:

{
  "id": "12345",
  "title": "Example Work Title",
  "metadata": {
    "author": "ExampleAuthor",
    "Rating": "Explicit",
    "Archive Warning": "No Archive Warnings Apply",
    "Category": "F/M",
    "Fandom": "Example Fandom",
    "Relationship": "Character A/Character B",
    "Character": "Character A, Character B",
    "Additional Tags": "Tag A, Tag B",
    "published": "1970-01-01",
    "completed": "2038-01-19",
    "words": "12,345",
    "chapters": "3/3"
  },
  "text": "Full text content of the work..."
}

Example

Download works with IDs from 1000 to 2000 using 8 concurrent connections:

./otwarchive-downloader --start-id=1000 --end-id=2000 --concurrent=8

Use proxies and target a specific otwarchive instance:

./otwarchive-downloader --start-id=1000 --end-id=10000 --use-proxies --concurrent=64 --base-url="https://download.your-archive-site.org/downloads"
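
The option table describes --batch-size as the number of IDs per output file, but this README does not say how batch boundaries are chosen. One plausible reading, sketched below purely as an assumption, is that the ID range is cut into consecutive chunks starting at the first ID. The names batch and splitBatches are hypothetical.

```go
package main

import "fmt"

// batch is an inclusive range of work IDs destined for one output file.
type batch struct{ Start, End int }

// splitBatches (hypothetical) cuts [startID, endID] into consecutive
// chunks of at most batchSize IDs. The tool's real boundaries may differ,
// for example by aligning to multiples of the batch size.
func splitBatches(startID, endID, batchSize int) []batch {
	if batchSize <= 0 || endID < startID {
		return nil
	}
	var out []batch
	for s := startID; s <= endID; s += batchSize {
		e := s + batchSize - 1
		if e > endID {
			e = endID
		}
		out = append(out, batch{Start: s, End: e})
	}
	return out
}

func main() {
	for _, b := range splitBatches(1, 25000, 10000) {
		fmt.Printf("IDs %d-%d\n", b.Start, b.End)
	}
}
```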
