wikiscrape

command module
v0.0.0-...-8ee748d Latest
Published: Apr 18, 2024 License: MIT Imports: 1 Imported by: 0

README

🌐 Get wiki pages from the command line. Increase brain volume 🧠.



[!WARNING]
This project is unfinished! Not all of the features listed in the README are available.

Wikiscrape

Get wiki pages. Export to your desired format. Wikiscrape is a command-line tool which aims to provide a wiki-agnostic method for retrieving and exporting data from wiki pages.

Made with VHS

The whole motivation for this project is to provide a consistent and convenient interface for the sometimes frustrating world of wiki APIs. Although the vast majority of wikis are built on a small number of frameworks, I often found that even wikis sharing a backend framework have vastly different access patterns.

For example, despite both being built on top of MediaWiki, Wikipedia and the Old School RuneScape wiki differ in the following:

  • API Endpoint: en.wikipedia.org/w/api.php vs. oldschool.runescape.wiki/api.php
  • Page Prefix: en.wikipedia.org/wiki/pageName vs. oldschool.runescape.wiki/w/pageName

Features

  • Bl-Moderately Fast 🚀🔥
  • Effortless retrieval of full wiki pages or specific sections
  • Support for multiple wiki backends
  • Manifest file support: wikiscrape can iteratively scrape from a list of pages given a JSON file.

Wiki Support

Because of the differences in API access patterns mentioned above, wikis must be explicitly supported by Wikiscrape in order to retrieve content from them. "Support" involves the following:

  • A wikiInfo entry in internal/util/wikisupport.go, which maps known wiki names or URL host segments to information about their respective backends, API endpoints, and page prefixes, used for parsing page names from URLs.

  • A scraper and response type in internal/scraper, designed specifically for the wiki's backend, which handle fetching API responses and parsing their content (a rough sketch follows).
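
The actual interface lives in internal/scraper; the following is only a rough sketch of what a backend scraper might look like. All type and method names here (Scraper, Page, GetPage, GetSection) are illustrative assumptions, not the project's real API.

// Illustrative sketch only: the real types in internal/scraper may differ.
// A scraper knows how to fetch and parse pages for one wiki backend.
type Scraper interface {
	// GetPage fetches a full page by name and parses the API response.
	GetPage(name string) (Page, error)
	// GetSection fetches and parses a single named section of a page.
	GetSection(name, section string) (Page, error)
}

// Page is a minimal stand-in for the parsed result.
type Page struct {
	Title    string
	Sections []string
	Content  string
}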

For a list of the wikis and backends supported by Wikiscrape, please see the command wikiscrape list -h. Currently, supported backends are:

  • MediaWiki

If you have a wiki that you would like supported, and there is already existing support for its backend in the aforementioned internal/scraper, please feel free to submit an issue. If you have the skill or the time, please also feel free to contribute directly to the project by adding the wiki to the wikiHostInfo and wikiNameInfo maps in internal/util/wikisupport.go! Please see the Contribution section below.
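
As a rough idea of what such an entry involves: the map names wikiHostInfo and wikiNameInfo come from the README above, but the wikiInfo field names and values below are assumptions for illustration only, not the project's actual code. The Wikipedia endpoint and page prefix are the ones listed earlier.

// Hypothetical sketch of entries in internal/util/wikisupport.go.
// Field names are assumed; the real struct may differ.
type wikiInfo struct {
	Backend     string // backend framework, e.g. "mediawiki"
	APIEndpoint string // full URL of the wiki's API endpoint
	PagePrefix  string // path segment that precedes page names in URLs
}

// Keyed by URL host, used when scraping a page by URL.
var wikiHostInfo = map[string]wikiInfo{
	"en.wikipedia.org": {
		Backend:     "mediawiki",
		APIEndpoint: "https://en.wikipedia.org/w/api.php",
		PagePrefix:  "/wiki/",
	},
}

// Keyed by friendly name, used with the --wiki flag.
var wikiNameInfo = map[string]wikiInfo{
	"wikipedia": {
		Backend:     "mediawiki",
		APIEndpoint: "https://en.wikipedia.org/w/api.php",
		PagePrefix:  "/wiki/",
	},
}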

Installation

Right now, the best way to get wikiscrape on your machine is to just use go:

go install github.com/mal0ner/wikiscrape@latest

Usage

Wikiscrape gives you a simple and intuitive command-line interface.

Scrape a single page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear"

# by name
wikiscrape page "Bear" --wiki wikipedia

Scrape the list of section headings from a page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear" --section-titles

# by name
wikiscrape page "Bear" --wiki wikipedia -t

Scrape a specific section:

wikiscrape page "Bear" --wiki wikipedia --section "Taxonomy"

# short
wikiscrape page Bear -w wikipedia -s Taxonomy

Scrape multiple pages from a manifest file:

wikiscrape pages --wiki wikipedia --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -f path/to/manifest.json

Scrape just references from a list of pages:

wikiscrape pages --wiki wikipedia --section "References" --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -s References -f path/to/manifest.json

Manifest

The format of the manifest file is just a simple JSON array. This was probably a strange design decision but I don't really want to change it! Page titles can be included raw without the need for URL encoding, as this step is taken care of by the program.

["Hammer", "Zulrah/Strategies"]

This could potentially be expanded in the future to allow the user to specify a section to scrape on a per-page basis, e.g. {"page": "Hammer", "section": "Uses"}, but I have no plans for that now.

FAQ

Will you ever fix the logo alignment?

No 👍

Contribution

We welcome contributions! If you'd like to help out, please follow these steps:

  • Fork the repository
  • Create a new branch for your feature or bug fix
  • Make your changes and commit them with descriptive messages
  • Push your changes to your forked repository
  • Submit a pull request to the main repository

Roadmap

  • Multi-language support
  • Fuzzy-find pages (low priority)
  • Fuzzy-find sections (low priority)
  • Add more export formats
  • Link preservation
  • Table parsing
  • List parsing
  • Reference parsing and potentially BibTeX export? Could have a --references flag
  • Tests!
  • Adding more wikis (and the confluence backend)
  • Proper SemVer
  • Add configuration file for configuring default behaviour (for less verbosity)

Documentation

Overview

Package main is the primary entry point for the wikiscrape CLI program.

Directories

Path Synopsis
cmd
internal
  export   Package export handles interface specification for wiki-agnostic exporters for page instances created by the scrape package.
  scrape   Package scrape handles interface specifications and concrete wiki-specific implementations for the scraping and parsing of content from pages served by various wiki frameworks.
