wikiscrape

command module
v0.0.0-...-8ee748d Latest
Published: Apr 18, 2024 License: MIT Imports: 1 Imported by: 0

README

🌐 Get wiki pages from the command line. Increase brain volume 🧠.



[!WARNING]
This project is unfinished! Not all of the features listed in the README are available.

Wikiscrape

Get wiki pages. Export to your desired format. Wikiscrape is a command-line tool which aims to provide a wiki-agnostic method for retrieving and exporting data from wiki pages.

Made with VHS

The whole motivation for this project is to provide a consistent and convenient interface for the sometimes frustrating world of wiki APIs. Although the vast majority of wikis are built on a small number of frameworks, I often found that even wikis sharing a backend framework have vastly different access patterns.

For example, despite both being built on top of MediaWiki, Wikipedia and the Old School RuneScape wiki differ in the following:

  • API Endpoint: en.wikipedia.org/w/api.php vs. oldschool.runescape.wiki/api.php
  • Page Prefix: en.wikipedia.org/wiki/pageName vs. oldschool.runescape.wiki/w/pageName

Features

  • Bl-Moderately Fast 🚀🔥
  • Effortless retrieval of full wiki pages or specific sections
  • Support for multiple wiki backends
  • Manifest file support: wikiscrape can iteratively scrape from a list of pages given a JSON file.

Wiki Support

Because of the differences in API access patterns mentioned above, wikis must be explicitly supported by Wikiscrape in order to retrieve content from them. "Support" involves the following:

  • A wikiInfo entry in internal/util/wikisupport.go, which maps known wiki names or URL host segments to information about their respective backends, API endpoints, and page prefixes, used for parsing page names from URLs.

  • A scraper and response type in internal/scraper, designed specifically for the wiki's backend, which handle fetching API responses and parsing their content (a rough sketch follows).
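
The actual interface lives in internal/scraper; the following is only a rough sketch of what a backend scraper might look like. All type and method names here (Scraper, Page, GetPage, GetSection) are illustrative assumptions, not the project's real API.

// Illustrative sketch only: the real types in internal/scraper may differ.
// A scraper knows how to fetch and parse pages for one wiki backend.
type Scraper interface {
	// GetPage fetches a full page by name and parses the API response.
	GetPage(name string) (Page, error)
	// GetSection fetches and parses a single named section of a page.
	GetSection(name, section string) (Page, error)
}

// Page is a minimal stand-in for the parsed result.
type Page struct {
	Title    string
	Sections []string
	Content  string
}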

For a list of the wikis and backends supported by Wikiscrape, please see the command wikiscrape list -h. Currently, supported backends are:

  • MediaWiki

If you have a wiki that you would like supported, and there is already existing support for its backend in the aforementioned internal/scraper, please feel free to submit an issue. If you have the skill or the time, please also feel free to contribute directly to the project by adding the wiki to the wikiHostInfo and wikiNameInfo maps in internal/util/wikisupport.go! Please see the Contribution section below.
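
As a rough idea of what such an entry involves: the map names wikiHostInfo and wikiNameInfo come from the README above, but the wikiInfo field names and values below are assumptions for illustration only, not the project's actual code. The Wikipedia endpoint and page prefix are the ones listed earlier.

// Hypothetical sketch of entries in internal/util/wikisupport.go.
// Field names are assumed; the real struct may differ.
type wikiInfo struct {
	Backend     string // backend framework, e.g. "mediawiki"
	APIEndpoint string // full URL of the wiki's API endpoint
	PagePrefix  string // path segment that precedes page names in URLs
}

// Keyed by URL host, used when scraping a page by URL.
var wikiHostInfo = map[string]wikiInfo{
	"en.wikipedia.org": {
		Backend:     "mediawiki",
		APIEndpoint: "https://en.wikipedia.org/w/api.php",
		PagePrefix:  "/wiki/",
	},
}

// Keyed by friendly name, used with the --wiki flag.
var wikiNameInfo = map[string]wikiInfo{
	"wikipedia": {
		Backend:     "mediawiki",
		APIEndpoint: "https://en.wikipedia.org/w/api.php",
		PagePrefix:  "/wiki/",
	},
}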

Installation

Right now, the best way to get wikiscrape on your machine is to just use go:

go install github.com/mal0ner/wikiscrape@latest

Usage

Wikiscrape gives you a simple and intuitive command-line interface.

Scrape a single page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear"

# by name
wikiscrape page "Bear" --wiki wikipedia

Scrape the list of section headings from a page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear" --section-titles

# by name
wikiscrape page "Bear" --wiki wikipedia -t

Scrape a specific section:

wikiscrape page "Bear" --wiki wikipedia --section "Taxonomy"

# short
wikiscrape page Bear -w wikipedia -s Taxonomy

Scrape multiple pages from a manifest file:

wikiscrape pages --wiki wikipedia --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -f path/to/manifest.json

Scrape just references from a list of pages:

wikiscrape pages --wiki wikipedia --section "References" --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -s References -f path/to/manifest.json

Manifest

The format of the manifest file is just a simple JSON array. This was probably a strange design decision but I don't really want to change it! Page titles can be included raw without the need for URL encoding, as this step is taken care of by the program.

["Hammer", "Zulrah/Strategies"]

This could potentially be expanded in the future to allow the user to specify a section to scrape on a per-page basis, e.g. {"page": "Hammer", "section": "Uses"}, but I have no plans for that now.

FAQ

Will you ever fix the logo alignment?

No 👍

Contribution

We welcome contributions! If you'd like to help out, please follow these steps:

  • Fork the repository
  • Create a new branch for your feature or bug fix
  • Make your changes and commit them with descriptive messages
  • Push your changes to your forked repository
  • Submit a pull request to the main repository

Roadmap

  • Multi-language support
  • Fuzzy-find pages (low priority)
  • Fuzzy-find sections (low priority)
  • Add more export formats
  • Link preservation
  • Table parsing
  • List parsing
  • Reference parsing and potentially BibTeX export? Could have a --references flag
  • Tests!
  • Adding more wikis (and the confluence backend)
  • Proper SemVer
  • Add configuration file for configuring default behaviour (for less verbosity)

Documentation

Overview

Package main is the primary entry point for the wikiscrape CLI program.

Directories

Path Synopsis
cmd
internal
  export   Package export handles interface specification for wiki-agnostic exporters for page instances created by the scrape package.
  scrape   Package scrape handles interface specifications and concrete wiki-specific implementations for the scraping and parsing of content from pages served by various wiki frameworks.
