arachne

command module
v0.0.0-...-2ab5e6a Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 1, 2023 License: MIT Imports: 9 Imported by: 0

README

Arachne

Arachne is a web crawler implemented in Go, capable of crawling multiple URLs and saving the contents as Markdown files. This tool is perfect for those looking to build a local repository of web pages for offline reading or for performing subsequent data analysis.

Features

  • Scrape any webpage and save the content as Markdown files.
  • It allows you to specify starting URLs and the allowed URL prefixes for crawling.
  • Utilizes concurrency for faster scraping.
  • Automatically handles the conversion of HTML content to Markdown.
  • Respects and parses relative and absolute URLs.
  • Removes unnecessary elements from the webpage like scripts, styles, headers, footers, navigation, and sidebars to focus on the main content.

Usage

After cloning this repository and navigating into the project directory, you can run the program using:

go run main.go \
    --start "https://www.example.com,https://www.another-example.com" \
    --allow-prefix "https://www.example.com,https://www.another-example.com/some/path" \
    --out "/path/to/output"

The program accepts the following command-line arguments:

  • --start: A comma-separated list of URLs from which to start crawling.
  • --allow-prefix: A comma-separated list of URL prefixes which the crawler is allowed to visit.
  • --out: The directory where the Markdown files will be saved.

Note

Please use this tool responsibly, respecting the robots.txt files and not overloading servers. Be mindful of your network usage as well, as web scraping can consume a lot of bandwidth depending on the size of the websites you are scraping.

Contribution

This is an open-source project. Feel free to fork and make any contributions you feel will enhance the functionality of this web scraper.

License

This project is licensed under the MIT License - see the LICENSE file for details.

arachne

Documentation

The Go Gopher

There is no documentation for this package.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL