sitemapgen

package module
v0.0.0-...-3fc7efa
Published: May 30, 2022 License: MIT Imports: 14 Imported by: 0

README

Sitemap Generator

A CLI app for generating detailed sitemaps for websites that don't have one.

This software is in an early stage of development and has not been fully tested; use at your own risk.

Features:

  • Multiple workers for faster parsing and generation
  • Query-string parameter inclusion/exclusion for URLs matching regex rules
  • URL unification (avoiding duplicates)
  • Streamed flow (low memory usage)
  • Output filtering based on URLs
  • Proxy support for avoiding rate limiting
  • Simple and powerful configuration
  • robots.txt support

Building CLI:

  1. Download required libraries
  • go get github.com/eapache/channels
  • go get github.com/temoto/robotstxt
  2. Build it
  • cd sitemap-generator
  • go build

Usage:

./sitemap-generator -config config.json, where config.json is the path to the config file

Example config:

{
    "url": "http://example.com/", "#":"URL that will be parsed first",
    "parsing": {
        "workers": 2, "#":"Amount of parallel workers parsing pages",
        "parseExclusions": [
            "example.com/p/",
            "http://.+\\.example\\.com", "#Don't parse subdomains",
            "\\.jsp"
        ], "#":"Which sites not to parse, this doesn't exclude it from being but into sitemap file",
        "params": [
            {
              "regex": "",
              "include": true, "#": "Include means that only params specified below will be kept, Exclude will remove given params.",
              "params": ["id"]
            }
        ], "#":"Specifiec which params should be kept and which one should be stripped",
        "respectRobots": true, "#": "Whether robots.txt should be respected",
        "userAgent": "BOT-SitemapGenerator", "#": "UserAgent for requests",
        "noProxyClient": true, "#": "Whether to create http client without proxy",
        "requestsPerSec": 1, "#": "Amount of requests per client ",
        "stripQueryString": false, "#": "Whether to completely ignore query string",
        "stripWWW": true, "#": "Whether to treat www.example.com and example.com as thesame page.",
        "burst": 2, "#": "Request burst - accumulation of unused request opportunities from request per sec",
        "cutProtocol": true, "#": "Whether to remove http(s) protocol",
        "proxies": [
            {
                "address": "http://000.000.000.000", "#": "Proxy address",
                "username": "username", "#": "Username",
                "password": "password", "#": "Password"
            }
        ]
    },
    "output": [
        {
            "perFile": 4000, "#": "How many sites per file",
            "regex": "example.com/p/", "#": "Which sites should apply",
            "filePrefix": "products",
            "modifiers": {
              "changeFrequency": "daily",
              "priority": 0.1
            }
        }
    ]
}

ToDo:

  • Unit tests
  • Config documentation
  • Benchmarks
  • Adapt for library usage

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetRobots

func GetRobots(url *url.URL) (*robotstxt.RobotsData, error)

GetRobots gets the RobotsData for the given URL
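
A minimal sketch of calling GetRobots; the import path below is a placeholder (the real module path isn't shown here), and the expected behavior is inferred only from the doc comment above:

package main

import (
	"fmt"
	"log"
	"net/url"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path; the package name is sitemapgen
)

func main() {
	u, err := url.Parse("http://example.com/")
	if err != nil {
		log.Fatal(err)
	}

	// GetRobots returns the parsed robots.txt data (*robotstxt.RobotsData) for the given URL.
	robots, err := sitemapgen.GetRobots(u)
	if err != nil {
		log.Fatal(err)
	}

	// RobotsData (github.com/temoto/robotstxt) can then answer per-agent questions,
	// e.g. whether /p/123 may be crawled by the configured user agent.
	fmt.Println(robots.TestAgent("/p/123", "BOT-SitemapGenerator"))
}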

func ReorderAndCrop

func ReorderAndCrop(conf *config.ParsingConfig, url *url.URL)

ReorderAndCrop removes the anchor (#...) fragment, then sorts, removes, and encodes query-string parameters and lowercases the Host

func ShallParse

func ShallParse(conf *config.ParsingConfig, url string) bool

ShallParse checks whether the site's source should be parsed

func StripProtocol

func StripProtocol(url string) string

StripProtocol strips the protocol from a URL represented as a string

func StripWWW

func StripWWW(host string) string

StripWWW strips the www. prefix/subdomain from a host represented as a string
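
A small hedged sketch of the two helpers above; their exact output strings aren't documented on this page, so the expected values in the comments are only what the names and doc comments suggest, and the import path is a placeholder:

package main

import (
	"fmt"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path
)

func main() {
	// Strip the protocol from a URL string; "https://www.example.com/p/1"
	// should come back without its "https://" prefix.
	fmt.Println(sitemapgen.StripProtocol("https://www.example.com/p/1"))

	// Strip the leading "www." from a host string; "www.example.com"
	// should come back as "example.com".
	fmt.Println(sitemapgen.StripWWW("www.example.com"))
}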

Types

type Generator

type Generator struct {
	WorkerQueue *channels.InfiniteChannel
	// contains filtered or unexported fields
}

func NewGenerator

func NewGenerator(config *config.Config) *Generator

NewGenerator constructs a new sitemap generator instance. Call Start() to start the process.

func (*Generator) Start

func (sg *Generator) Start() error

Start gives the whole machine a spin. TODO: Divide and conquer :>
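
A sketch of programmatic use built only from the signatures on this page. It assumes the README's config.json can be decoded directly into config.Config with encoding/json (that field mapping is an assumption; the "#" comment keys would simply be ignored as unknown fields), and both import paths are placeholders:

package main

import (
	"encoding/json"
	"log"
	"os"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path
	"example.com/placeholder/sitemap-generator/config"     // placeholder path for the config package
)

func main() {
	// Assumption: config.Config unmarshals straight from the JSON config shown in the README.
	raw, err := os.ReadFile("config.json")
	if err != nil {
		log.Fatal(err)
	}
	var cfg config.Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	// NewGenerator takes a *config.Config; Start runs the crawl and sitemap generation
	// (per the doc comment, it "gives the whole machine a spin").
	sg := sitemapgen.NewGenerator(&cfg)
	if err := sg.Start(); err != nil {
		log.Fatal(err)
	}
}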

type Validator

type Validator struct {
	Input chan *url.URL
	// contains filtered or unexported fields
}

Validator manages the address flow by pushing URLs to the relevant creation processes and makes sure no links are parsed twice.

func NewValidator

func NewValidator(config config.Config, workerQueue *channels.InfiniteChannel, waitGroup *sync.WaitGroup, robots *robotstxt.RobotsData, generator chan string) *Validator

NewValidator creates a new validator instance

type Worker

type Worker struct {
	// contains filtered or unexported fields
}

func NewWorker

func NewWorker(workQueue *channels.InfiniteChannel, validator chan<- *url.URL, waitGroup *sync.WaitGroup, generator chan<- string, httpClients chan *limit.Client) *Worker

func (*Worker) Start

func (w *Worker) Start()
