sitemapgen

package module
v0.0.0-...-3fc7efa
Published: May 30, 2022 License: MIT Imports: 14 Imported by: 0

README

Sitemap Generator

A CLI app for generating detailed sitemaps for websites that don't have one.

This software is in an early stage of development and has not been fully tested; use at your own risk.

Features:

  • Multiple workers for faster parsing and generation
  • Query-string parameter inclusion/exclusion for URLs matching regex rules
  • URL unification (avoiding duplicates)
  • Streamed flow (low memory usage)
  • Output filtering based on URLs
  • Proxy support for avoiding rate limiting
  • Simple and powerful configuration
  • robots.txt support

Building CLI:

  1. Download required libraries
  • go get github.com/eapache/channels
  • go get github.com/temoto/robotstxt
  2. Build it
  • cd sitemap-generator
  • go build

Usage:

./sitemap-generator -config config.json, where config.json is the path to the config file

Example config:

{
    "url": "http://example.com/", "#":"URL that will be parsed first",
    "parsing": {
        "workers": 2, "#":"Amount of parallel workers parsing pages",
        "parseExclusions": [
            "example.com/p/",
            "http://.+\\.example\\.com", "#Don't parse subdomains",
            "\\.jsp"
        ], "#":"Which sites not to parse, this doesn't exclude it from being but into sitemap file",
        "params": [
            {
              "regex": "",
              "include": true, "#": "Include means that only params specified below will be kept, Exclude will remove given params.",
              "params": ["id"]
            }
        ], "#":"Specifiec which params should be kept and which one should be stripped",
        "respectRobots": true, "#": "Whether robots.txt should be respected",
        "userAgent": "BOT-SitemapGenerator", "#": "UserAgent for requests",
        "noProxyClient": true, "#": "Whether to create http client without proxy",
        "requestsPerSec": 1, "#": "Amount of requests per client ",
        "stripQueryString": false, "#": "Whether to completely ignore query string",
        "stripWWW": true, "#": "Whether to treat www.example.com and example.com as thesame page.",
        "burst": 2, "#": "Request burst - accumulation of unused request opportunities from request per sec",
        "cutProtocol": true, "#": "Whether to remove http(s) protocol",
        "proxies": [
            {
                "address": "http://000.000.000.000", "#": "Proxy address",
                "username": "username", "#": "Username",
                "password": "password", "#": "Password"
            }
        ]
    },
    "output": [
        {
            "perFile": 4000, "#": "How many sites per file",
            "regex": "example.com/p/", "#": "Which sites should apply",
            "filePrefix": "products",
            "modifiers": {
              "changeFrequency": "daily",
              "priority": 0.1
            }
        }
    ]
}

ToDo:

  • Unit tests
  • Config documentation
  • Benchmarks
  • Adapt for library usage

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetRobots

func GetRobots(url *url.URL) (*robotstxt.RobotsData, error)

GetRobots gets the RobotsData for the given URL
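
A minimal sketch of calling GetRobots; the import path below is a placeholder (the real module path isn't shown here), and the expected behavior is inferred only from the doc comment above:

package main

import (
	"fmt"
	"log"
	"net/url"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path; the package name is sitemapgen
)

func main() {
	u, err := url.Parse("http://example.com/")
	if err != nil {
		log.Fatal(err)
	}

	// GetRobots returns the parsed robots.txt data (*robotstxt.RobotsData) for the given URL.
	robots, err := sitemapgen.GetRobots(u)
	if err != nil {
		log.Fatal(err)
	}

	// RobotsData (github.com/temoto/robotstxt) can then answer per-agent questions,
	// e.g. whether /p/123 may be crawled by the configured user agent.
	fmt.Println(robots.TestAgent("/p/123", "BOT-SitemapGenerator"))
}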

func ReorderAndCrop

func ReorderAndCrop(conf *config.ParsingConfig, url *url.URL)

ReorderAndCrop removes the anchor (#...) fragment, then sorts, removes, and encodes query-string parameters and lowercases the Host

func ShallParse

func ShallParse(conf *config.ParsingConfig, url string) bool

ShallParse checks whether the site's source should be parsed

func StripProtocol

func StripProtocol(url string) string

StripProtocol strips the protocol from a URL represented as a string

func StripWWW

func StripWWW(host string) string

StripWWW strips the www. prefix/subdomain from a host represented as a string
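
A small hedged sketch of the two helpers above; their exact output strings aren't documented on this page, so the expected values in the comments are only what the names and doc comments suggest, and the import path is a placeholder:

package main

import (
	"fmt"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path
)

func main() {
	// Strip the protocol from a URL string; "https://www.example.com/p/1"
	// should come back without its "https://" prefix.
	fmt.Println(sitemapgen.StripProtocol("https://www.example.com/p/1"))

	// Strip the leading "www." from a host string; "www.example.com"
	// should come back as "example.com".
	fmt.Println(sitemapgen.StripWWW("www.example.com"))
}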

Types

type Generator

type Generator struct {
	WorkerQueue *channels.InfiniteChannel
	// contains filtered or unexported fields
}

func NewGenerator

func NewGenerator(config *config.Config) *Generator

NewGenerator constructs a new sitemap generator instance. Call Start() to start the process.

func (*Generator) Start

func (sg *Generator) Start() error

Start gives the whole machine a spin. TODO: Divide and conquer :>
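
A sketch of programmatic use built only from the signatures on this page. It assumes the README's config.json can be decoded directly into config.Config with encoding/json (that field mapping is an assumption; the "#" comment keys would simply be ignored as unknown fields), and both import paths are placeholders:

package main

import (
	"encoding/json"
	"log"
	"os"

	sitemapgen "example.com/placeholder/sitemap-generator" // placeholder import path
	"example.com/placeholder/sitemap-generator/config"     // placeholder path for the config package
)

func main() {
	// Assumption: config.Config unmarshals straight from the JSON config shown in the README.
	raw, err := os.ReadFile("config.json")
	if err != nil {
		log.Fatal(err)
	}
	var cfg config.Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	// NewGenerator takes a *config.Config; Start runs the crawl and sitemap generation
	// (per the doc comment, it "gives the whole machine a spin").
	sg := sitemapgen.NewGenerator(&cfg)
	if err := sg.Start(); err != nil {
		log.Fatal(err)
	}
}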

type Validator

type Validator struct {
	Input chan *url.URL
	// contains filtered or unexported fields
}

Validator manages the address flow by pushing URLs to the relevant creation processes and makes sure no links are parsed twice.

func NewValidator

func NewValidator(config config.Config, workerQueue *channels.InfiniteChannel, waitGroup *sync.WaitGroup, robots *robotstxt.RobotsData, generator chan string) *Validator

NewValidator creates a new validator instance

type Worker

type Worker struct {
	// contains filtered or unexported fields
}

func NewWorker

func NewWorker(workQueue *channels.InfiniteChannel, validator chan<- *url.URL, waitGroup *sync.WaitGroup, generator chan<- string, httpClients chan *limit.Client) *Worker

func (*Worker) Start

func (w *Worker) Start()
