sitemap

package module
v0.0.0-...-d42e0ac Latest
Warning

This package is not in the latest version of its module.

Published: Feb 23, 2024 License: MIT Imports: 13 Imported by: 0

README

go-sitemap-parser

A Go package to parse XML Sitemaps compliant with the Sitemaps.org protocol.

Features

  • Recursive parsing

Formats supported

  • robots.txt
  • XML .xml
  • Gzip compressed XML .xml.gz
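Supporting both plain and gzip-compressed XML means the content has to be inspected before it is parsed. The following is a minimal sketch of one common way to do that, by checking for the gzip magic bytes; it is not taken from this package's source, and the helper names maybeGunzip and gzipBytes are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// maybeGunzip returns data unchanged unless it starts with the gzip magic
// bytes (0x1f 0x8b), in which case it returns the decompressed payload.
// Hypothetical helper, not part of go-sitemap-parser.
func maybeGunzip(data []byte) ([]byte, error) {
	if len(data) < 2 || data[0] != 0x1f || data[1] != 0x8b {
		return data, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// gzipBytes compresses data; it exists only to build demo input.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	plain := []byte("<urlset></urlset>")
	packed, err := gzipBytes(plain)
	if err != nil {
		panic(err)
	}
	out, err := maybeGunzip(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Checking the leading bytes rather than the .xml.gz suffix also covers servers that deliver compressed content under a plain .xml URL.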

Installation

go get github.com/aafeher/go-sitemap-parser
import "github.com/aafeher/go-sitemap-parser"

Usage

Create instance

To create a new instance with default settings, you can simply call the New() function.

s := sitemap.New()

Configuration defaults

  • userAgent: "go-sitemap-parser (+https://github.com/aafeher/go-sitemap-parser/blob/main/README.md)"
  • fetchTimeout: 3 seconds

Overwrite defaults

User Agent

To set the user agent, use the SetUserAgent() function.

s := sitemap.New()
s = s.SetUserAgent("YourUserAgent")

... or ...

s := sitemap.New().SetUserAgent("YourUserAgent")

Fetch timeout

To set the fetch timeout, use the SetFetchTimeout() function. The timeout is specified in seconds as a uint8 value.

s := sitemap.New()
s = s.SetFetchTimeout(10)

... or ...

s := sitemap.New().SetFetchTimeout(10)

Chaining methods

In both cases, the functions return a pointer to the main object of the package, allowing you to chain these setting methods in a fluent interface style:

s := sitemap.New().SetUserAgent("YourUserAgent").SetFetchTimeout(10)
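To illustrate why chaining works, here is a minimal sketch of the fluent setter pattern. The parser type and newParser constructor are hypothetical stand-ins for S and New(); the defaults are placeholders and the package's real implementation may differ.

```go
package main

import "fmt"

// parser is a hypothetical stand-in for the package's S type.
type parser struct {
	userAgent    string
	fetchTimeout uint8 // seconds
}

// newParser returns a parser with placeholder defaults.
func newParser() *parser {
	return &parser{userAgent: "example-agent", fetchTimeout: 3}
}

// Each setter mutates the receiver and returns it, which is what
// makes calls chainable.
func (p *parser) SetUserAgent(ua string) *parser {
	p.userAgent = ua
	return p
}

func (p *parser) SetFetchTimeout(sec uint8) *parser {
	p.fetchTimeout = sec
	return p
}

func main() {
	p := newParser().SetUserAgent("YourUserAgent").SetFetchTimeout(10)
	fmt.Println(p.userAgent, p.fetchTimeout)
}
```

Because every setter returns the same pointer it was called on, the chained form and the step-by-step form configure the same instance.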
Parse

Once you have properly initialized and configured your instance, you can parse sitemaps using the Parse() function.

The Parse() function takes in two parameters:

  • url: the URL of the sitemap to be parsed,
    • url can be a robots.txt or sitemapindex or sitemap (urlset)
  • urlContent: an optional string pointer for the content of the URL.

If you wish to provide the content yourself, pass a pointer to it as the second parameter. Otherwise, pass nil and the function will fetch the content on its own. Parse() fetches and parses sitemaps concurrently, using goroutines coordinated through Go's sync package.

s, err := s.Parse("https://www.sitemaps.org/sitemap.xml", nil)

In this example, the sitemap is parsed from "https://www.sitemaps.org/sitemap.xml". Because nil is passed as urlContent, the function fetches the content itself.
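The optional urlContent pointer follows a common Go idiom: nil means "not supplied". The sketch below shows that idiom in isolation. The names resolveContent and fetchFunc are hypothetical, and the fetcher is a stub rather than a real HTTP call.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc is a hypothetical hook standing in for an HTTP fetch.
type fetchFunc func(url string) (string, error)

// resolveContent returns *content when the caller supplied it and
// otherwise falls back to fetch(url).
func resolveContent(url string, content *string, fetch fetchFunc) (string, error) {
	if content != nil {
		return *content, nil
	}
	if fetch == nil {
		return "", errors.New("no content supplied and no fetcher available")
	}
	return fetch(url)
}

func main() {
	// Stub fetcher so the sketch runs without network access.
	stub := func(url string) (string, error) { return "fetched:" + url, nil }

	body := "<urlset></urlset>"
	supplied, _ := resolveContent("https://example.com/sitemap.xml", &body, stub)
	fetched, _ := resolveContent("https://example.com/sitemap.xml", nil, stub)
	fmt.Println(supplied)
	fmt.Println(fetched)
}
```

Supplying the content yourself is useful when the document is already in memory or comes from a source other than HTTP, such as a file or a test fixture.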

Examples

Examples can be found in /examples.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type S

type S struct {
	// contains filtered or unexported fields
}

S is a structure that holds various data related to processing URLs. Its fields are:

  • cfg (config): configuration settings
  • mainURL (string): the main URL being processed
  • mainURLContent (string): the content of the main URL
  • robotsTxtSitemapURLs ([]string): the URLs present in the robots.txt file's sitemap directive
  • sitemapLocations ([]string): the locations of the sitemap files
  • urls ([]URL): the URLs to be processed
  • errs ([]error): any errors encountered during processing
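As an illustration of where the robotsTxtSitemapURLs values come from, the following sketch extracts Sitemap directives from robots.txt content. It is an assumption about how such extraction can be done, not the package's actual code; sitemapURLsFromRobots is a hypothetical name.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapURLsFromRobots collects the values of "Sitemap:" lines.
// The directive name is matched case-insensitively.
func sitemapURLsFromRobots(robots string) []string {
	var urls []string
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Split at the first colon only, so the URL's own colons survive.
		name, value, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(name), "sitemap") {
			continue
		}
		if v := strings.TrimSpace(value); v != "" {
			urls = append(urls, v)
		}
	}
	return urls
}

func main() {
	robots := "User-agent: *\n" +
		"Disallow: /private/\n" +
		"Sitemap: https://example.com/sitemap.xml\n" +
		"sitemap: https://example.com/news.xml.gz\n"
	for _, u := range sitemapURLsFromRobots(robots) {
		fmt.Println(u)
	}
}
```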

func New

func New() *S

New creates a new instance of the S structure. It initializes the structure with default configuration values and returns a pointer to the created instance.

func (*S) GetErrors

func (s *S) GetErrors() []error

func (*S) GetErrorsCount

func (s *S) GetErrorsCount() int64

func (*S) GetRandomURLs

func (s *S) GetRandomURLs(n int) []URL

GetRandomURLs returns a slice of randomly selected URLs from the S object's URL list. The number of URLs to select is specified by the parameter n. If the S object is nil, an empty slice is returned. The function creates a copy of the original URLs list and randomly selects n URLs from it, removing them to avoid duplicates. The selected URLs are returned as a new slice.
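The copy-and-remove selection described above can be sketched as follows. This is an illustrative version that works on strings rather than URL values; randomSample is a hypothetical name, and clamping n to the list length is an assumption, since the documentation does not say what happens when n exceeds the number of URLs.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomSample returns up to n distinct elements of items, chosen at random.
// It works on a copy, so items itself is left untouched.
func randomSample(items []string, n int) []string {
	if n <= 0 || len(items) == 0 {
		return []string{}
	}
	pool := make([]string, len(items))
	copy(pool, items)
	if n > len(pool) {
		n = len(pool) // assumption: clamp instead of failing
	}
	picked := make([]string, 0, n)
	for i := 0; i < n; i++ {
		j := rand.Intn(len(pool))
		picked = append(picked, pool[j])
		// Remove the chosen element by moving the last one into its place.
		pool[j] = pool[len(pool)-1]
		pool = pool[:len(pool)-1]
	}
	return picked
}

func main() {
	urls := []string{"/a", "/b", "/c", "/d", "/e"}
	fmt.Println(len(randomSample(urls, 3)), len(urls))
}
```

Removing each pick from the pool is what guarantees that no element is returned twice.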

func (*S) GetURLCount

func (s *S) GetURLCount() int64

GetURLCount returns the count of URLs in the S struct.

func (*S) GetURLs

func (s *S) GetURLs() []URL

GetURLs returns the list of parsed URLs.

func (*S) Parse

func (s *S) Parse(url string, urlContent *string) (*S, error)

Parse is a method of the S structure. It parses the given URL and its content. It sets the mainURL field to the given URL and the mainURLContent field to the given URL content, and returns an error if setting the content fails.

If the URL ends with "/robots.txt", Parse reads the robots.txt file and fetches the sitemap files it mentions. These are fetched concurrently using goroutines and the wait group wg. If fetching a sitemap file fails, the error is appended to the errs field. The fetched content is checked and unzipped if necessary, then parsed, and the sitemap URLs found in it are fetched.

If the URL does not end with "/robots.txt", mainURLContent is checked and unzipped if necessary, then parsed, and the sitemap URLs found in it are fetched.

After all URLs are fetched and parsed, the method waits for all goroutines to complete using wg.Wait(). On success it returns the S structure and a nil error.

func (*S) SetFetchTimeout

func (s *S) SetFetchTimeout(fetchTimeout uint8) *S

SetFetchTimeout sets the fetch timeout for the Sitemap Parser. The fetch timeout determines how long the parser will wait for an HTTP request to complete. It is specified in seconds as a uint8 value. The function returns a pointer to the S structure to allow method chaining.

func (*S) SetUserAgent

func (s *S) SetUserAgent(userAgent string) *S

SetUserAgent sets the user agent for the Sitemap Parser. The user agent is used for making HTTP requests when parsing and fetching URLs. It should be a string representing the user agent header value. The function returns a pointer to the S structure to allow method chaining.

type URL

type URL struct {
	Loc        string         `xml:"loc"`
	LastMod    *lastModTime   `xml:"lastmod"`
	ChangeFreq *urlChangeFreq `xml:"changefreq"`
	Priority   *float32       `xml:"priority"`
}

URL represents a <url> element within a <urlset>.

type URLSet

type URLSet struct {
	XMLName xml.Name `xml:"urlset"`
	URL     []URL    `xml:"url"`
}

URLSet represents a <urlset> element.
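The struct tags above follow the encoding/xml conventions. The sketch below decodes a small urlset document with simplified, hypothetical mirror types (urlEntry and urlSet): lastmod and changefreq are kept as plain strings here because the package's own lastModTime and urlChangeFreq types are unexported.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlEntry and urlSet are simplified stand-ins for URL and URLSet.
type urlEntry struct {
	Loc        string   `xml:"loc"`
	LastMod    string   `xml:"lastmod"`
	ChangeFreq string   `xml:"changefreq"`
	Priority   *float32 `xml:"priority"` // nil when the element is absent
}

type urlSet struct {
	XMLName xml.Name   `xml:"urlset"`
	URL     []urlEntry `xml:"url"`
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-02-23</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>`

// parseURLSet decodes a <urlset> document into urlSet.
func parseURLSet(data []byte) (urlSet, error) {
	var set urlSet
	err := xml.Unmarshal(data, &set)
	return set, err
}

func main() {
	set, err := parseURLSet([]byte(sampleSitemap))
	if err != nil {
		panic(err)
	}
	for _, u := range set.URL {
		fmt.Println(u.Loc, u.Priority != nil)
	}
}
```

Pointer fields such as Priority make it possible to tell an omitted optional element apart from one that is present with a zero value.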

Directories

Path Synopsis
examples
