scraper

package
v0.1.1
Note: this package is not in the latest version of its module.
Published: Feb 11, 2023 License: MIT Imports: 26 Imported by: 0

Documentation

Index

Constants

const (
	// PageExtension is the file extension that downloaded pages get.
	PageExtension = ".html"
	// PageDirIndex is the file name of the index file for every dir.
	PageDirIndex = "index" + PageExtension
)

Variables

This section is empty.

Functions

func GetPageFilePath

func GetPageFilePath(url *url.URL) string

GetPageFilePath returns a filename for a URL that represents a page.

Types

type Config added in v0.1.1

type Config struct {
	URL      string
	Includes []string
	Excludes []string

	ImageQuality uint // image quality from 0 to 100%, 0 to disable reencoding
	MaxDepth     uint // download depth, 0 for unlimited
	Timeout      uint // time limit in seconds to process each http request

	OutputDirectory string
	Username        string
	Password        string

	Proxy string
}

Config contains the scraper configuration.
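A filled-in Config shows how the fields relate. The struct definition is copied verbatim from above so the snippet is self-contained; the field values are arbitrary illustrations, not recommended defaults:

```go
package main

import "fmt"

// Config is copied verbatim from the package documentation.
type Config struct {
	URL      string
	Includes []string
	Excludes []string

	ImageQuality uint // image quality from 0 to 100%, 0 to disable reencoding
	MaxDepth     uint // download depth, 0 for unlimited
	Timeout      uint // time limit in seconds to process each http request

	OutputDirectory string
	Username        string
	Password        string

	Proxy string
}

func main() {
	cfg := Config{
		URL:             "https://example.com",
		Excludes:        []string{"/admin/"}, // skip matching paths
		ImageQuality:    0,                   // 0 disables reencoding
		MaxDepth:        2,                   // follow links two levels deep
		Timeout:         30,                  // seconds per HTTP request
		OutputDirectory: "example.com",
	}
	fmt.Printf("%+v\n", cfg)
}
```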

type Scraper

type Scraper struct {
	URL *url.URL
	// contains filtered or unexported fields
}

Scraper contains all scraping data.

func New

func New(logger *zap.Logger, cfg Config) (*Scraper, error)

New creates a new Scraper instance.

func (*Scraper) GetFilePath

func (s *Scraper) GetFilePath(url *url.URL, isAPage bool) string

GetFilePath returns a file path for a URL to store the URL content in.

func (*Scraper) RemoveAnchor

func (s *Scraper) RemoveAnchor(path string) string

RemoveAnchor removes anchors from URLs.
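Assuming the straightforward implementation, removing an anchor means truncating the path at the first '#'. The sketch below illustrates that behavior; it is an assumption, not the package's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// removeAnchor sketches RemoveAnchor: it drops the fragment
// ("#section") portion of a path, if present.
func removeAnchor(path string) string {
	if i := strings.IndexByte(path, '#'); i >= 0 {
		return path[:i]
	}
	return path
}

func main() {
	fmt.Println(removeAnchor("/docs/intro.html#usage")) // /docs/intro.html
	fmt.Println(removeAnchor("/docs/intro.html"))       // /docs/intro.html
}
```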

func (*Scraper) Start

func (s *Scraper) Start() error

Start starts the scraping.
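Putting the pieces together, typical usage presumably looks like the sketch below: build a zap logger, construct a Config, call New, then Start. The import path is an assumption and not stated in this documentation:

```go
package main

import (
	"log"

	"go.uber.org/zap"

	// assumed import path for this package
	"github.com/cornelk/goscrape/scraper"
)

func main() {
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Sync()

	cfg := scraper.Config{
		URL:             "https://example.com",
		MaxDepth:        2,  // follow links two levels deep
		Timeout:         30, // seconds per HTTP request
		OutputDirectory: "example.com",
	}

	s, err := scraper.New(logger, cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := s.Start(); err != nil {
		log.Fatal(err)
	}
}
```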
