scraper

package
v0.1.1
Note: this package is not in the latest version of its module.
Published: Feb 11, 2023 License: MIT Imports: 26 Imported by: 0

Documentation

Index

Constants

const (
	// PageExtension is the file extension that downloaded pages get.
	PageExtension = ".html"
	// PageDirIndex is the file name of the index file for every dir.
	PageDirIndex = "index" + PageExtension
)

Variables

This section is empty.

Functions

func GetPageFilePath

func GetPageFilePath(url *url.URL) string

GetPageFilePath returns a filename for a URL that represents a page.

Types

type Config added in v0.1.1

type Config struct {
	URL      string
	Includes []string
	Excludes []string

	ImageQuality uint // image quality from 0 to 100%, 0 to disable reencoding
	MaxDepth     uint // download depth, 0 for unlimited
	Timeout      uint // time limit in seconds to process each http request

	OutputDirectory string
	Username        string
	Password        string

	Proxy string
}

Config contains the scraper configuration.
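A filled-in Config shows how the fields relate. The struct definition is copied verbatim from above so the snippet is self-contained; the field values are arbitrary illustrations, not recommended defaults:

```go
package main

import "fmt"

// Config is copied verbatim from the package documentation.
type Config struct {
	URL      string
	Includes []string
	Excludes []string

	ImageQuality uint // image quality from 0 to 100%, 0 to disable reencoding
	MaxDepth     uint // download depth, 0 for unlimited
	Timeout      uint // time limit in seconds to process each http request

	OutputDirectory string
	Username        string
	Password        string

	Proxy string
}

func main() {
	cfg := Config{
		URL:             "https://example.com",
		Excludes:        []string{"/admin/"}, // skip matching paths
		ImageQuality:    0,                   // 0 disables reencoding
		MaxDepth:        2,                   // follow links two levels deep
		Timeout:         30,                  // seconds per HTTP request
		OutputDirectory: "example.com",
	}
	fmt.Printf("%+v\n", cfg)
}
```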

type Scraper

type Scraper struct {
	URL *url.URL
	// contains filtered or unexported fields
}

Scraper contains all scraping data.

func New

func New(logger *zap.Logger, cfg Config) (*Scraper, error)

New creates a new Scraper instance.

func (*Scraper) GetFilePath

func (s *Scraper) GetFilePath(url *url.URL, isAPage bool) string

GetFilePath returns a file path for a URL to store the URL content in.

func (*Scraper) RemoveAnchor

func (s *Scraper) RemoveAnchor(path string) string

RemoveAnchor removes anchors from URLs.
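Assuming the straightforward implementation, removing an anchor means truncating the path at the first '#'. The sketch below illustrates that behavior; it is an assumption, not the package's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// removeAnchor sketches RemoveAnchor: it drops the fragment
// ("#section") portion of a path, if present.
func removeAnchor(path string) string {
	if i := strings.IndexByte(path, '#'); i >= 0 {
		return path[:i]
	}
	return path
}

func main() {
	fmt.Println(removeAnchor("/docs/intro.html#usage")) // /docs/intro.html
	fmt.Println(removeAnchor("/docs/intro.html"))       // /docs/intro.html
}
```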

func (*Scraper) Start

func (s *Scraper) Start() error

Start starts the scraping.
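Putting the pieces together, typical usage presumably looks like the sketch below: build a zap logger, construct a Config, call New, then Start. The import path is an assumption and not stated in this documentation:

```go
package main

import (
	"log"

	"go.uber.org/zap"

	// assumed import path for this package
	"github.com/cornelk/goscrape/scraper"
)

func main() {
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Sync()

	cfg := scraper.Config{
		URL:             "https://example.com",
		MaxDepth:        2,  // follow links two levels deep
		Timeout:         30, // seconds per HTTP request
		OutputDirectory: "example.com",
	}

	s, err := scraper.New(logger, cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := s.Start(); err != nil {
		log.Fatal(err)
	}
}
```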
