gophetch

Version: v0.0.3
Published: Oct 14, 2023 | License: Apache-2.0 | Imports: 15 | Imported by: 0

README

gophetch

GoPhetch is a library for parsing and extracting metadata and other details from HTML.

Documentation

Overview

Package gophetch is a library for fetching and extracting metadata from HTML pages.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractDomain

func ExtractDomain(rawURL string) (string, error)

ExtractDomain extracts the domain from a given URL string.
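
As a rough illustration of what extracting a domain from a URL string can involve, the sketch below uses only the standard library. `hostOf` is a hypothetical helper, not the package's implementation, and it assumes that "domain" means the host name without the port. ExtractDomain itself may behave differently, for example around subdomains.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// hostOf is a hypothetical stand-in for illustration only.
// It parses rawURL and returns the host name with any port stripped.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		// Relative references and scheme-less strings carry no host.
		return "", errors.New("URL has no host")
	}
	return host, nil
}

func main() {
	host, err := hostOf("https://example.com:8080/path?q=1")
	fmt.Println(host, err)
}
```

Returning an error for host-less input mirrors the `(string, error)` shape of the documented signature.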

Types

type Extractor

type Extractor struct {
	Rules  map[string]rules.Rule
	Errors []error
}

Extractor is the struct that encapsulates the rules used to extract metadata from HTML.

func NewExtractor

func NewExtractor() *Extractor

NewExtractor creates a new Extractor struct with the default rules.

func (*Extractor) ApplySiteSpecificRules

func (e *Extractor) ApplySiteSpecificRules(site sites.Site)

ApplySiteSpecificRules applies the custom rules for the given site.

func (*Extractor) ExtractMetadata

func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)

ExtractMetadata extracts metadata from the given HTML node. The targetURL parameter is used to resolve relative paths found in the document.
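
To make the role of the target URL concrete, here is a minimal standard-library sketch of turning a relative path into an absolute one. `resolveRef` is a hypothetical helper and the example URLs are placeholders. How the Extractor performs this step internally is not shown in this documentation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRef is a hypothetical helper: it resolves ref against base the way
// an href or src attribute is resolved against the page URL.
func resolveRef(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/blog/post.html"
	refs := []string{"../img/cover.png", "/favicon.ico", "https://cdn.example.org/a.png"}
	for _, ref := range refs {
		abs, err := resolveRef(base, ref)
		fmt.Println(abs, err)
	}
}
```

References that are already absolute pass through unchanged, so the same helper can be applied to every URL found in a page.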

func (*Extractor) ExtractRule

func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)

func (*Extractor) ExtractRuleByKey

func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)

type Gophetch

type Gophetch struct {
	Parser       *Parser
	Extractor    *Extractor
	Fetchers     []fetchers.HTMLFetcher
	SiteRegistry map[string]sites.Site
}

Gophetch is the main struct that encapsulates the parser, extractor, and fetchers.

func New

func New(fetchers ...fetchers.HTMLFetcher) *Gophetch

New creates a new Gophetch struct with the provided fetchers.

func (*Gophetch) FetchAndParse

func (g *Gophetch) FetchAndParse(targetURL string) (Result, error)

FetchAndParse accepts a target URL string as its parameter. It initiates an HTTP request to fetch the HTML content from the specified URL, parses the fetched HTML to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content needs to be fetched from the internet before parsing.

func (*Gophetch) ReadAndParse

func (g *Gophetch) ReadAndParse(r io.Reader, targetURL string) (Result, error)

ReadAndParse accepts two parameters: an io.Reader containing the HTML to be parsed, and a target URL string. It reads the HTML content from the provided io.Reader, parses it to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content is already available and does not need to be fetched from the internet.

func (*Gophetch) RegisterSite

func (g *Gophetch) RegisterSite(site sites.Site)

RegisterSite registers a site with the Gophetch instance. This allows the Gophetch instance to apply site-specific rules when extracting metadata from the HTML content.

type Headers

type Headers map[string][]string

Headers is a map of HTTP headers.
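
Because Headers has the same underlying type as `http.Header`, values can be converted between the two. The sketch below redeclares the type locally so that it is self-contained. `firstValue` and `demo` are hypothetical helpers, and whether the package canonicalizes header keys is an assumption not confirmed by this page.

```go
package main

import (
	"fmt"
	"net/http"
)

// Headers is redeclared here to mirror the documented type.
type Headers map[string][]string

// firstValue returns the first value stored under key, or "" if none exists.
// Unlike http.Header.Get, a plain map lookup does not canonicalize the key.
func firstValue(h Headers, key string) string {
	if vals := h[key]; len(vals) > 0 {
		return vals[0]
	}
	return ""
}

// demo builds an http.Header, converts it to Headers, and looks up the same
// header with a canonical and a lowercase key.
func demo() (canonical, lower string) {
	std := http.Header{}
	std.Set("content-type", "text/html; charset=utf-8") // stored as "Content-Type"
	h := Headers(std)
	return firstValue(h, "Content-Type"), firstValue(h, "content-type")
}

func main() {
	canonical, lower := demo()
	fmt.Printf("%q %q\n", canonical, lower)
}
```

The lowercase lookup comes back empty, which is worth keeping in mind when indexing the map directly.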

type ImageFetcher added in v0.0.3

type ImageFetcher interface {
	NewImageFromURL(url string) (*image.Image, error)
}

ImageFetcher is an interface for fetching images.

type ImageInliner added in v0.0.3

type ImageInliner struct {
	ShouldInline ShouldInlineFunc
	// contains filtered or unexported fields
}

ImageInliner is responsible for fetching and replacing images in HTML documents.

func NewImageInliner added in v0.0.3

func NewImageInliner(opts ImageInlinerOptions) *ImageInliner

NewImageInliner creates a new ImageInliner from the given options, which carry the fetcher, upload function, storage strategy, and size limits.

func (*ImageInliner) InlineImages added in v0.0.3

func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)

InlineImages replaces image URLs with either base64 inline versions or cloud URLs based on the set strategy.
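
For readers unfamiliar with base64 inlining, the following sketch shows how raw image bytes become a `data:` URI that can replace an image URL in HTML. `dataURI` is a hypothetical helper and the byte slice is placeholder data rather than a real image. It is not taken from the package's source.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURI is a hypothetical helper that encodes data as a base64 data URI
// with the given MIME type, suitable for use in an <img src="..."> attribute.
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// Placeholder bytes; a real caller would pass the fetched image content.
	fmt.Println(dataURI("image/png", []byte("hi")))
}
```

Base64 output is roughly a third larger than the input, which is one reason size limits matter when inlining.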

type ImageInlinerOptions added in v0.0.3

type ImageInlinerOptions struct {
	// ShouldInlineFunc is the function to use for determining whether an image should be inlined. Default is to inline
	// if image size is less than 100KB or if dimensions are smaller than 800x600 (based on the maxContentSize, maxWidth,
	// and maxHeight options).
	ShouldInlineFunc ShouldInlineFunc
	// Fetcher is the ImageFetcher to use for fetching images.
	Fetcher ImageFetcher
	// UploadFunc is the function to use for uploading images to cloud storage.
	UploadFunc UploadFunc
	// Strategy is the storage strategy to use.
	Strategy InlineStrategy
	// MaxContentSize is the maximum size in bytes for images to be processed in a hybrid strategy. Default is 100KB.
	MaxContentSize int64
	// MaxWidth is the maximum width in pixels for images to be processed in a hybrid strategy. Default is 800.
	MaxWidth int
	// MaxHeight is the maximum height in pixels for images to be processed in a hybrid strategy. Default is 600.
	MaxHeight int
}

ImageInlinerOptions are options for creating a new ImageInliner.
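
The default inlining rule described in the comments above can be sketched as a small predicate. Everything here is illustrative: `imageInfo` and `limits` are hypothetical types (the package's own `image.Image` type is not documented on this page), 100KB is assumed to mean 100*1024 bytes, and "smaller than 800x600" is assumed to mean both width and height are under their limits.

```go
package main

import "fmt"

// imageInfo is a hypothetical stand-in for whatever the package's image type exposes.
type imageInfo struct {
	ContentSize int64 // size in bytes
	Width       int   // pixels
	Height      int   // pixels
}

// limits mirrors the MaxContentSize, MaxWidth, and MaxHeight options.
type limits struct {
	MaxContentSize int64
	MaxWidth       int
	MaxHeight      int
}

// defaultLimits returns the defaults stated in the option comments.
func defaultLimits() limits {
	return limits{MaxContentSize: 100 * 1024, MaxWidth: 800, MaxHeight: 600}
}

// shouldInline reports whether an image is small enough to inline:
// either its byte size or its pixel dimensions fall under the limits.
func shouldInline(img imageInfo, l limits) bool {
	smallBytes := img.ContentSize < l.MaxContentSize
	smallDims := img.Width < l.MaxWidth && img.Height < l.MaxHeight
	return smallBytes || smallDims
}

func main() {
	l := defaultLimits()
	fmt.Println(shouldInline(imageInfo{ContentSize: 50 * 1024, Width: 2000, Height: 2000}, l))
	fmt.Println(shouldInline(imageInfo{ContentSize: 500 * 1024, Width: 1920, Height: 1080}, l))
}
```

A custom ShouldInlineFunc could apply a stricter rule, such as requiring both conditions instead of either one.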

type InlineStrategy added in v0.0.3

type InlineStrategy string

const (
	// StrategyInline stores images as base64 inline
	StrategyInline InlineStrategy = "inline"
	// StrategyHybrid stores images as base64 inline if they are less than maxContentSize or
	// smaller than maxWidth x maxHeight, otherwise it uploads them to cloud storage
	StrategyHybrid InlineStrategy = "hybrid"
	// StrategyUpload stores images in cloud storage
	StrategyUpload InlineStrategy = "upload"
)
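
The three strategies can be read as a simple dispatch. The sketch below redeclares the type and constants locally and uses a hypothetical `destination` function. The `small` argument stands for the result of a ShouldInlineFunc-style check. This illustrates the documented behavior and is not the package's code.

```go
package main

import "fmt"

// InlineStrategy and its constants are redeclared here to mirror the documented ones.
type InlineStrategy string

const (
	StrategyInline InlineStrategy = "inline"
	StrategyHybrid InlineStrategy = "hybrid"
	StrategyUpload InlineStrategy = "upload"
)

// destination is a hypothetical helper that reports where an image ends up:
// "inline" (base64 in the HTML) or "upload" (cloud storage).
func destination(s InlineStrategy, small bool) (string, error) {
	switch s {
	case StrategyInline:
		return "inline", nil
	case StrategyUpload:
		return "upload", nil
	case StrategyHybrid:
		// Hybrid inlines small images and uploads the rest.
		if small {
			return "inline", nil
		}
		return "upload", nil
	default:
		return "", fmt.Errorf("unknown strategy %q", s)
	}
}

func main() {
	for _, s := range []InlineStrategy{StrategyInline, StrategyHybrid, StrategyUpload} {
		dest, err := destination(s, false)
		fmt.Println(s, dest, err)
	}
}
```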

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser is the struct that encapsulates the HTML parser.

func NewParser

func NewParser() *Parser

NewParser creates a new Parser struct.

func (*Parser) Headers

func (p *Parser) Headers() Headers

Headers returns the HTTP headers as a map.

func (*Parser) IsHTML

func (p *Parser) IsHTML() bool

IsHTML returns true if the response is HTML, false otherwise.

func (*Parser) MimeType

func (p *Parser) MimeType() string

MimeType returns the MIME type of the response.
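
One common way to derive a MIME type and an HTML check from a `Content-Type` header is shown below, using the standard `mime` package. `mediaType` and `isHTML` are hypothetical helpers. Treating `application/xhtml+xml` as HTML is an assumption of this sketch, and the Parser may determine these values differently.

```go
package main

import (
	"fmt"
	"mime"
)

// mediaType returns the lowercased media type from a Content-Type value,
// dropping parameters such as charset. It returns "" if parsing fails.
func mediaType(contentType string) string {
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return mt
}

// isHTML reports whether the Content-Type value denotes an HTML document.
func isHTML(contentType string) bool {
	switch mediaType(contentType) {
	case "text/html", "application/xhtml+xml":
		return true
	}
	return false
}

func main() {
	for _, ct := range []string{"Text/HTML; charset=UTF-8", "application/json", ""} {
		fmt.Printf("%q -> %q %v\n", ct, mediaType(ct), isHTML(ct))
	}
}
```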

func (*Parser) Node

func (p *Parser) Node() *html.Node

Node returns the parsed HTML as an *html.Node.

func (*Parser) Parse

func (p *Parser) Parse(reader io.Reader, resp *http.Response, targetURL string) error

Parse parses the HTML content from the provided io.Reader, and encapsulates the parsed HTML into a html.Node struct. It will also parse the HTTP headers from the provided http.Response struct. The targetURL parameter is used to fix relative paths.

func (*Parser) URL

func (p *Parser) URL() *url.URL

URL returns the target URL as a *url.URL.

type RealImageFetcher added in v0.0.3

type RealImageFetcher struct{}

RealImageFetcher is an ImageFetcher that uses the actual fetching implementation.

func (*RealImageFetcher) NewImageFromURL added in v0.0.3

func (r *RealImageFetcher) NewImageFromURL(url string) (*image.Image, error)

type Result

type Result struct {
	HTMLNode    *html.Node
	Headers     map[string][]string
	IsHTML      bool
	Metadata    metadata.Metadata
	MimeType    string
	Response    *http.Response
	StatusCode  int
	FetcherName string
}

Result is the struct that encapsulates the extracted metadata, along with the response data.

type ShouldInlineFunc added in v0.0.3

type ShouldInlineFunc func(*image.Image) bool

type UploadFunc added in v0.0.3

type UploadFunc func(*image.Image) (string, error)
