gophetch

package module

v0.0.3 (latest) | Published: Oct 14, 2023 | License: Apache-2.0 | Imports: 15 | Imported by: 0

Repository

github.com/minsoft-io/gophetch

README ¶

gophetch

GoPhetch is a library for parsing and extracting metadata and other details from HTML.

Documentation ¶

Overview ¶

Package gophetch is a library for fetching and extracting metadata from HTML pages.

Index ¶

func ExtractDomain(rawURL string) (string, error)
type Extractor
- func NewExtractor() *Extractor
- func (e *Extractor) ApplySiteSpecificRules(site sites.Site)
- func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)
- func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)
- func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)
type Gophetch
- func New(fetchers ...fetchers.HTMLFetcher) *Gophetch
- func (g *Gophetch) FetchAndParse(targetURL string) (Result, error)
- func (g *Gophetch) ReadAndParse(r io.Reader, targetURL string) (Result, error)
- func (g *Gophetch) RegisterSite(site sites.Site)
type Headers
type ImageFetcher
type ImageInliner
- func NewImageInliner(opts ImageInlinerOptions) *ImageInliner
- func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)
type ImageInlinerOptions
type InlineStrategy
type Parser
- func NewParser() *Parser
- func (p *Parser) Headers() Headers
- func (p *Parser) IsHTML() bool
- func (p *Parser) MimeType() string
- func (p *Parser) Node() *html.Node
- func (p *Parser) Parse(reader io.Reader, resp *http.Response, targetURL string) error
- func (p *Parser) URL() *url.URL
type RealImageFetcher
- func (r *RealImageFetcher) NewImageFromURL(url string) (*image.Image, error)
type Result
type ShouldInlineFunc
type UploadFunc

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func ExtractDomain ¶

func ExtractDomain(rawURL string) (string, error)

ExtractDomain extracts the domain from a given URL string.
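The documentation does not say exactly what counts as "the domain" or how malformed input is handled. As a rough illustration only, the following standard-library sketch shows one plausible behaviour: return the host name without the port and report an error when the input has no host. The helper name hostOf is hypothetical and is not part of gophetch.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf is a hypothetical stand-in for ExtractDomain. It parses rawURL
// and returns the host name with any port removed.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		// Inputs such as "example.com/path" parse as a relative path with no host.
		return "", fmt.Errorf("no host in %q", rawURL)
	}
	return host, nil
}

func main() {
	host, err := hostOf("https://example.com:8080/articles/1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(host) // prints "example.com"
}
```

Note that a URL without a scheme is treated as an error in this sketch; whether ExtractDomain does the same is not stated in the source.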

Types ¶

type Extractor ¶

type Extractor struct {
	Rules  map[string]rules.Rule
	Errors []error
}

Extractor is the struct that encapsulates the rules used to extract metadata from HTML.

func NewExtractor ¶

func NewExtractor() *Extractor

NewExtractor creates a new Extractor struct with the default rules.

func (*Extractor) ApplySiteSpecificRules ¶

func (e *Extractor) ApplySiteSpecificRules(site sites.Site)

ApplySiteSpecificRules applies the custom rules for the given site.

func (*Extractor) ExtractMetadata ¶

func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)

ExtractMetadata extracts metadata from the given HTML node. The targetURL parameter is used to resolve relative paths.
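To make "relative paths" concrete: a page's metadata often references images or canonical links relative to the page itself. The sketch below, which uses only net/url, shows the kind of resolution a target URL makes possible. The helper resolve is hypothetical, and this is not necessarily how the Extractor does it internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper that resolves ref against base,
// following the same rules a browser applies to links on a page.
func resolve(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/blog/post.html"
	for _, ref := range []string{"/img/cover.png", "thumb.png", "https://cdn.example.net/a.png"} {
		abs, err := resolve(base, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Absolute references pass through unchanged, so the same helper can be applied to every URL found in the document.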

func (*Extractor) ExtractRule ¶

func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)

func (*Extractor) ExtractRuleByKey ¶

func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)

type Gophetch ¶

type Gophetch struct {
	Parser       *Parser
	Extractor    *Extractor
	Fetchers     []fetchers.HTMLFetcher
	SiteRegistry map[string]sites.Site
}

Gophetch is the main struct that encapsulates the parser, extractor, and fetchers.

func New ¶

func New(fetchers ...fetchers.HTMLFetcher) *Gophetch

New creates a new Gophetch struct with the provided fetchers.

func (*Gophetch) FetchAndParse ¶

func (g *Gophetch) FetchAndParse(targetURL string) (Result, error)

FetchAndParse takes a target URL string. It issues an HTTP request to fetch the HTML at that URL, parses the HTML to extract metadata, and returns the metadata together with the response data in a Result. Use it when the HTML must be fetched from the internet before parsing.

func (*Gophetch) ReadAndParse ¶

func (g *Gophetch) ReadAndParse(r io.Reader, targetURL string) (Result, error)

ReadAndParse takes an io.Reader containing the HTML to be parsed and a target URL string. It reads the HTML from the reader, parses it to extract metadata, and returns the metadata together with the response data in a Result. Use it when the HTML is already available and does not need to be fetched from the internet.

func (*Gophetch) RegisterSite ¶

func (g *Gophetch) RegisterSite(site sites.Site)

RegisterSite registers a site with the Gophetch instance. This allows the Gophetch instance to apply site-specific rules when extracting metadata from the HTML content.

type Headers ¶

type Headers map[string][]string

Headers is a map of HTTP headers.
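Because Headers has the same underlying type as net/http's http.Header, Go allows a direct conversion between the two. The sketch below redeclares Headers locally so that it compiles on its own. The helper first is hypothetical, and whether gophetch stores keys in canonical form is an assumption, not something the documentation states.

```go
package main

import (
	"fmt"
	"net/http"
)

// Headers mirrors the type declared above so this sketch is self-contained.
type Headers map[string][]string

// first returns the first value stored under key, or "" if there is none.
// Plain map indexing is case-sensitive, unlike http.Header.Get.
func first(h Headers, key string) string {
	if v := h[key]; len(v) > 0 {
		return v[0]
	}
	return ""
}

func main() {
	std := http.Header{}
	std.Set("content-type", "text/html; charset=utf-8") // Set canonicalizes the key

	h := Headers(std) // legal conversion: identical underlying type
	fmt.Println(first(h, "Content-Type"))
	fmt.Println(first(h, "content-type") == "") // true: lookup is case-sensitive
}
```

When reading values out of a plain map like this, look keys up in the exact form in which they were stored.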

type ImageFetcher ¶ added in v0.0.3

type ImageFetcher interface {
	NewImageFromURL(url string) (*image.Image, error)
}

ImageFetcher is an interface for fetching images.

type ImageInliner ¶ added in v0.0.3

type ImageInliner struct {
	ShouldInline ShouldInlineFunc
	// contains filtered or unexported fields
}

ImageInliner is responsible for fetching and replacing images in HTML documents.

func NewImageInliner ¶ added in v0.0.3

func NewImageInliner(opts ImageInlinerOptions) *ImageInliner

NewImageInliner creates a new ImageInliner from the given options, which supply the fetcher, upload function, and storage strategy.

func (*ImageInliner) InlineImages ¶ added in v0.0.3

func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)

InlineImages replaces image URLs with either base64 inline versions or cloud URLs based on the set strategy.
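For readers unfamiliar with "base64 inline" images: the usual form is a data URI placed in the src attribute of an img element. The following sketch builds one with the standard library. The helper dataURI is hypothetical, the payload is placeholder bytes rather than a real PNG, and the exact output format InlineImages produces is not specified in the source.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURI builds a base64 data URI from a MIME type and raw bytes.
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// Placeholder payload; a real caller would pass the image file's bytes.
	fmt.Println(dataURI("image/png", []byte("hi"))) // prints "data:image/png;base64,aGk="
}
```

Base64 grows the payload by roughly one third, which is one reason to inline only small images.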

type ImageInlinerOptions ¶ added in v0.0.3

type ImageInlinerOptions struct {
	// ShouldInlineFunc is the function to use for determining whether an image should be inlined. Default is to inline
	// if image size is less than 100KB or if dimensions are smaller than 800x600 (based on the maxContentSize, maxWidth,
	// and maxHeight options).
	ShouldInlineFunc ShouldInlineFunc
	// Fetcher is the ImageFetcher to use for fetching images.
	Fetcher ImageFetcher
	// UploadFunc is the function to use for uploading images to cloud storage.
	UploadFunc UploadFunc
	// Strategy is the storage strategy to use.
	Strategy InlineStrategy
	// MaxContentSize is the maximum size in bytes for images to be processed in a hybrid strategy. Default is 100KB.
	MaxContentSize int64
	// MaxWidth is the maximum width in pixels for images to be processed in a hybrid strategy. Default is 800.
	MaxWidth int
	// MaxHeight is the maximum height in pixels for images to be processed in a hybrid strategy. Default is 600.
	MaxHeight int
}

ImageInlinerOptions are options for creating a new ImageInliner.

type InlineStrategy ¶ added in v0.0.3

type InlineStrategy string

const (
	// StrategyInline stores images as base64 inline
	StrategyInline InlineStrategy = "inline"
	// StrategyHybrid stores images as base64 inline if they are less than maxContentSize or
	// smaller than maxWidth x maxHeight, otherwise it uploads them to cloud storage
	StrategyHybrid InlineStrategy = "hybrid"
	// StrategyUpload stores images in cloud storage
	StrategyUpload InlineStrategy = "upload"
)
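The three strategies and the documented defaults can be sketched as a small decision function. Everything here is illustrative: imageInfo, limits, shouldInline, and target are hypothetical names, "100KB" is assumed to mean 100*1024 bytes, and "smaller than maxWidth x maxHeight" is assumed to mean that both dimensions are under their limits. The library's real image type and comparison rules may differ.

```go
package main

import "fmt"

// imageInfo is a hypothetical stand-in for the library's image type.
type imageInfo struct {
	ContentSize   int64 // size in bytes
	Width, Height int   // dimensions in pixels
}

// limits holds the thresholds described by MaxContentSize, MaxWidth, and MaxHeight.
type limits struct {
	MaxContentSize int64
	MaxWidth       int
	MaxHeight      int
}

// defaultLimits uses the documented defaults (100KB assumed to be 100*1024 bytes).
var defaultLimits = limits{MaxContentSize: 100 * 1024, MaxWidth: 800, MaxHeight: 600}

// shouldInline reports whether an image is small in bytes OR small in both dimensions.
func shouldInline(img imageInfo, l limits) bool {
	smallBytes := img.ContentSize < l.MaxContentSize
	smallDims := img.Width < l.MaxWidth && img.Height < l.MaxHeight
	return smallBytes || smallDims
}

// target returns where an image would end up under the named strategy.
func target(strategy string, img imageInfo, l limits) string {
	switch strategy {
	case "inline":
		return "inline"
	case "upload":
		return "upload"
	case "hybrid":
		if shouldInline(img, l) {
			return "inline"
		}
		return "upload"
	default:
		return "unknown"
	}
}

func main() {
	photo := imageInfo{ContentSize: 500 * 1024, Width: 1920, Height: 1080}
	icon := imageInfo{ContentSize: 2 * 1024, Width: 32, Height: 32}
	fmt.Println(target("hybrid", photo, defaultLimits)) // upload
	fmt.Println(target("hybrid", icon, defaultLimits))  // inline
	fmt.Println(target("inline", photo, defaultLimits)) // inline
}
```

Under the hybrid strategy only the large photo is routed to upload; the other two strategies ignore the limits entirely.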

type Parser ¶

type Parser struct {
	// contains filtered or unexported fields
}

Parser is the struct that encapsulates the HTML parser.

func NewParser ¶

func NewParser() *Parser

NewParser creates a new Parser struct.

func (*Parser) Headers ¶

func (p *Parser) Headers() Headers

Headers returns the HTTP headers as a map.

func (*Parser) IsHTML ¶

func (p *Parser) IsHTML() bool

IsHTML returns true if the response is HTML, false otherwise.

func (*Parser) MimeType ¶

func (p *Parser) MimeType() string

MimeType returns the MIME type of the response.
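How the Parser derives the values behind IsHTML and MimeType is not documented here. A common approach, shown below as an assumption rather than the library's method, is to parse the Content-Type header with the standard mime package. The helpers mediaType and isHTML are hypothetical, and treating application/xhtml+xml as HTML is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"mime"
)

// mediaType strips parameters such as charset from a Content-Type value.
func mediaType(contentType string) (string, error) {
	mt, _, err := mime.ParseMediaType(contentType)
	return mt, err
}

// isHTML reports whether a Content-Type value denotes an HTML document.
// Accepting XHTML is an assumption of this sketch.
func isHTML(contentType string) bool {
	mt, err := mediaType(contentType)
	return err == nil && (mt == "text/html" || mt == "application/xhtml+xml")
}

func main() {
	ct := "text/html; charset=utf-8"
	mt, err := mediaType(ct)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mt, isHTML(ct))  // text/html true
	fmt.Println(isHTML("image/png")) // false
}
```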

func (*Parser) Node ¶

func (p *Parser) Node() *html.Node

Node returns the parsed HTML as an *html.Node.

func (*Parser) Parse ¶

func (p *Parser) Parse(reader io.Reader, resp *http.Response, targetURL string) error

Parse parses the HTML content from the provided io.Reader into an html.Node tree. It also records the HTTP headers from the provided http.Response. The targetURL parameter is used to resolve relative paths.

func (*Parser) URL ¶

func (p *Parser) URL() *url.URL

URL returns the target URL as a *url.URL.

type RealImageFetcher ¶ added in v0.0.3

type RealImageFetcher struct{}

RealImageFetcher is the real (non-mock) implementation of ImageFetcher.

func (*RealImageFetcher) NewImageFromURL ¶ added in v0.0.3

func (r *RealImageFetcher) NewImageFromURL(url string) (*image.Image, error)

type Result ¶

type Result struct {
	HTMLNode    *html.Node
	Headers     map[string][]string
	IsHTML      bool
	Metadata    metadata.Metadata
	MimeType    string
	Response    *http.Response
	StatusCode  int
	FetcherName string
}

Result is the struct that encapsulates the extracted metadata, along with the response data.

type ShouldInlineFunc ¶ added in v0.0.3

type ShouldInlineFunc func(*image.Image) bool

type UploadFunc ¶ added in v0.0.3

type UploadFunc func(*image.Image) (string, error)

Directories ¶

Path	Synopsis
fetchers
helpers
image
metadata
rules
sites
