Documentation
Overview ¶
Package gophetch is a library for fetching and extracting metadata from HTML pages.
Index ¶
- func ExtractDomain(rawURL string) (string, error)
- type Extractor
- func (e *Extractor) ApplySiteSpecificRules(site sites.Site)
- func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)
- func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)
- func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)
- type Gophetch
- type Headers
- type ImageFetcher
- type ImageInliner
- type ImageInlinerOptions
- type InlineStrategy
- type Parser
- type RealImageFetcher
- type Result
- type ShouldInlineFunc
- type UploadFunc
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ExtractDomain ¶
ExtractDomain extracts the domain from the given URL string.
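To illustrate the kind of work this involves, here is a minimal standard-library sketch. The helper name hostOf and its error behaviour are assumptions made for illustration; this is not gophetch's implementation, and ExtractDomain may treat ports, subdomains, or malformed input differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// hostOf is a hypothetical helper (not part of gophetch). It returns the
// host name of rawURL with any port stripped, or an error if the URL has
// no host component.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		return "", errors.New("no host in URL: " + rawURL)
	}
	return host, nil
}

func main() {
	host, err := hostOf("https://example.com:8080/articles/1?ref=home")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(host)
}
```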
Types ¶
type Extractor ¶
Extractor is the struct that encapsulates the rules used to extract metadata from HTML.
func NewExtractor ¶
func NewExtractor() *Extractor
NewExtractor creates a new Extractor struct with the default rules.
func (*Extractor) ApplySiteSpecificRules ¶
ApplySiteSpecificRules applies the custom rules for the given site.
func (*Extractor) ExtractMetadata ¶
ExtractMetadata extracts metadata from the given HTML node. The targetURL parameter is used to resolve relative paths.
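Resolving relative paths against the page URL can be sketched with the standard library alone. The helper resolveRef below is hypothetical and only shows the general technique (url.ResolveReference); how gophetch resolves paths internally is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRef is a hypothetical helper (not part of gophetch). It resolves
// ref, which may be relative or absolute, against the page URL base.
func resolveRef(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	page := "https://example.com/blog/post.html"
	refs := []string{"/img/cover.png", "thumb.png", "https://cdn.example.org/a.png"}
	for _, ref := range refs {
		abs, err := resolveRef(page, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ref, "->", abs)
	}
}
```

Absolute references pass through unchanged, so the same call can be applied to every URL found in the page.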
func (*Extractor) ExtractRule ¶
type Gophetch ¶
type Gophetch struct {
	Parser       *Parser
	Extractor    *Extractor
	Fetchers     []fetchers.HTMLFetcher
	SiteRegistry map[string]sites.Site
}
Gophetch is the main struct that encapsulates the parser, extractor, and fetchers.
func New ¶
func New(fetchers ...fetchers.HTMLFetcher) *Gophetch
New creates a new Gophetch struct with the provided fetchers.
func (*Gophetch) FetchAndParse ¶
FetchAndParse takes a target URL string, issues an HTTP request to fetch the HTML at that URL, parses the HTML to extract metadata, and returns the metadata together with the response data in a Result struct. Use it when the HTML content must be fetched from the internet before parsing.
func (*Gophetch) ReadAndParse ¶
ReadAndParse takes two parameters: an io.Reader containing the HTML to be parsed, and a target URL string. It reads the HTML from the io.Reader, parses it to extract metadata, and returns the metadata together with the response data in a Result struct. Use it when the HTML content is already available and does not need to be fetched from the internet.
func (*Gophetch) RegisterSite ¶
RegisterSite registers a site with the Gophetch instance. This allows the Gophetch instance to apply site-specific rules when extracting metadata from the HTML content.
type ImageFetcher ¶ added in v0.0.3
ImageFetcher is an interface for fetching images.
type ImageInliner ¶ added in v0.0.3
type ImageInliner struct {
	ShouldInline ShouldInlineFunc
	// contains filtered or unexported fields
}
ImageInliner is responsible for fetching and replacing images in HTML documents.
func NewImageInliner ¶ added in v0.0.3
func NewImageInliner(opts ImageInlinerOptions) *ImageInliner
NewImageInliner creates a new ImageInliner from the given options, which supply the fetcher, the upload function, and the storage strategy.
func (*ImageInliner) InlineImages ¶ added in v0.0.3
func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)
InlineImages replaces image URLs with either base64 inline versions or cloud URLs based on the set strategy.
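As an illustration of what a base64 inline version of an image looks like, the sketch below builds a data URI with the standard library. The helper dataURI is hypothetical; the exact output InlineImages produces is not specified in this documentation.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURI is a hypothetical helper (not part of gophetch). It encodes raw
// image bytes as a data URI that can stand in for an <img> src attribute.
func dataURI(mimeType string, content []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(content)
}

func main() {
	// Placeholder bytes; a real caller would pass the fetched image content.
	fmt.Println(dataURI("image/png", []byte("hi")))
}
```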
type ImageInlinerOptions ¶ added in v0.0.3
type ImageInlinerOptions struct {
	// ShouldInlineFunc is the function to use for determining whether an image should be inlined. Default is to inline
	// if image size is less than 100KB or if dimensions are smaller than 800x600 (based on the maxContentSize, maxWidth,
	// and maxHeight options).
	ShouldInlineFunc ShouldInlineFunc
	// Fetcher is the ImageFetcher to use for fetching images.
	Fetcher ImageFetcher
	// UploadFunc is the function to use for uploading images to cloud storage.
	UploadFunc UploadFunc
	// Strategy is the storage strategy to use.
	Strategy InlineStrategy
	// MaxContentSize is the maximum size in bytes for images to be processed in a hybrid strategy. Default is 100KB.
	MaxContentSize int64
	// MaxWidth is the maximum width in pixels for images to be processed in a hybrid strategy. Default is 800.
	MaxWidth int
	// MaxHeight is the maximum height in pixels for images to be processed in a hybrid strategy. Default is 600.
	MaxHeight int
}
ImageInlinerOptions are options for creating a new ImageInliner.
type InlineStrategy ¶ added in v0.0.3
type InlineStrategy string
const (
	// StrategyInline stores images as base64 inline
	StrategyInline InlineStrategy = "inline"
	// StrategyHybrid stores images as base64 inline if they are less than maxContentSize or
	// smaller than maxWidth x maxHeight, otherwise it uploads them to cloud storage
	StrategyHybrid InlineStrategy = "hybrid"
	// StrategyUpload stores images in cloud storage
	StrategyUpload InlineStrategy = "upload"
)
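The decision the three strategies describe can be sketched as a small predicate. All names below (strategy, limits, shouldInline) are hypothetical stand-ins rather than gophetch's own types. The sketch also assumes that 100KB means 100*1024 bytes and that "smaller than maxWidth x maxHeight" means both dimensions are below their limits; the library may interpret either differently.

```go
package main

import "fmt"

// strategy mirrors the documented string values; it is a local stand-in,
// not gophetch's InlineStrategy type.
type strategy string

const (
	strategyInline strategy = "inline"
	strategyHybrid strategy = "hybrid"
	strategyUpload strategy = "upload"
)

// limits holds the hybrid thresholds (hypothetical type).
type limits struct {
	maxContentSize      int64 // bytes
	maxWidth, maxHeight int   // pixels
}

// shouldInline reports whether an image would be stored inline.
// Assumption: under the hybrid strategy an image is inlined when it is
// under the size limit OR when both dimensions are under their limits.
func shouldInline(s strategy, l limits, size int64, width, height int) bool {
	switch s {
	case strategyInline:
		return true
	case strategyHybrid:
		underSize := size < l.maxContentSize
		underDims := width < l.maxWidth && height < l.maxHeight
		return underSize || underDims
	default: // strategyUpload and unknown values go to cloud storage
		return false
	}
}

func main() {
	l := limits{maxContentSize: 100 * 1024, maxWidth: 800, maxHeight: 600}
	fmt.Println(shouldInline(strategyHybrid, l, 50*1024, 1920, 1080))
	fmt.Println(shouldInline(strategyHybrid, l, 500*1024, 1920, 1080))
}
```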
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser is the struct that encapsulates the HTML parser.
type RealImageFetcher ¶ added in v0.0.3
type RealImageFetcher struct{}
RealImageFetcher is the fetcher backed by the actual implementation.
func (*RealImageFetcher) NewImageFromURL ¶ added in v0.0.3
func (r *RealImageFetcher) NewImageFromURL(url string) (*image.Image, error)
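Keeping image fetching behind a type like this makes it possible to swap in a test double. The sketch below assumes a fetcher only needs a NewImageFromURL method with the signature shown above; the ImageFetcher interface's method set is not listed in this documentation, so treat that as an assumption. The names fakeImageFetcher and fakeBounds are hypothetical.

```go
package main

import (
	"fmt"
	"image"
)

// fakeImageFetcher is a hypothetical test double (not part of gophetch).
// It returns a blank in-memory image of a fixed size and never touches
// the network.
type fakeImageFetcher struct {
	width, height int
}

// NewImageFromURL copies the signature documented for RealImageFetcher.
// The url argument is ignored by this fake.
func (f *fakeImageFetcher) NewImageFromURL(url string) (*image.Image, error) {
	var img image.Image = image.NewRGBA(image.Rect(0, 0, f.width, f.height))
	return &img, nil
}

// fakeBounds fetches through the fake and reports the image dimensions.
func fakeBounds(width, height int) (int, int, error) {
	f := &fakeImageFetcher{width: width, height: height}
	img, err := f.NewImageFromURL("https://example.com/cover.png")
	if err != nil {
		return 0, 0, err
	}
	b := (*img).Bounds()
	return b.Dx(), b.Dy(), nil
}

func main() {
	w, h, err := fakeBounds(640, 480)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(w, h)
}
```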
type Result ¶
type Result struct {
	HTMLNode    *html.Node
	Headers     map[string][]string
	IsHTML      bool
	Metadata    metadata.Metadata
	MimeType    string
	Response    *http.Response
	StatusCode  int
	FetcherName string
}
Result is the struct that encapsulates the extracted metadata, along with the response data.
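The MimeType and IsHTML fields suggest that the response's content type is classified at some point. How gophetch does this is not documented here; the following is one plausible standard-library sketch, where the helper classify and its notion of which media types count as HTML are assumptions.

```go
package main

import (
	"fmt"
	"mime"
)

// classify is a hypothetical helper (not part of gophetch). It parses a
// Content-Type header value and reports the bare media type and whether
// it denotes an HTML document.
func classify(contentType string) (mimeType string, isHTML bool) {
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", false
	}
	// Assumption: only these two media types are treated as HTML.
	return mt, mt == "text/html" || mt == "application/xhtml+xml"
}

func main() {
	for _, ct := range []string{"text/html; charset=utf-8", "image/png", ""} {
		mt, ok := classify(ct)
		fmt.Printf("%q -> %q %v\n", ct, mt, ok)
	}
}
```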