loaders

package
v0.0.0-...-22ee6a3
Warning

This package is not in the latest version of its module.

Published: Mar 13, 2024 | License: MIT | Imports: 12 | Imported by: 1

Documentation

Constants

const (
	BODY_EXPR       = "" /* 224-byte string literal not displayed */
	BODY_EXPR_SHORT = ".ArticleBase-Body, .post, .content, article, body"
)
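BODY_EXPR_SHORT is a comma-separated CSS selector list. The sketch below only shows how such a list breaks down into individual selectors. How the package actually applies the selectors is not documented here, so the helper `splitSelectors` and the constant `bodyExprShort` are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// bodyExprShort copies the documented value of BODY_EXPR_SHORT.
const bodyExprShort = ".ArticleBase-Body, .post, .content, article, body"

// splitSelectors breaks a comma-separated CSS selector list into its
// individual selectors, trimming whitespace and skipping empty entries.
// Hypothetical helper; not part of the loaders package.
func splitSelectors(list string) []string {
	parts := strings.Split(list, ",")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if s := strings.TrimSpace(p); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for i, sel := range splitSelectors(bodyExprShort) {
		fmt.Printf("%d: %s\n", i, sel)
	}
}
```

The list runs from a site-specific class (`.ArticleBase-Body`) down to the generic `body` element, which suggests a most-specific-first fallback, though that reading is an assumption.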
const (
	THE_HACKERSNEWS_SOURCE = "THE HACKERS NEWS"
	YC_HACKERNEWS_SOURCE   = "YC HACKER NEWS"
	MEDIUM_SOURCE          = "MEDIUM"
)
const (
	ARTICLE = "article"
)

Types

type WebLoader

type WebLoader struct {
	Config *WebLoaderConfig
	// contains filtered or unexported fields
}

WebLoader is a generic loader for web links and sites. The loaded content is cached.

func NewDefaultNewsSitemapLoader

func NewDefaultNewsSitemapLoader(days int, sitemap_url string) *WebLoader

Loads articles from https://feeds.feedburner.com/TheHackersNews that were posted in the last N days, where N is the days argument.

func NewDefaultWebTextLoader

func NewDefaultWebTextLoader(config *WebLoaderConfig) *WebLoader

sitemap_url can be "" if the collector is not intended for scraping any specific sitemap.

func NewMediumSiteLoader

func NewMediumSiteLoader(days int) *WebLoader

Loads Medium posts from https://medium.com/sitemap/sitemap.xml that have been modified in the last N days.

func NewRedditLinkLoader

func NewRedditLinkLoader() *WebLoader

func NewYCHackerNewsSiteLoader

func NewYCHackerNewsSiteLoader() *WebLoader

Loads story links from https://hacker-news.firebaseio.com/v0/topstories.json posted in the last N days.
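The topstories.json endpoint returns a JSON array of numeric story IDs, and each story is then available at `/v0/item/<id>.json`. The sketch below decodes a made-up sample payload rather than calling the network. The helper names `parseTopStories` and `itemURL` are hypothetical and do not reflect how the package itself fetches stories.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// itemURLFormat is the Hacker News API path for a single item.
const itemURLFormat = "https://hacker-news.firebaseio.com/v0/item/%d.json"

// parseTopStories decodes the body of topstories.json, a JSON array of IDs.
// Hypothetical helper; not part of the loaders package.
func parseTopStories(data []byte) ([]int64, error) {
	var ids []int64
	if err := json.Unmarshal(data, &ids); err != nil {
		return nil, fmt.Errorf("decode top stories: %w", err)
	}
	return ids, nil
}

// itemURL builds the API URL for one story ID.
func itemURL(id int64) string {
	return fmt.Sprintf(itemURLFormat, id)
}

func main() {
	sample := []byte(`[101, 102, 103]`) // made-up IDs for illustration
	ids, err := parseTopStories(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, id := range ids {
		fmt.Println(itemURL(id))
	}
}
```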

func (*WebLoader) Get

func (c *WebLoader) Get(url string) *doc.Document

func (*WebLoader) ListAll

func (c *WebLoader) ListAll() []*doc.Document

func (*WebLoader) LoadDocument

func (c *WebLoader) LoadDocument(url string) *doc.Document

Returns an extracted WebArticle if the response for url contains an HTML body.

func (*WebLoader) LoadSite

func (c *WebLoader) LoadSite() []*doc.Document

Loads all the documents from a sitemap or RSS feed.
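As a rough illustration of the sitemap half of that step, the sketch below parses a standard sitemap `urlset` with `encoding/xml` and lists each entry's location and last-modified value. The types `urlSet` and `sitemapURL`, the function `parseSitemap`, and the example.com URLs are all assumptions made for this example. The package's real parsing code is not shown in this documentation.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlSet mirrors the root <urlset> element of the sitemap protocol.
type urlSet struct {
	URLs []sitemapURL `xml:"url"`
}

// sitemapURL holds the two fields of a <url> entry used in this sketch.
type sitemapURL struct {
	Loc     string `xml:"loc"`
	LastMod string `xml:"lastmod"`
}

// parseSitemap decodes sitemap XML into its list of URL entries.
// Hypothetical helper; not part of the loaders package.
func parseSitemap(data []byte) ([]sitemapURL, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, fmt.Errorf("decode sitemap: %w", err)
	}
	return set.URLs, nil
}

// sample is a made-up sitemap used only for illustration.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-03-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-12</lastmod></url>
</urlset>`

func main() {
	urls, err := parseSitemap([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range urls {
		fmt.Println(u.Loc, u.LastMod)
	}
}
```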

type WebLoaderConfig

type WebLoaderConfig struct {
	Sitemap           string
	DisallowedFilters []string
	Timeout           time.Duration
	LocalCache        string
}
