swan

package module

Version: v0.0.0-...-d1079a5 | Published: Sep 4, 2019 | License: Apache-2.0 | Imports: 31 | Imported by: 4

Repository

github.com/thatguystone/swan

README ¶

Swan

An implementation of the Goose HTML Content / Article Extractor algorithm in Go.

Swan extracts cleaned-up text and HTML content from web pages by removing the extraneous clutter that surrounds the main content on so many pages.

See the Go documentation page for full usage and examples.

Features

Main content extraction from almost any source
Extract HTML content with images
Get article metadata, publish dates, and a lot more
Recognize different content types and apply special extractions (currently only recognizes comic sites and normal sites)

Planned

Inline videos into HTML content when found in an article
Recognize news sources and extract corresponding video / audio content
Recognize and extract more types of content
An interesting idea: https://github.com/buriy/python-readability/issues/57#issuecomment-67926023

Documentation ¶

Overview ¶

Package swan implements the Goose HTML Content / Article Extractor algorithm.

Currently, swan will try to extract the following content types:

Comics: if a page looks like a web comic, it is extracted as just an image. This is a work in progress.

Everything else: it will look for article text and try to extract any header image that goes with it.

Examples ¶

FromHTML

Constants ¶

const (
	// Version of the library
	Version = "1.0"
)

Variables ¶

This section is empty.

Functions ¶

func ToUtf8 ¶

func ToUtf8(html []byte) ([]byte, error)

ToUtf8 takes a page body, determines its character encoding, and converts it to UTF-8.
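
To make the idea of "detect, then convert" concrete, here is a deliberately naive, standard-library-only sketch. It is not swan's implementation: the function name toUTF8Naive is hypothetical, and the fallback to ISO-8859-1 is an assumption made purely for illustration. Real pages declare or imply many more encodings than this handles.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8Naive is a hypothetical, simplified stand-in for encoding conversion.
// It returns body unchanged when it is already valid UTF-8. Otherwise it
// ASSUMES the bytes are ISO-8859-1, where every byte maps to the Unicode
// code point with the same value, and re-encodes them as UTF-8.
func toUTF8Naive(body []byte) []byte {
	if utf8.Valid(body) {
		return body
	}
	runes := make([]rune, len(body))
	for i, b := range body {
		runes[i] = rune(b)
	}
	return []byte(string(runes))
}

func main() {
	// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8.
	latin1 := []byte{'c', 'a', 'f', 0xE9}
	fmt.Printf("%s\n", toUTF8Naive(latin1))
}
```

With swan itself, the equivalent step would be a call to ToUtf8 on the raw page body before any further processing.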

Types ¶

type Article ¶

type Article struct {
	// Final URL after all redirects
	URL string

	// Newline-separated and cleaned content
	CleanedText string

	// Node from which CleanedText was created. Call .Html() on this to get
	// printable HTML.
	TopNode *goquery.Selection

	// A header image to use for the article. Nil if no image could be
	// detected.
	Img *Image

	// All metadata associated with the original document
	Meta struct {
		Authors     []string
		Canonical   string
		Description string
		Domain      string
		Favicon     string
		Keywords    string
		Links       []string
		Lang        string
		OpenGraph   map[string]string
		PublishDate string
		Tags        []string
		Title       string
	}

	// Full document backing this article
	Doc *goquery.Document
	// contains filtered or unexported fields
}

Article is a fully extracted and cleaned document.
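
Because CleanedText is documented as newline-separated, one way to consume it is to split it into paragraphs. The sketch below uses only the standard library and a made-up sample string in place of a real Article; the helper name paragraphs is hypothetical and not part of swan.

```go
package main

import (
	"fmt"
	"strings"
)

// paragraphs splits newline-separated text into trimmed, non-empty lines.
// It is a hypothetical helper for working with a value such as
// Article.CleanedText.
func paragraphs(cleaned string) []string {
	var out []string
	for _, line := range strings.Split(cleaned, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Stand-in for article.CleanedText; the sample text is invented.
	cleaned := "First paragraph.\n\nSecond paragraph.\n"
	for i, p := range paragraphs(cleaned) {
		fmt.Printf("%d: %s\n", i+1, p)
	}
}
```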

func FromDoc ¶

func FromDoc(url string, doc *goquery.Document) (*Article, error)

FromDoc does its best to extract an article from a single document.

Pass in the URL the document came from so that images can be resolved correctly.
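
The reason the source URL matters can be illustrated with the standard library: a relative image reference is only meaningful once it is resolved against the page it came from. This is a sketch using net/url, not swan's internal logic; the helper name resolveRef and the example.com URLs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRef resolves a possibly relative reference (such as an <img> src)
// against the URL of the page it appeared on. Hypothetical helper.
func resolveRef(pageURL, ref string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	abs, err := base.Parse(ref)
	if err != nil {
		return "", err
	}
	return abs.String(), nil
}

func main() {
	page := "https://example.com/posts/2019/article.html"
	refs := []string{
		"img/header.png",          // relative to the page's directory
		"/static/logo.png",        // relative to the site root
		"//cdn.example.com/a.jpg", // scheme-relative
	}
	for _, ref := range refs {
		abs, err := resolveRef(page, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, none of the three references above could be turned into a fetchable address.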

func FromHTML ¶

func FromHTML(url string, html []byte) (*Article, error)

FromHTML does its best to extract an article from a single HTML page.