swan

package module
v0.0.0-...-d1079a5
This package is not in the latest version of its module.
Published: Sep 4, 2019 License: Apache-2.0 Imports: 31 Imported by: 4

README

An implementation of the Goose HTML Content / Article Extractor algorithm in Go.

Swan extracts cleaned-up text and HTML content from almost any web page by removing the extraneous markup that surrounds the main content on so many pages.

See the Go documentation page for full usage and examples.


Features

  • Main content extraction from almost any source
  • Extract HTML content with images
  • Get article metadata, publish dates, and a lot more
  • Recognize different content types and apply type-specific extraction (currently only comic sites and normal sites)

Planned

Documentation

Overview

Package swan implements the Goose HTML Content / Article Extractor algorithm.

Currently, swan will try to extract the following content types:

Comics: if a page looks like a web comic, it is extracted as just an image. This is a work in progress.

Everything else: swan looks for article text and tries to extract any header image that goes with it.

Constants

const (
	// Version of the library
	Version = "1.0"
)

Variables

This section is empty.

Functions

func ToUtf8

func ToUtf8(html []byte) ([]byte, error)

ToUtf8 takes a page body, determines its character encoding, and converts it to UTF-8.
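
How ToUtf8 detects the encoding is not described on this page. As a rough illustration of the detect-then-convert idea only, the sketch below uses nothing but the standard library and assumes just two cases: the input is already valid UTF-8, or it is ISO-8859-1. The names toUTF8Sketch and latin1ToUTF8 are hypothetical and are not part of swan; real pages can use many other encodings that this sketch does not handle.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 is a hypothetical helper (not part of swan). It treats every
// input byte as an ISO-8859-1 code point and re-encodes it as UTF-8.
func latin1ToUTF8(in []byte) []byte {
	out := make([]byte, 0, len(in)*2)
	for _, b := range in {
		out = utf8.AppendRune(out, rune(b))
	}
	return out
}

// toUTF8Sketch is a simplified stand-in for the idea behind ToUtf8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
func toUTF8Sketch(page []byte) []byte {
	if utf8.Valid(page) {
		return page
	}
	return latin1ToUTF8(page)
}

func main() {
	// "café" encoded as ISO-8859-1: 0xE9 is the single byte for é.
	page := []byte{'c', 'a', 'f', 0xE9}
	fmt.Printf("%s\n", toUTF8Sketch(page))
}
```

With the real package, the same step would be a single call to ToUtf8 on the page body, checking the returned error.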

Types

type Article

type Article struct {
	// Final URL after all redirects
	URL string

	// Newline-separated and cleaned content
	CleanedText string

	// Node from which CleanedText was created. Call .Html() on this to get
	// printable HTML.
	TopNode *goquery.Selection

	// A header image to use for the article. Nil if no image could be
	// detected.
	Img *Image

	// All metadata associated with the original document
	Meta struct {
		Authors     []string
		Canonical   string
		Description string
		Domain      string
		Favicon     string
		Keywords    string
		Links       []string
		Lang        string
		OpenGraph   map[string]string
		PublishDate string
		Tags        []string
		Title       string
	}

	// Full document backing this article
	Doc *goquery.Document
	// contains filtered or unexported fields
}

Article is a fully extracted and cleaned document.

func FromDoc

func FromDoc(url string, doc *goquery.Document) (*Article, error)

FromDoc does its best to extract an article from a single document.

Pass in the URL the document came from so that images can be resolved correctly.
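
To see why the source URL matters, consider how a relative image path is turned into an absolute one. The sketch below uses only net/url from the standard library; resolveImageSrc is a hypothetical helper for illustration and is not claimed to be how swan resolves images internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageSrc is a hypothetical helper (not part of swan). It resolves an
// <img src="..."> value against the URL of the page it appeared on.
func resolveImageSrc(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://example.com/article/1"
	srcs := []string{
		"img/header.png",               // relative to the page's directory
		"/static/header.png",           // relative to the site root
		"//cdn.example.com/header.png", // scheme-relative
	}
	for _, src := range srcs {
		abs, err := resolveImageSrc(page, src)
		if err != nil {
			panic(err)
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, none of these three src values identifies a fetchable image, which is why an empty or wrong URL argument can leave the header image unresolved.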

func FromHTML

func FromHTML(url string, html []byte) (*Article, error)

FromHTML does its best to extract an article from a single HTML page.

Pass in the URL the document came from so that images can be resolved correctly.

Example
htmlIn := `<html>
		<head>
			<title> Example Title </title>
			<meta property="og:site_name" content="Example Name"/>
		</head>
		<body>
			<p>some article body with a bunch of text in it</p>
		</body>
	</html>`

a, err := FromHTML("http://example.com/article/1", []byte(htmlIn))
if err != nil {
	panic(err)
}

if a.TopNode == nil {
	panic("no article could be extracted, " +
		"but a.Doc and a.Meta are still cleaned " +
		"and can be messed with ")
}

// Get the document title
fmt.Printf("Title: %s\n", a.Meta.Title)

// Hit any open graph tags
fmt.Printf("Site Name: %s\n", a.Meta.OpenGraph["site_name"])

// Print out any cleaned-up HTML that was found
html, err := a.TopNode.Html()
if err != nil {
	panic(err)
}
fmt.Printf("HTML: %s\n", strings.TrimSpace(html))

// Print out any cleaned-up text that was found
fmt.Printf("Plain: %s\n", a.CleanedText)
Output:

Title: Example Title
Site Name: Example Name
HTML: <p>some article body with a bunch of text in it</p>
Plain: some article body with a bunch of text in it

func FromURL

func FromURL(url string) (a *Article, err error)

FromURL does its best to extract an article from the given URL.
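
The Article.URL field is documented as the final URL after all redirects. How FromURL fetches pages is not shown on this page, so the following is only a sketch, using net/http and a local test server, of how a caller who fetches the page manually could obtain that final URL and the body before handing both to FromHTML. The helpers fetchPage and newDemoServer are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchPage is a hypothetical helper (not part of swan). It downloads a page
// and reports the URL the request ended on after following redirects.
func fetchPage(rawURL string) (finalURL string, body []byte, err error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "", nil, err
	}
	defer resp.Body.Close()

	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return "", nil, err
	}
	// resp.Request is the request that produced this response, which is the
	// last request in the redirect chain.
	return resp.Request.URL.String(), body, nil
}

// newDemoServer starts a local server where /old redirects to /article/1.
// It returns the server's base URL and a function that shuts it down.
func newDemoServer() (baseURL string, closeFn func()) {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/article/1", http.StatusFound)
	})
	mux.HandleFunc("/article/1", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><p>some article body</p></body></html>")
	})
	srv := httptest.NewServer(mux)
	return srv.URL, srv.Close
}

func main() {
	base, closeFn := newDemoServer()
	defer closeFn()

	finalURL, body, err := fetchPage(base + "/old")
	if err != nil {
		panic(err)
	}
	// finalURL and body are what a caller would pass on to FromHTML.
	fmt.Println(finalURL == base+"/article/1", len(body) > 0)
}
```

Passing the post-redirect URL rather than the one originally requested keeps relative links and images anchored to the location the content was actually served from.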

type Image

type Image struct {
	Src        string
	Width      uint
	Height     uint
	Bytes      int64
	Confidence uint
	Sel        *goquery.Selection
}

Image contains information about the header image associated with an article.

Directories

Path Synopsis
Internal tool: update the stopwords list from python-goose.
