readability

Version: v0.0.0-...-be4f809 | Published: Aug 9, 2024 | License: MIT | Imports: 16 | Imported by: 0

Repository

github.com/WiseEcho/go-readability

README ¶

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla and was written line by line so that it looks and behaves as similarly as possible. The hope is that every web page Readability.js can parse can be parsed by go-readability as well.

Status

This package is stable enough for use and up to date with Readability.js v0.4.4 (commit b359811).

Installation

To install this package, run go get:

go get -u -v github.com/WiseEcho/go-readability

Example

To get the readable content from a URL, use readability.FromURL. It fetches the web page from the specified URL, checks whether it is readable, then parses the response to find the readable content:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	readability "github.com/WiseEcho/go-readability"
)

var (
	urls = []string{
		// This one is an article, so it can be parsed.
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// This one is not an article, so readability will fail to parse it.
		"https://www.nytimes.com/",
	}
)

func main() {
	for i, url := range urls {
		article, err := readability.FromURL(url, 30*time.Second)
		if err != nil {
			log.Fatalf("failed to parse %s, %v\n", url, err)
		}

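		// Write the plain-text and HTML versions. os.WriteFile closes the
		// file itself, so no defer is needed inside the loop.
		txtName := fmt.Sprintf("text-%02d.txt", i+1)
		if err := os.WriteFile(txtName, []byte(article.TextContent), 0o644); err != nil {
			log.Fatalf("failed to write %s: %v\n", txtName, err)
		}

		htmlName := fmt.Sprintf("html-%02d.html", i+1)
		if err := os.WriteFile(htmlName, []byte(article.Content), 0o644); err != nil {
			log.Fatalf("failed to write %s: %v\n", htmlName, err)
		}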

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Printf("Text content saved to \"text-%02d.txt\"\n", i+1)
		fmt.Printf("HTML content saved to \"html-%02d.html\"\n", i+1)
		fmt.Println()
	}
}

However, sometimes you want to parse a URL regardless of whether it is an article, for example when you only need the page's metadata. To do that, download the page yourself using http.Get, then parse it using readability.FromReader:

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	readability "github.com/WiseEcho/go-readability"
)

var (
	urls = []string{
		// Both will be parse-able now
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// But this one will not have any content
		"https://www.nytimes.com/",
	}
)

func main() {
	for _, u := range urls {
		resp, err := http.Get(u)
		if err != nil {
			log.Fatalf("failed to download %s: %v\n", u, err)
		}

		parsedURL, err := url.Parse(u)
		if err != nil {
			resp.Body.Close()
			log.Fatalf("failed to parse url %s: %v\n", u, err)
		}

		article, err := readability.FromReader(resp.Body, parsedURL)
		// Close the body right away. A defer inside the loop would keep
		// every response open until main returns.
		resp.Body.Close()
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", u, err)
		}

		fmt.Printf("URL     : %s\n", u)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Println()
	}
}
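
http.Get uses the default client, which has no timeout, and the example above hands the body to the parser whatever the status code is. A slightly more defensive download step might look like the sketch below. fetchPage, newTestSite, and defaultTimeout are hypothetical names, and the local httptest server only stands in for a real site so the sketch runs offline. The returned bytes could then be wrapped in bytes.NewReader and passed to readability.FromReader.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// defaultTimeout is an arbitrary value chosen for this sketch.
const defaultTimeout = 5 * time.Second

// fetchPage downloads rawURL with a bounded timeout and rejects
// non-2xx responses before any parsing is attempted.
func fetchPage(rawURL string, timeout time.Duration) ([]byte, error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(rawURL)
	if err != nil {
		return nil, fmt.Errorf("download %s: %w", rawURL, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, fmt.Errorf("download %s: unexpected status %s", rawURL, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newTestSite returns a local server: "/" serves a small page and
// every other path answers 404.
func newTestSite() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "<html><body><p>hello</p></body></html>")
	}))
}

func main() {
	site := newTestSite()
	defer site.Close()

	body, err := fetchPage(site.URL+"/", defaultTimeout)
	fmt.Println(len(body), err)

	_, err = fetchPage(site.URL+"/missing", defaultTimeout)
	fmt.Println("missing page rejected:", err != nil)
}
```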

Command Line Usage

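You can also use go-readability as a command-line app. To do that, first install the CLI. With Go 1.16 or newer, use go install (since Go 1.18, go get no longer builds or installs binaries):

go install github.com/WiseEcho/go-readability/cmd/go-readability@latest

Now you can use it by running go-readability in your terminal: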

$ go-readability -h

go-readability is parser to fetch the readable content of a web page.
The source can be an url or existing file in your storage.

Usage:
  go-readability [flags] source

Flags:
  -h, --help          help for go-readability
  -l, --http string   start the http server at the specified address
  -m, --metadata      only print the page's metadata
  -t, --text          only print the page's text

Licenses

Go-Readability is distributed under the MIT license, which means you can use and modify it however you want. If you make an enhancement, please consider sending a pull request. If you like this project, please consider donating to me via PayPal or Ko-Fi.

Documentation ¶

Overview ¶

Package readability is a Go package that finds the main readable content of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla and was written line by line so that it looks and behaves as similarly as possible. The hope is that every web page Readability.js can parse can be parsed by go-readability as well.

Index ¶

func Check(input io.Reader) bool
func CheckDocument(doc *html.Node) bool
func WithHeader(headers map[string]string) requestWith
func WithUserAgent(userAgent string) requestWith
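type Article
- func FromDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)
- func FromReader(input io.Reader, pageURL *nurl.URL) (Article, error)
- func FromURL(pageURL string, timeout time.Duration, requestModifiers ...requestWith) (Article, error)
type Parser
- func NewParser() Parser
- func (ps *Parser) Check(input io.Reader) bool
- func (ps *Parser) CheckDocument(doc *html.Node) bool
- func (ps *Parser) Parse(input io.Reader, pageURL *nurl.URL) (Article, error)
- func (ps *Parser) ParseDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)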

Functions ¶

func Check ¶

func Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing. It is a wrapper for `Parser.Check()` and is useful if you only use the default parser.
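
Because Check takes an io.Reader, it reads from that reader, so the same reader cannot simply be handed to FromReader afterwards. One way to handle this, sketched below, is to buffer the page in memory and give each call its own bytes.Reader. The check and parse parameters are hypothetical stand-ins with the shape of readability.Check and of a closure around readability.FromReader; toyCheck, toyParse, and checkPage exist only so the sketch runs without the package.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// errNotReadable is returned when the check function rejects the input.
var errNotReadable = errors.New("input is not readable")

// checkThenParse buffers src once, then gives check and parse each a
// fresh reader over the same bytes so neither sees a drained stream.
func checkThenParse(
	src io.Reader,
	check func(io.Reader) bool,
	parse func(io.Reader) (string, error),
) (string, error) {
	data, err := io.ReadAll(src)
	if err != nil {
		return "", err
	}
	if !check(bytes.NewReader(data)) {
		return "", errNotReadable
	}
	return parse(bytes.NewReader(data))
}

// toyCheck is a toy stand-in: it accepts input containing an <article> tag.
func toyCheck(r io.Reader) bool {
	b, _ := io.ReadAll(r)
	return bytes.Contains(b, []byte("<article"))
}

// toyParse is a toy stand-in: it returns the input unchanged.
func toyParse(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	return string(b), err
}

// checkPage wires the toy functions together for a string input.
func checkPage(page string) (string, error) {
	return checkThenParse(strings.NewReader(page), toyCheck, toyParse)
}

func main() {
	out, err := checkPage("<html><body><article>Hello</article></body></html>")
	fmt.Println(len(out), err)

	_, err = checkPage("<html><body><nav>menu</nav></body></html>")
	fmt.Println(err)
}
```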

func CheckDocument ¶

func CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing. It is a wrapper for `Parser.CheckDocument()` and is useful if you only use the default parser.

func WithHeader ¶

func WithHeader(headers map[string]string) requestWith

WithHeader adds headers to the request.

func WithUserAgent ¶

func WithUserAgent(userAgent string) requestWith
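
The values returned by WithHeader and WithUserAgent are passed to FromURL through its variadic requestModifiers parameter. This page does not show how requestWith is defined, so the following is only a sketch of the functional-option pattern that such modifiers commonly follow. The type requestOption and the helpers withHeader, withUserAgent, and newRequest are hypothetical names, not the package's own.

```go
package main

import (
	"fmt"
	"net/http"
)

// requestOption is a hypothetical stand-in for the package's unexported
// requestWith type: a function that adjusts a request before it is sent.
type requestOption func(*http.Request)

// withHeader sets every given header on the request.
func withHeader(headers map[string]string) requestOption {
	return func(req *http.Request) {
		for k, v := range headers {
			req.Header.Set(k, v)
		}
	}
}

// withUserAgent sets the User-Agent header.
func withUserAgent(userAgent string) requestOption {
	return func(req *http.Request) {
		req.Header.Set("User-Agent", userAgent)
	}
}

// newRequest builds a GET request and applies the options in order.
func newRequest(rawURL string, opts ...requestOption) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	for _, opt := range opts {
		opt(req)
	}
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com/article",
		withUserAgent("my-crawler/1.0"),
		withHeader(map[string]string{"Accept-Language": "en"}),
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("User-Agent     :", req.Header.Get("User-Agent"))
	fmt.Println("Accept-Language:", req.Header.Get("Accept-Language"))
}
```

With the real package, the equivalent call would pass readability.WithUserAgent(...) and readability.WithHeader(...) after the timeout argument of readability.FromURL.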

Types ¶

type Article ¶

type Article struct {
	Title         string
	Byline        string
	Node          *html.Node
	Content       string
	TextContent   string
	Length        int
	Excerpt       string
	SiteName      string
	Image         string
	Favicon       string
	Language      string
	PublishedTime *time.Time
	ModifiedTime  *time.Time
	HttpCode      int
}

Article is the final readable content.
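
PublishedTime and ModifiedTime are pointers, so callers should presumably be prepared for nil when a page carries no date metadata (an assumption; this page does not state when the fields are set). A minimal sketch of a nil-safe formatter follows. The articleDates struct is a local stand-in that mirrors only those two fields of Article, and formatTime, describe, and mustParse are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// articleDates mirrors only the two time fields of Article so the
// sketch compiles without the package.
type articleDates struct {
	PublishedTime *time.Time
	ModifiedTime  *time.Time
}

// formatTime renders a possibly nil timestamp as RFC 3339 in UTC.
func formatTime(t *time.Time) string {
	if t == nil {
		return "unknown"
	}
	return t.UTC().Format(time.RFC3339)
}

// describe summarises both dates in one line.
func describe(a articleDates) string {
	return fmt.Sprintf("published: %s, modified: %s",
		formatTime(a.PublishedTime), formatTime(a.ModifiedTime))
}

// mustParse builds a sample timestamp and panics on malformed input.
func mustParse(s string) *time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return &t
}

func main() {
	a := articleDates{PublishedTime: mustParse("2024-08-09T10:00:00Z")}
	fmt.Println(describe(a))
	// prints "published: 2024-08-09T10:00:00Z, modified: unknown"
}
```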

func FromDocument ¶

func FromDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

FromDocument parses a document and returns the readable content. It is a wrapper for `Parser.ParseDocument()` and is useful if you only want to use the default parser.

func FromReader ¶

func FromReader(input io.Reader, pageURL *nurl.URL) (Article, error)

FromReader parses an `io.Reader` and returns the readable content. It is a wrapper for `Parser.Parse()` and is useful if you only want to use the default parser.

func FromURL ¶

func FromURL(pageURL string, timeout time.Duration, requestModifiers ...requestWith) (Article, error)

FromURL fetches the web page from the specified URL, then parses the response to find the readable content.

type Parser ¶

type Parser struct {
	// MaxElemsToParse is the max number of nodes supported by this
	// parser. Default: 0 (no limit)
	MaxElemsToParse int
	// NTopCandidates is the number of top candidates to consider when
	// analysing how tight the competition is among candidates.
	NTopCandidates int
	// CharThresholds is the default number of chars an article must
	// have in order to return a result
	CharThresholds int
	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string
	// KeepClasses specifies whether classes should be kept rather than stripped.
	KeepClasses bool
	// TagsToScore lists the element tags to score by default.
	TagsToScore []string
	// Debug determines whether the log should be printed. Default: false.
	Debug bool
	// DisableJSONLD, when true, skips extraction of metadata in JSON-LD.
	// Default: false.
	DisableJSONLD bool
	// AllowedVideoRegex is a regular expression that matches video URLs that should be
	// allowed to be included in the article content. If undefined, it will use default filter.
	AllowedVideoRegex *regexp.Regexp
	// contains filtered or unexported fields
}

Parser is the parser that parses the page to get the readable content.
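
AllowedVideoRegex takes a compiled *regexp.Regexp. The sketch below shows how such a pattern could be built and sanity-checked against a few URLs before being assigned to the field. The pattern and the host names in it are illustrative assumptions, not the package's default filter, and isAllowedVideo is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// videoPattern is an illustrative allow-list, not the package default.
// It is left unanchored so it matches anywhere in the tested string.
var videoPattern = regexp.MustCompile(`(?i)//(www\.)?(youtube\.com|player\.vimeo\.com)/`)

// isAllowedVideo reports whether src matches the allow-list pattern.
func isAllowedVideo(re *regexp.Regexp, src string) bool {
	return re.MatchString(src)
}

func main() {
	samples := []string{
		"https://www.youtube.com/embed/abc123",
		"//player.vimeo.com/video/42",
		"https://ads.example.com/video.mp4",
	}
	for _, src := range samples {
		fmt.Printf("%-40s allowed=%v\n", src, isAllowedVideo(videoPattern, src))
	}
}
```

With the real package, the compiled value would be assigned to the AllowedVideoRegex field of a Parser obtained from NewParser.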

func NewParser ¶

func NewParser() Parser

NewParser returns a new Parser set up with default values.

func (*Parser) Check ¶

func (ps *Parser) Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing.

func (*Parser) CheckDocument ¶

func (ps *Parser) CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing.

func (*Parser) Parse ¶

func (ps *Parser) Parse(input io.Reader, pageURL *nurl.URL) (Article, error)

Parse parses a reader and finds the main readable content.

func (*Parser) ParseDocument ¶

func (ps *Parser) ParseDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseDocument parses the specified document and finds the main readable content.

Directories ¶

Path	Synopsis
cmd/go-readability
scripts
