README

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter like buttons, ads, background images, scripts, etc.

This package is based on Readability.js by Mozilla, and was written line by line to make sure it looks and works as similarly as possible. This way, hopefully all web pages that can be parsed by Readability.js are parse-able by go-readability as well.

Status

This package is stable enough for use, and is up to date with Readability.js as of commit 2982216.

Installation

To install this package, run go get:

go get -u -v gitea.com/huiyifyj/readability

Example

To get the readable content from a URL, you can use readability.FromURL. It fetches the web page from the specified URL, checks whether it's readable, then parses the response to find the readable content:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	readability "gitea.com/huiyifyj/readability"
)

var (
	urls = []string{
		// This one is an article, so it's parse-able.
		"https://huiyifyj.cn/a-case-for-using-void-in-modern-javascript/",
		// This one is not an article, so readability will fail to parse it.
		"https://huiyifyj.cn",
	}
)

func main() {
	for i, url := range urls {
		article, err := readability.FromURL(url, 30*time.Second)
		if err != nil {
			log.Fatalf("failed to parse %s, %v\n", url, err)
		}

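		// Save the plain-text and HTML versions, checking for write errors.
		txtName := fmt.Sprintf("text-%02d.txt", i+1)
		if err := os.WriteFile(txtName, []byte(article.TextContent), 0644); err != nil {
			log.Fatalf("failed to write %s: %v\n", txtName, err)
		}

		htmlName := fmt.Sprintf("html-%02d.html", i+1)
		if err := os.WriteFile(htmlName, []byte(article.Content), 0644); err != nil {
			log.Fatalf("failed to write %s: %v\n", htmlName, err)
		}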

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Printf("Text content saved to \"text-%02d.txt\"\n", i+1)
		fmt.Printf("HTML content saved to \"html-%02d.html\"\n", i+1)
		fmt.Println()
	}
}

However, sometimes you want to parse a URL regardless of whether it's an article, for example when you only want the page's metadata. To do that, download the page manually using http.Get, then parse it using readability.FromReader:

package main

import (
	"fmt"
	"log"
	"net/http"

	readability "gitea.com/huiyifyj/readability"
)

var (
	urls = []string{
		// Both will be parse-able now
		"https://huiyifyj.cn/a-case-for-using-void-in-modern-javascript",
		// But this one will not have any content
		"https://huiyifyj.cn",
	}
)

func main() {
	for _, url := range urls {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatalf("failed to download %s: %v\n", url, err)
		}
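		// Parse first, then close the body explicitly; a defer inside the
		// loop would keep every response open until main returns.
		article, err := readability.FromReader(resp.Body, url)
		resp.Body.Close()
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", url, err)
		}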

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Println()
	}
}

Command Line Usage

You can also use go-readability as a command-line app. To do that, first install the CLI:

go get -u -v gitea.com/huiyifyj/readability/cmd/...

Now you can use it by running go-readability in your terminal:

$ go-readability -h

go-readability is parser to fetch the readable content of a web page.
The source can be an url or existing file in your storage.

Usage:
  go-readability [flags] source

Flags:
  -h, --help       help for go-readability
  -m, --metadata   only print the page's metadata

Licenses

Go-Readability is distributed under the MIT license, which means you can use and modify it however you want. However, if you make an enhancement, please send a pull request if possible.

Documentation

Overview

    Package readability is a Go package that finds the main readable content from an HTML page. It works by removing clutter like buttons, ads, background images, scripts, etc.

    This package is based on Readability.js by Mozilla, and was written line by line to make sure it looks and works as similarly as possible. This way, hopefully all web pages that can be parsed by Readability.js are parse-able by go-readability as well.

    Functions

    func IsReadable

    func IsReadable(input io.Reader) bool

      IsReadable decides whether or not the document is readable without parsing the whole thing. It's a wrapper for `Parser.IsReadable()` and is useful if you only use the default parser.

    Types

    type Article

    type Article struct {
    	Title       string
    	Byline      string
    	Node        *html.Node
    	Content     string
    	TextContent string
    	Length      int
    	Excerpt     string
    	SiteName    string
    	Image       string
    	Favicon     string
    }

      Article is the final readable content.

    func FromReader

    func FromReader(input io.Reader, pageURL string) (Article, error)

      FromReader parses input from an `io.Reader` and returns the readable content. It's a wrapper for `Parser.Parse()` and is useful if you only want to use the default parser.

    func FromURL

    func FromURL(pageURL string, timeout time.Duration) (Article, error)

      FromURL fetches the web page from the specified URL, checks whether it's readable, then parses the response to find the readable content.

    type Parser

    type Parser struct {
    	// MaxElemsToParse is the max number of nodes supported by this
    	// parser. Default: 0 (no limit)
    	MaxElemsToParse int
    	// NTopCandidates is the number of top candidates to consider when
    	// analysing how tight the competition is among candidates.
    	NTopCandidates int
    	// CharThresholds is the default number of chars an article must
    	// have in order to return a result
    	CharThresholds int
    	// ClassesToPreserve are the classes that readability sets itself.
    	ClassesToPreserve []string
    	// KeepClasses specifies whether the classes should be stripped or not.
    	KeepClasses bool
    	// TagsToScore lists the element tags to score by default.
    	TagsToScore []string
    	// Debug determines if the log should be printed or not. Default: false.
    	Debug bool
    	// contains filtered or unexported fields
    }

      Parser is the parser that parses the page to get the readable content.

    func NewParser

    func NewParser() Parser

      NewParser returns a new Parser that is set up with default values.

    func (*Parser) IsReadable

    func (ps *Parser) IsReadable(input io.Reader) bool

      IsReadable decides whether or not the document is readable without parsing the whole thing. In `mozilla/readability`, this method is located in `Readability-readable.js`.

    func (*Parser) Parse

    func (ps *Parser) Parse(input io.Reader, pageURL string) (Article, error)

      Parse parses the input and finds the main readable content.

    Directories

    Path    Synopsis
    cmd