goose

v0.0.0-...-0b4d255
Published: Apr 6, 2022 License: Apache-2.0 Imports: 25 Imported by: 0

README

GoOse

HTML Content / Article Extractor in Golang

This is a Golang port of "Goose", originally licensed to Gravity.com under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.

The Golang port was written by Antonio Linari.

Gravity.com licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

INSTALL

go get github.com/advancedlogic/GoOse

HOW TO USE IT

package main

import (
	"fmt"

	// The package name is goose, so alias the import explicitly.
	goose "github.com/advancedlogic/GoOse"
)

func main() {
	// Create a new instance of the article extractor.
	g := goose.New()
	// Fetch the page and extract the article from it.
	article := g.ExtractFromURL("http://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	fmt.Println("title", article.Title)
	fmt.Println("description", article.MetaDescription)
	fmt.Println("keywords", article.MetaKeywords)
	fmt.Println("content", article.CleanedText)
	fmt.Println("url", article.FinalURL)
	fmt.Println("top image", article.TopImage)
}

Development - Getting started

This application is written in Go; please refer to the guides at https://golang.org to get started.

This project includes a Makefile that lets you test and build the project with simple commands. To see all available options:

make help

To build the project:

make build

Before committing code, please check that it passes all tests using:

make qa

TODO

  • better organize code
  • add comments and godoc
  • improve "xpath" like queries
  • add other image extraction techniques (ImageMagick)

THANKS TO

@Martin Angers for goquery
@Fatih Arslan for set
GoLang team for the amazing language and net/html

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func OpenGraphResolver

func OpenGraphResolver(article *Article) string

OpenGraphResolver returns OpenGraph properties

func ReadLinesOfFile

func ReadLinesOfFile(filename string) []string

ReadLinesOfFile returns the lines from a file as a slice of strings
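
The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: readLinesSketch and writeTempLines are hypothetical names, and unlike ReadLinesOfFile the sketch also returns an error so that a missing file is visible to the caller.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// readLinesSketch is a hypothetical stand-in for ReadLinesOfFile.
// It reads the file line by line and returns the lines without their
// trailing newline characters.
func readLinesSketch(filename string) ([]string, error) {
	f, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var lines []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines, scanner.Err()
}

// writeTempLines creates a throwaway file so the example is self-contained.
func writeTempLines(content string) (string, error) {
	f, err := os.CreateTemp("", "lines-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

func main() {
	name, err := writeTempLines("the\nand\nof\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(name)

	lines, err := readLinesSketch(name)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(len(lines), lines)
}
```

Note that bufio.Scanner rejects lines longer than its default buffer (64 KiB), which is usually acceptable for short word lists but not for arbitrary input.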

func RegSplit

func RegSplit(text string, reg *regexp.Regexp) []string

RegSplit splits the string into substrings using the regular expression as separator
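
As a rough illustration of splitting on a regular expression, the sketch below uses regexp.Split from the standard library. It is an assumption that this matches RegSplit's results in every edge case; regSplitSketch and commaSep are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// commaSep matches a comma together with any surrounding whitespace.
var commaSep = regexp.MustCompile(`\s*,\s*`)

// regSplitSketch is a hypothetical stand-in for RegSplit: it cuts text at
// every match of reg and returns the pieces between the matches.
func regSplitSketch(text string, reg *regexp.Regexp) []string {
	// A negative count asks regexp.Split for all substrings.
	return reg.Split(text, -1)
}

func main() {
	for _, part := range regSplitSketch("go, html ,  parser", commaSep) {
		fmt.Printf("%q\n", part)
	}
}
```

When the separator does not occur at all, the sketch returns a one-element slice containing the whole input.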

func WebPageResolver

func WebPageResolver(article *Article) string

WebPageResolver fetches the main image from the HTML page

Types

type Article

type Article struct {
	Title           string             `json:"title,omitempty"`
	CleanedText     string             `json:"content,omitempty"`
	MetaDescription string             `json:"description,omitempty"`
	MetaLang        string             `json:"lang,omitempty"`
	MetaFavicon     string             `json:"favicon,omitempty"`
	MetaKeywords    string             `json:"keywords,omitempty"`
	CanonicalLink   string             `json:"canonicalurl,omitempty"`
	Domain          string             `json:"domain,omitempty"`
	TopNode         *goquery.Selection `json:"-"`
	TopImage        string             `json:"image,omitempty"`
	Tags            *set.Set           `json:"tags,omitempty"`
	Movies          *set.Set           `json:"movies,omitempty"`
	FinalURL        string             `json:"url,omitempty"`
	LinkHash        string             `json:"linkhash,omitempty"`
	RawHTML         string             `json:"rawhtml,omitempty"`
	Doc             *goquery.Document  `json:"-"`
	Links           []string           `json:"links,omitempty"`
	PublishDate     string             `json:"publishdate,omitempty"`
	AdditionalData  map[string]string  `json:"additionaldata,omitempty"`
	Delta           int64              `json:"delta,omitempty"`
}

Article is a collection of properties extracted from the HTML body
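
The struct tags determine how an Article looks when marshalled to JSON: each field is renamed (CleanedText becomes "content", FinalURL becomes "url") and empty fields are omitted. The sketch below shows this with encoding/json on articleSketch, a hypothetical trimmed copy that keeps only a few string fields and their tags; it does not use the package's own type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSketch is a hypothetical, trimmed copy of Article that keeps a few
// string fields and the JSON tags shown in the struct definition.
type articleSketch struct {
	Title           string `json:"title,omitempty"`
	CleanedText     string `json:"content,omitempty"`
	MetaDescription string `json:"description,omitempty"`
	TopImage        string `json:"image,omitempty"`
	FinalURL        string `json:"url,omitempty"`
}

// toJSON marshals the sketch; fields left empty are dropped by omitempty.
func toJSON(a articleSketch) (string, error) {
	b, err := json.Marshal(a)
	return string(b), err
}

func main() {
	out, err := toJSON(articleSketch{
		Title:       "Example title",
		CleanedText: "Example body text.",
		FinalURL:    "http://example.com/post",
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	// MetaDescription and TopImage are empty, so they do not appear.
	fmt.Println(out)
}
```

In the real struct, TopNode and Doc carry the tag `json:"-"`, so they are never included in the output.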

func (*Article) ToString

func (article *Article) ToString() string

ToString is a simple method that just shows the title. TODO: add more fields and pretty print.

type Cleaner

type Cleaner struct {
	// contains filtered or unexported fields
}

Cleaner removes menus, ads, sidebars, etc. and leaves the main content

func NewCleaner

func NewCleaner(config Configuration) Cleaner

NewCleaner returns a new instance of a Cleaner

type Configuration

type Configuration struct {
	// contains filtered or unexported fields
}

Configuration is a wrapper for various config options

func GetDefaultConfiguration

func GetDefaultConfiguration(args ...string) Configuration

GetDefaultConfiguration returns safe default configuration options

type ContentExtractor

type ContentExtractor struct {
	// contains filtered or unexported fields
}

ContentExtractor can parse the HTML and fetch various properties

func NewExtractor

func NewExtractor(config Configuration) ContentExtractor

NewExtractor returns a configured HTML parser

type Crawler

type Crawler struct {
	RawHTML string
	// contains filtered or unexported fields
}

Crawler can fetch the target HTML page

func NewCrawler

func NewCrawler(config Configuration, url string, RawHTML string) Crawler

NewCrawler returns a crawler object initialised with the URL and the [optional] raw HTML body

func (Crawler) Crawl

func (c Crawler) Crawl() *Article

Crawl fetches the HTML body and returns an Article

type Goose

type Goose struct {
	// contains filtered or unexported fields
}

Goose is the main entry point of the program

func New

func New(args ...string) Goose

New returns a new instance of the article extractor

func (Goose) ExtractFromRawHTML

func (g Goose) ExtractFromRawHTML(url string, RawHTML string) *Article

ExtractFromRawHTML returns an article object from the raw HTML content

func (Goose) ExtractFromURL

func (g Goose) ExtractFromURL(url string) *Article

ExtractFromURL follows the URL, fetches the HTML page and returns an article object

type Helper

type Helper struct {
	// contains filtered or unexported fields
}

Helper is a utility struct to clean up URLs and charsets

func NewRawHelper

func NewRawHelper(url string, RawHTML string) Helper

NewRawHelper converts the text to UTF-8

func NewURLHelper

func NewURLHelper(url string) Helper

NewURLHelper wraps the URL

type Parser

type Parser struct{}

Parser is an HTML parser specialised in extraction of main content and other properties

func NewParser

func NewParser() *Parser

NewParser returns an HTML parser

type StopWords

type StopWords struct {
	// contains filtered or unexported fields
}

StopWords implements a simple language detector

func NewStopwords

func NewStopwords() StopWords

NewStopwords returns an instance of a stop words detector

func (StopWords) SimpleLanguageDetector

func (stop StopWords) SimpleLanguageDetector(text string) string

SimpleLanguageDetector returns the language code for the text, based on its stop words
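
One plausible way to detect a language from stop words is to count, for each language, how many words of the text appear in that language's stop-word list and pick the language with the most hits. The sketch below illustrates that idea only; the word lists are tiny and made up, and stopList, sampleStopWords and detectLanguageSketch are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// stopList pairs a language code with a few of its stop words.
type stopList struct {
	lang  string
	words []string
}

// sampleStopWords is a tiny, made-up data set for illustration only; it is
// not the word list used by the package.
var sampleStopWords = []stopList{
	{"en", []string{"the", "and", "of", "is", "to"}},
	{"es", []string{"el", "la", "de", "es", "y"}},
	{"it", []string{"il", "di", "che", "per", "non"}},
}

// detectLanguageSketch counts how many words of text appear in each
// language's stop-word list and returns the code with the most hits.
// Ties go to the language listed first; "" means no stop word matched.
func detectLanguageSketch(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestCount := "", 0
	for _, list := range sampleStopWords {
		known := make(map[string]bool, len(list.words))
		for _, w := range list.words {
			known[w] = true
		}
		count := 0
		for _, w := range words {
			// Strip common punctuation before the lookup.
			if known[strings.Trim(w, ".,;:!?")] {
				count++
			}
		}
		if count > bestCount {
			best, bestCount = list.lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguageSketch("The cat is on the roof of the house."))
	fmt.Println(detectLanguageSketch("El perro es de la casa y el gato."))
}
```

Stop words are useful for this purpose because they are frequent in any running text and largely distinct between languages, so even short passages tend to produce a clear winner.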

type VideoExtractor

type VideoExtractor struct {
	// contains filtered or unexported fields
}

VideoExtractor can extract the main video from an HTML page

func NewVideoExtractor

func NewVideoExtractor() VideoExtractor

NewVideoExtractor returns a new instance of an HTML video extractor

func (*VideoExtractor) GetVideos

func (ve *VideoExtractor) GetVideos(article *Article) *set.Set

GetVideos returns the video tags embedded in the article
