ratt

package module
v0.0.0-...-555d5c5
Published: Jul 9, 2021 License: MIT Imports: 22 Imported by: 0

README

ratt

RSS all the things!

ratt is a tool for converting websites to RSS/Atom feeds. It uses config files that define how the feed data is extracted, using CSS selectors or Lua scripts.

Config files are in yaml format:

#for automatic extraction, ratt checks all config files and matches the regex
regex: https://videoportal.joj.sk/.*
selectors:
    #settings for all http requests for the website
    httpsettings:
        cookie: {}
        header: {}
        useragent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36
    #css selectors to get the feed data
    feed:
        title: .title.my-2
        description: .description
        authorname:
        authoremail:
    #css selectors to get item data
    item:
        #the item container
        container: article.b-article.title-xs.article-lp
        #all subsequent attributes of the item are selected from the subtree of the item container
        title: div.content > h3
        link: a
        linkattr: href
        created: .date
        createdformat: 2.1.2006
        description: div.col > .date
        image: img.img-fluid
        imageattr: data-original

Configs

Config files are yaml files, and ratt has some configs embedded. When calling e.g. ratt auto https://1337x.to/top-100, ratt tries to find the config for the website URL. It searches the embedded config files, the current directory, and ~/.config/ratt/*.yml.
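
The lookup can be pictured as matching the URL against the regex of each known config. This is only a sketch of that idea using the standard regexp package, not ratt's actual code: the conf type, the findConf function and the file names are hypothetical, and the order in which ratt really consults its config sources is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// conf is a hypothetical stand-in for a loaded config file.
// Only the regex matters for the lookup.
type conf struct {
	Name  string
	Regex string
}

// findConf returns the first config whose regex matches the URL.
func findConf(confs []conf, url string) (conf, bool) {
	for _, c := range confs {
		re, err := regexp.Compile(c.Regex)
		if err != nil {
			continue // skip configs with an invalid regex
		}
		if re.MatchString(url) {
			return c, true
		}
	}
	return conf{}, false
}

func main() {
	confs := []conf{
		{Name: "joj.yml", Regex: `https://videoportal.joj.sk/.*`},
		{Name: "1337x.yml", Regex: `https://1337x.to/.*`},
	}
	if c, ok := findConf(confs, "https://1337x.to/top-100"); ok {
		fmt.Println("using", c.Name) // prints: using 1337x.yml
	}
}
```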

Installation

Install latest with go:

go get git.sr.ht/~ghost08/ratt@latest

Install on Arch Linux from AUR with your favorite helper:

yay -S ratt-git

Issues

File bugs and TODOs through the issue tracker or send an email to ~ghost08/ratt@todo.sr.ht. For general discussion, use the mailing list: ~ghost08/ratt@lists.sr.ht.

Usage

ratt has three commands:

auto - automatically searches for the config to use for the given URL.

extract - scrapes the website using the selectors passed as arguments and generates the RSS/Atom feed.

save - once you have the correct CSS selectors/Lua scripts, saves the config to a yaml file.

ratt save --feed-title=".featured-heading strong" --item-container=".table-list-wrap tbody tr" --item-title="a:nth-child(2)" --item-link='a = sel:find("a:nth-child(2)")
itemURL = "https://1337x.to" .. a:attr("href")
doc, err = goquery.newDocFromURL(itemURL)
if err ~= nil then
    error(err)
end
link = doc:find("ul li a[onclick]"):first():attr("href")
link = link:gsub("%s+", "")
print(link)' --item-created=".coll-date" --item-created-format="" "https://1337x.to/.*" 1337x.yml

What will I do with this RSS feed?

That's a very good question. I'm happy you asked :)

You can pipe the feed directly to photon, a modern RSS/Atom reader. photon will play the media from your feed. It uses mpv and youtube-dl to automatically play videos, download torrents, view images and much more :)

So try this out:

ratt auto https://1337x.to/top-100 | photon -

[Screenshot: photon displaying the 1337x feed]

Lua

If a CSS selector isn't enough to select the needed data, every feed and item attribute can be written as a multiline value, and ratt will interpret it as a Lua script.

The Lua script will get some global variables, to help with the extraction:

goquery is a module imported by default. It is a subset of the well-known goquery library.

sel is the selection object of the feed/item container, on which the selectors can be queried.

gojq is a module imported by default. It is the gojq library.

setGlobal sets a global variable that will be visible in other Lua scripts. E.g. if the feed title script calls setGlobal("myvar", 1), then every subsequent script (item title, item link, ..., item image) can read the variable, for example with print(myvar).

index is the number of the item being processed.

ratt takes the stdout of the Lua script and inserts it as the data of the feed/item. If an error occurs, raise it with the error function.

Examples

Calling another link, parsing it to a goquery.Document and querying the new doc:

item:
  #select the item container html element
  container: .table-list-wrap tbody tr
  #select the title element in the item container
  title: a:nth-child(2)
  #lua script
  link: |-
    --sel is the item container element, find <a/>
    a = sel:find("a:nth-child(2)")
    --get the href attribute of <a/> and build the item URL from it
    itemURL = "https://1337x.to" .. a:attr("href")
    --request and parse the document
    doc, err = goquery.newDocFromURL(itemURL)
    if err ~= nil then
      --raise an error if the request was unsuccessful
      error(err)
    end
    --find the item link you want
    link = doc:find("ul li a[onclick]"):first():attr("href")
    --trim whitespace characters
    link = link:gsub("%s+", "")
    --and finally print the link so ratt can use it as the item link
    print(link)

You can also parse and query JSON data with the help of the awesome gojq library:

feed:
    title: .title
    description: |-
        --find the <script> element where the json data is
        script = sel:find("script"):first():text()
        index = script:find("var myJsonData =")
        --cut off the "var myJsonData =" prefix
        jsonData = script:sub(index+16)
        --parse a gojq query that will find the obj["description"] value
        query, err = gojq.parse(".description")
        if err ~= nil then
          error(err)
        end
        --expecting that the input data is a map/object (otherwise, if it's an array, use runArray)
        desc, err = query.runMap(jsonData)
        if err ~= nil then
          error(err)
        end
        print(desc[1]["description"])

Check the confs dir for other examples.

Contribution

ratt needs config files to run. I really rely on the community to create configs for all the sites!

So please create config files and push them here, then everybody can make the world RSS again!

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConstructFeed

func ConstructFeed(doc *goquery.Document, u string, selectors Selectors, verbose bool) (feed *feeds.Feed, err error)

func ConstructFeedFromURL

func ConstructFeedFromURL(url *url.URL, selectors Selectors, verbose bool) (feed *feeds.Feed, err error)

func Extract

func Extract(link *url.URL, selectors Selectors, outputType string, verbose bool, encode string)

func Save

func Save(filepath, regex string, selectors Selectors)

Types

type ByCreated

type ByCreated []*feeds.Item

func (ByCreated) Len

func (a ByCreated) Len() int

func (ByCreated) Less

func (a ByCreated) Less(i, j int) bool

func (ByCreated) Swap

func (a ByCreated) Swap(i, j int)

type ByTitle

type ByTitle []*feeds.Item

func (ByTitle) Len

func (a ByTitle) Len() int

func (ByTitle) Less

func (a ByTitle) Less(i, j int) bool

func (ByTitle) Swap

func (a ByTitle) Swap(i, j int)

type Conf

type Conf struct {
	Regex string
	Selectors
}

type HTTPSettings

type HTTPSettings struct {
	Cookie    map[string]string `optional short:"b" help:"sets the cookie of all outgoing http requests"`
	Header    map[string]string `optional short:"H" help:"sets the headers of all outgoing http requests"`
	UserAgent string            `` /* 196-byte string literal not displayed */
	Insecure  bool              `` /* 222-byte string literal not displayed */
}
var GlobalHTTPSettings *HTTPSettings

func (*HTTPSettings) Client

func (s *HTTPSettings) Client() *http.Client

type Selectors

type Selectors struct {
	HTTPSettings HTTPSettings `yaml: "httpsettings" embed`
	Feed         struct {
		Title       string `required name:"feed-title" help:"css selector for the feed title"`
		Description string `optional name:"feed-description" help:"css selector for the feed description"`
		AuthorName  string `optional help:"css selector for the feed author name"`
		AuthorEmail string `optional help:"css selector for the feed author email"`
	} `yaml:"feed" embed`
	Item struct {
		Container     string `required name:"item-container" help:"css selector for the item container"`
		Title         string `required name:"item-title" help:"css selector for the item title"`
		Link          string `required name:"item-link" help:"css selector for the item link"`
		LinkAttr      string `default:"href" name:"item-link-attr" help:"get attribute value of the item link element"`
		Created       string `required name:"item-created" help:"css selector for the item created time"`
		CreatedFormat string `required name:"item-created-format" help:"css selector for the item created time format"`
		Description   string `name:"item-description" help:"css selector for the item description"`
		Image         string `name:"item-image" help:"css selector for the item image"`
		ImageAttr     string `name:"item-image-attr" default:"src" help:"get attribute value of the item image element"`
	} `yaml:"item" embed`
	NextPage      string   `optional help:"css selector for the link to the next page to be scraped"`
	NextPageAttr  string   `optional default:"href" help:"get attribute value of the next page element"`
	NextPageCount int      `optional help:"how deep to follow the next page link (integer value)"`
	Sort          SortEnum `` /* 155-byte string literal not displayed */
}

func FindSelectors

func FindSelectors(url string, verbose bool) (Selectors, error)

type SortEnum

type SortEnum string
const (
	SortDontSort    SortEnum = ""
	SortReverse     SortEnum = "REVERSE"
	SortCreatedASD  SortEnum = "CREATED_ASD"
	SortCreatedDESC SortEnum = "CREATED_DESC"
	SortTitleASD    SortEnum = "TITLE_ASD"
	SortTitleDESC   SortEnum = "TITLE_DESC"
)
