urlfmt

Version: v1.0.0
Published: Apr 8, 2023 | License: MIT | Imports: 11 | Imported by: 2

README

url-fmt

Create constant URL strings that can be matched, filled, extracted, fetched, and souped all by utilising the power of Go's string interpolation verbs!

Why

Scraping from multiple different URLs can become messy. url-fmt lets you define string constants with string interpolation verbs embedded in them, so that you can insert parameters and path segments. This makes it easy to define enumerations of constant URL formats that can be:

  • Fill-ed: fill a URL format with the given arguments. Acts the same as fmt.Sprintf.
  • Regex-ed: generates a regular expression pattern for the URL format by converting each string interpolation verb into its corresponding regex pattern.
  • Match-ed: match a URL format to an already filled URL string.
  • ExtractArgs-ed: extract the corresponding string interpolation verbs from a filled URL string.
  • Standardise-d: extract the arguments from a filled URL string and fill the URL format with the extracted args.
  • Request-ed: generate an http.Request for the given URL format.
  • Soup-ed: make a request to the given URL format and parse the returned HTML content into a BeautifulSoup-like object that can be searched. The BeautifulSoup implementation comes from Anas Khan's soup library.
  • JSON-ed: make a request to the given URL format and parse the returned JSON content into a map[string]any.

How

Define your URL formats for the URLs that you will access/scrape. We will be using some examples for Steam and Itch.IO scraping:

package main

// The import path ends in "url-fmt" but the package is named urlfmt, so the
// alias is spelled out.
import urlfmt "github.com/andygello555/url-fmt"

const (
	SteamAppPage    urlfmt.URL = "%s://store.steampowered.com/app/%d"
	SteamAppReviews urlfmt.URL = "https://store.steampowered.com/appreviews/%d?json=1&cursor=%s&language=%s&day_range=9223372036854775807&num_per_page=%d&review_type=all&purchase_type=%s&filter=%s&start_date=%d&end_date=%d&date_range_type=%s"
	ItchIOGamePage  urlfmt.URL = "http://%s.itch.io/%s"
)

Notice that the protocol can be written out (https:// or http://) or left as a verb (%s://). Either way, url-fmt adds the HTTPS protocol itself when filling, so the protocol never counts as one of the arguments that you provide. When Regex is called, the protocol is emitted as https? so that both HTTP and HTTPS URLs match.

Then you can use these however you require:

package main

import (
	"fmt"

	urlfmt "github.com/andygello555/url-fmt"
)

const (
	SteamAppPage    urlfmt.URL = "%s://store.steampowered.com/app/%d"
	SteamAppReviews urlfmt.URL = "https://store.steampowered.com/appreviews/%d?json=1&cursor=%s&language=%s&day_range=9223372036854775807&num_per_page=%d&review_type=all&purchase_type=%s&filter=%s&start_date=%d&end_date=%d&date_range_type=%s"
	ItchIOGamePage  urlfmt.URL = "http://%s.itch.io/%s"
)

func main() {
	steamPageURLs := []string{
		"http://store.steampowered.com/app/477160",
		"http://store.steampowered.com/app/477160/Human_Fall_Flat/",
		"https://store.steampowered.com/app/477160",
		"https://store.steampowered.com/app/Human_Fall_Flat/",
	}

	// Check to see which URLs above match the schema of the SteamAppPage URL format
	for _, steamPageURL := range steamPageURLs {
		fmt.Printf("%s matches %s = %t\n", steamPageURL, SteamAppPage.Regex(), SteamAppPage.Match(steamPageURL))
	}

	itchIOPageURLs := []string{
		"https://hempuli.itch.io/baba-files-taxes",
		"https://sokpop.itch.io/ballspell",
	}

	// Extract the arguments from the above Itch.IO game page URLs
	for _, itchIOPageURL := range itchIOPageURLs {
		fmt.Printf("extracted %v from %s\n", ItchIOGamePage.ExtractArgs(itchIOPageURL), itchIOPageURL)
	}

	// Fetch the name for Steam App 477160 via its HTML Steam store page
	if soup, _, err := SteamAppPage.Soup(nil, 477160); err != nil {
		fmt.Printf("Could not get soup for %s, because %s\n", SteamAppPage.Fill(477160), err.Error())
	} else {
		fmt.Println("Name of Steam app 477160:", soup.Find("div", "id", "appHubAppName").Text())
	}

	// Fetch the number of positive and negative reviews for Steam app 477160 via the Steam store's JSON API
	args := []any{477160, "*", "all", 20, "all", "all", -1, -1, "all"}
	if j, _, err := SteamAppReviews.JSON(nil, args...); err != nil {
		fmt.Printf("Could not get reviews for %s, because %s\n", SteamAppReviews.Fill(args...), err.Error())
	} else {
		fmt.Printf("Query summary for Steam app 477160: %v\n", j["query_summary"])
	}
}

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type URL

type URL string

URL represents a URL format, such as a page on Steam. It is usually a format string that has string interpolation applied to it before fetching. The protocol should be given as a string verb at the beginning of the URL.

func (URL) ExtractArgs

func (u URL) ExtractArgs(url string) (args []any)

ExtractArgs extracts the arguments from the given URL that are needed to run the URL.Soup, URL.JSON, and URL.Fill methods. This is useful when taking a URL matched by URL.Match and fetching the soup for that matched URL.

Example
const (
	SteamAppPage   URL = "%s://store.steampowered.com/app/%d"
	ItchIOGamePage URL = "%s://%s.itch.io/%s"
)

fmt.Println(SteamAppPage.ExtractArgs("http://store.steampowered.com/app/477160"))
fmt.Println(SteamAppPage.ExtractArgs("https://store.steampowered.com/app/477160/Human_Fall_Flat/"))
fmt.Println(ItchIOGamePage.ExtractArgs("https://hempuli.itch.io/baba-files-taxes"))
fmt.Println(ItchIOGamePage.ExtractArgs("https://sokpop.itch.io/ballspell"))
Output:

[477160]
[477160]
[hempuli baba-files-taxes]
[sokpop ballspell]
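
The extraction step can be approximated with the standard library alone. The sketch below is not url-fmt's implementation: extractArgs is a hypothetical helper, the two patterns are copied from the Regex example output in this documentation, and returning int for %d is an assumption, since the concrete type that ExtractArgs returns is not stated here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Patterns copied verbatim from the Regex example output in this documentation.
var (
	steamAppRe = regexp.MustCompile(`https?://store.steampowered.com/app/(\d+)`)
	itchIORe   = regexp.MustCompile(`https?://([a-zA-Z0-9-._~]+).itch.io/([a-zA-Z0-9-._~]+)`)
)

// extractArgs is a hypothetical helper. It returns one value per capture
// group, converting groups that belong to a %d verb into an int (assumption).
func extractArgs(re *regexp.Regexp, verbs []string, url string) []any {
	m := re.FindStringSubmatch(url)
	if m == nil {
		return nil // the URL does not match the format
	}
	args := make([]any, 0, len(verbs))
	for i, v := range verbs {
		if i+1 >= len(m) {
			break // fewer capture groups than verbs
		}
		group := m[i+1]
		if v == "%d" {
			if n, err := strconv.Atoi(group); err == nil {
				args = append(args, n)
				continue
			}
		}
		args = append(args, group)
	}
	return args
}

func main() {
	fmt.Println(extractArgs(steamAppRe, []string{"%d"}, "https://store.steampowered.com/app/477160/Human_Fall_Flat/"))
	fmt.Println(extractArgs(itchIORe, []string{"%s", "%s"}, "https://hempuli.itch.io/baba-files-taxes"))
}
```

Because the patterns are not anchored at the end, trailing path segments such as /Human_Fall_Flat/ are ignored rather than rejected.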

func (URL) Fill

func (u URL) Fill(args ...any) string

Fill will apply string interpolation to the URL. The protocol does not need to be included as "https" is always prepended to the args.
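
As a rough illustration of what "prepending https to the args" could look like, here is a minimal sketch using only fmt and strings. The fill function is hypothetical and is not the library's code; it assumes that any protocol already present in the format is swapped for a %s verb before interpolation, as the description of String suggests.

```go
package main

import (
	"fmt"
	"strings"
)

// fill is a hypothetical stand-in for URL.Fill. It rewrites the protocol of
// the format to "%s://" and then interpolates with "https" as the first arg.
func fill(format string, args ...any) string {
	// Strip an existing protocol ("%s", "http" or "https") if there is one.
	if i := strings.Index(format, "://"); i >= 0 {
		format = format[i+len("://"):]
	}
	// Prepend "https" so that callers only pass the remaining arguments.
	all := append([]any{"https"}, args...)
	return fmt.Sprintf("%s://"+format, all...)
}

func main() {
	fmt.Println(fill("%s://store.steampowered.com/app/%d", 477160))
	fmt.Println(fill("http://%s.itch.io/%s", "sokpop", "ballspell"))
}
```

With this approach the caller's argument list lines up with the verbs after the protocol, regardless of how the protocol was written in the constant.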

func (URL) GetRequest

func (u URL) GetRequest(args ...any) (url string, req *http.Request, err error)

GetRequest creates a new http.MethodGet http.Request for the given URL with the given arguments.

func (URL) JSON

func (u URL) JSON(req *http.Request, args ...any) (jsonBody map[string]any, resp *http.Response, err error)

JSON makes a request to the URL and parses the response as JSON. As well as returning the parsed JSON as a map, it also returns the response to the original HTTP request made to the given URL. If a non-nil http.Request is provided then it will be used to fetch the JSON resource, otherwise a default http.MethodGet http.Request will be constructed instead.

Example
const SteamAppReviews URL = "%s://store.steampowered.com/appreviews/%d?json=1&cursor=%s&language=%s&day_range=9223372036854775807&num_per_page=%d&review_type=all&purchase_type=%s&filter=%s&start_date=%d&end_date=%d&date_range_type=%s"
args := []any{477160, "*", "all", 20, "all", "all", -1, -1, "all"}
fmt.Printf("Getting review stats for 477160 from %s:\n", SteamAppReviews.Fill(args...))
if j, _, err := SteamAppReviews.JSON(nil, args...); err != nil {
	fmt.Printf("Could not get reviews for %s, because %s", SteamAppReviews.Fill(args...), err.Error())
} else {
	var i int
	keys := make([]string, len(j["query_summary"].(map[string]any)))
	for key := range j["query_summary"].(map[string]any) {
		keys[i] = key
		i++
	}
	sort.Strings(keys)
	fmt.Println(keys)
}
Output:

Getting review stats for 477160 from https://store.steampowered.com/appreviews/477160?json=1&cursor=*&language=all&day_range=9223372036854775807&num_per_page=20&review_type=all&purchase_type=all&filter=all&start_date=-1&end_date=-1&date_range_type=all:
[num_reviews review_score review_score_desc total_negative total_positive total_reviews]

func (URL) Match

func (u URL) Match(url string) bool

Match reports whether the given URL string has the same format as this URL.

Example
const (
	SteamAppPage   URL = "%s://store.steampowered.com/app/%d"
	ItchIOGamePage URL = "%s://%s.itch.io/%s"
)

fmt.Println(SteamAppPage.Match("http://store.steampowered.com/app/477160"))
fmt.Println(SteamAppPage.Match("http://store.steampowered.com/app/477160/Human_Fall_Flat/"))
fmt.Println(SteamAppPage.Match("https://store.steampowered.com/app/477160"))
fmt.Println(SteamAppPage.Match("https://store.steampowered.com/app/Human_Fall_Flat/"))
fmt.Println(ItchIOGamePage.Match("https://hempuli.itch.io/baba-files-taxes"))
fmt.Println(ItchIOGamePage.Match("https://sokpop.itch.io/ballspell"))
Output:

true
true
true
false
true
true

func (URL) Regex

func (u URL) Regex() *regexp.Regexp

Regex converts the URL to a regex by replacing the string interpolation verbs with their regex character set counterparts.

Example
const (
	SteamAppPage   URL = "%s://store.steampowered.com/app/%d"
	ItchIOGamePage URL = "%s://%s.itch.io/%s"
)

fmt.Println(SteamAppPage.Regex())
fmt.Println(ItchIOGamePage.Regex())
Output:

https?://store.steampowered.com/app/(\d+)
https?://([a-zA-Z0-9-._~]+).itch.io/([a-zA-Z0-9-._~]+)
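
The conversion can be sketched in a few lines of standard-library Go. This is an illustration under stated assumptions, not the library's source: formatToRegex is a hypothetical helper, only the %s and %d verbs are handled (their patterns are copied from the output above), and literal text is escaped with regexp.QuoteMeta, which the output above suggests url-fmt itself does not do.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// verbPatterns holds the two verb-to-pattern pairs visible in the example
// output above. Other verbs are not handled by this sketch.
var verbPatterns = map[string]string{
	"%d": `(\d+)`,
	"%s": `([a-zA-Z0-9-._~]+)`,
}

// verb finds the interpolation verbs that this sketch knows about.
var verb = regexp.MustCompile(`%[sd]`)

// formatToRegex is a hypothetical helper, not url-fmt's implementation.
func formatToRegex(format string) *regexp.Regexp {
	// Drop whatever protocol the format carries ("%s", "http" or "https").
	if i := strings.Index(format, "://"); i >= 0 {
		format = format[i+len("://"):]
	}
	var b strings.Builder
	b.WriteString("https?://")
	last := 0
	for _, loc := range verb.FindAllStringIndex(format, -1) {
		// Escape the literal text between verbs (a deviation from the output above).
		b.WriteString(regexp.QuoteMeta(format[last:loc[0]]))
		b.WriteString(verbPatterns[format[loc[0]:loc[1]]])
		last = loc[1]
	}
	b.WriteString(regexp.QuoteMeta(format[last:]))
	return regexp.MustCompile(b.String())
}

func main() {
	re := formatToRegex("%s://store.steampowered.com/app/%d")
	fmt.Println(re)
	fmt.Println(re.MatchString("http://store.steampowered.com/app/477160/Human_Fall_Flat/"))
	fmt.Println(re.MatchString("https://store.steampowered.com/app/Human_Fall_Flat/"))
}
```

Escaping the dots makes the pattern stricter than the one printed above, where an unescaped . matches any character.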

func (URL) Request

func (u URL) Request(method string, body io.Reader, args ...any) (url string, req *http.Request, err error)

Request creates a new http.Request for the given URL with the given arguments, method, and io.Reader.
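
A plausible shape for this method, sketched with net/http only, is to interpolate the format and hand the result to http.NewRequest. The newRequest function below is hypothetical and takes an already normalised format string; how url-fmt handles the protocol at this step is not shown here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// newRequest is a hypothetical helper mirroring the documented return values
// of URL.Request: the filled URL, the request, and any error.
func newRequest(method string, body io.Reader, format string, args ...any) (string, *http.Request, error) {
	url := fmt.Sprintf(format, args...)
	// http.NewRequest only builds the request; nothing is sent yet.
	req, err := http.NewRequest(method, url, body)
	return url, req, err
}

func main() {
	url, req, err := newRequest(http.MethodGet, nil, "https://store.steampowered.com/app/%d", 477160)
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	fmt.Println(url, req.Method, req.URL.Host)
}
```

Returning the request unsent lets the caller add headers or cookies before passing it on, for instance as the req argument of Soup or JSON.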

func (URL) RetryJSON

func (u URL) RetryJSON(req *http.Request, maxTries int, minDelay time.Duration, try func(jsonBody map[string]any, resp *http.Response) error, args ...any) error

RetryJSON will run JSON with the given args and try the given function. If the function returns an error then the function will be retried up to a total of the given number of maxTries. If minDelay is given, and is not 0, then before the function is retried it will sleep for (maxTries + 1 - currentTries) * minDelay. If a non-nil http.Request is provided then it will be used to fetch the JSON resource, otherwise a default http.MethodGet http.Request will be constructed instead.

func (URL) RetrySoup

func (u URL) RetrySoup(req *http.Request, maxTries int, minDelay time.Duration, try func(doc *soup.Root, resp *http.Response) error, args ...any) error

RetrySoup will run Soup with the given args and try the given function. If the function returns an error then the function will be retried up to a total of the given number of maxTries. If minDelay is given, and is not 0, then before the function is retried it will sleep for (maxTries + 1 - currentTries) * minDelay. If a non-nil http.Request is provided then it will be used to fetch the page for the Soup, otherwise a default http.MethodGet http.Request will be constructed instead.

func (URL) Soup

func (u URL) Soup(req *http.Request, args ...any) (doc *soup.Root, resp *http.Response, err error)

Soup fetches the URL using the default HTTP client, then parses the returned HTML page into a soup.Root. It also returns the http.Response object returned by the http.Get request. An http.Request can be provided, but if nil is provided then a default http.MethodGet http.Request will be constructed instead.

Example
const SteamAppPage URL = "%s://store.steampowered.com/app/%d"
fmt.Printf("Getting name of app 477160 from %s:\n", SteamAppPage.Fill(477160))
if soup, _, err := SteamAppPage.Soup(nil, 477160); err != nil {
	fmt.Printf("Could not get soup for %s, because %s", SteamAppPage.Fill(477160), err.Error())
} else {
	fmt.Println(soup.Find("div", "id", "appHubAppName").Text())
}
Output:

Getting name of app 477160 from https://store.steampowered.com/app/477160:
Human: Fall Flat

func (URL) Standardise

func (u URL) Standardise(url string) string

Standardise first extracts the args from the given URL, then fills this URL format with those args.
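
For a single format, the extract-then-fill round trip might look like the sketch below. It is standard-library only and is not url-fmt's code: standardiseSteamApp is a hypothetical helper, the pattern is copied from the Regex example output, and returning false for a non-matching URL is an assumption, since the behaviour of Standardise in that case is not documented here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the Regex example output for SteamAppPage.
var steamAppRe = regexp.MustCompile(`https?://store.steampowered.com/app/(\d+)`)

// standardiseSteamApp is a hypothetical helper: it extracts the app ID from
// url and fills it back into the canonical HTTPS form of the format.
func standardiseSteamApp(url string) (string, bool) {
	m := steamAppRe.FindStringSubmatch(url)
	if m == nil {
		return "", false // assumption: report non-matching URLs to the caller
	}
	return fmt.Sprintf("https://store.steampowered.com/app/%s", m[1]), true
}

func main() {
	fmt.Println(standardiseSteamApp("http://store.steampowered.com/app/477160/Human_Fall_Flat/"))
	fmt.Println(standardiseSteamApp("https://store.steampowered.com/app/Human_Fall_Flat/"))
}
```

Standardising in this way gives different spellings of the same page one canonical string, which is handy for deduplicating scraped links.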

func (URL) String

func (u URL) String() string

String returns the un-formatted URL with the protocol written as a verb:

"%s://"

This replaces an existing protocol if there is one, or adds one if there isn't.
