parser

package
v1.0.1
Published: May 10, 2025 License: MIT Imports: 11 Imported by: 0

Documentation

Index

Constants

View Source
const WikiPrefix string = "/wiki/"

WikiPrefix being present in a URL means that the URL points to another Wikipedia page
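As a sketch of how the constant is typically used (the helper name isWikiLink is hypothetical, not part of the package):

```go
package main

import (
	"fmt"
	"strings"
)

// WikiPrefix mirrors the package constant: an href that starts with
// "/wiki/" is a relative link to another Wikipedia page.
const WikiPrefix string = "/wiki/"

// isWikiLink is a hypothetical helper showing the usual prefix check.
func isWikiLink(href string) bool {
	return strings.HasPrefix(href, WikiPrefix)
}

func main() {
	fmt.Println(isWikiLink("/wiki/Main_Page"))      // true
	fmt.Println(isWikiLink("https://example.com/")) // false
}
```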

Variables

This section is empty.

Functions

func FindAllLinks

func FindAllLinks(n *html.Node, links *[]string)

FindAllLinks extracts all &lt;a&gt; tags from the HTML node and collects their href attribute values

func FindBodyContent

func FindBodyContent(n *html.Node) *html.Node

FindBodyContent recursively walks the HTML tree until it finds the div with id "bodyContent"

func FormatURL

func FormatURL(href string) string

FormatURL joins the provided href to the Wikipedia host
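A minimal sketch of the idea, not the package's implementation — the English host name here is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// formatURL joins a relative href onto a Wikipedia host. The base URL
// is an assumed value; the package may use a different host.
func formatURL(href string) string {
	joined, err := url.JoinPath("https://en.wikipedia.org", href)
	if err != nil {
		return href // fall back to the raw href on malformed input
	}
	return joined
}

func main() {
	fmt.Println(formatURL("/wiki/Main_Page"))
	// https://en.wikipedia.org/wiki/Main_Page
}
```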

func GetResponse

func GetResponse(url string) *http.Response

GetResponse prepares a request and gets a response from Wikipedia. If the server returns an invalid response status code or the request fails, it exits with a fatal log. Otherwise, it returns the response data

func PrintReport

func PrintReport(report *PageReport) string

PrintReport builds the full path to the target URL and formats it as a string report
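Since each PageReport carries a Parent pointer, the path can be recovered by walking back to the root. An illustrative sketch, not the package's actual formatting:

```go
package main

import (
	"fmt"
	"strings"
)

// pageReport mirrors the exported PageReport type; only the fields the
// sketch needs are shown.
type pageReport struct {
	Parent *pageReport
	Url    string
	Depth  int
}

// printReport walks the Parent chain back to the root and joins the
// URLs into a readable path. The arrow separator is an assumption.
func printReport(report *pageReport) string {
	var path []string
	for r := report; r != nil; r = r.Parent {
		path = append(path, r.Url)
	}
	// Reverse so the path reads root -> target.
	for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
		path[i], path[j] = path[j], path[i]
	}
	return strings.Join(path, " -> ")
}

func main() {
	root := &pageReport{Url: "https://en.wikipedia.org/wiki/Go"}
	child := &pageReport{Parent: root, Url: "https://en.wikipedia.org/wiki/Google", Depth: 1}
	fmt.Println(printReport(child))
	// https://en.wikipedia.org/wiki/Go -> https://en.wikipedia.org/wiki/Google
}
```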

func ProcessPage

func ProcessPage(url string) *[]string

ProcessPage gets the page data, parses the HTML to extract all links, removes duplicate URLs from the slice, and returns the resulting slice
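The duplicate-removal step can be sketched with a seen-map that preserves first-appearance order (the helper name dedupe is hypothetical):

```go
package main

import "fmt"

// dedupe removes duplicate URLs while preserving the order of first
// appearance, using a map to track what has been seen.
func dedupe(urls []string) []string {
	seen := make(map[string]bool, len(urls))
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		if !seen[u] {
			seen[u] = true
			out = append(out, u)
		}
	}
	return out
}

func main() {
	links := []string{"/wiki/Go", "/wiki/C", "/wiki/Go"}
	fmt.Println(dedupe(links)) // [/wiki/Go /wiki/C]
}
```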

func RemoveLangReference

func RemoveLangReference(url string) string

RemoveLangReference uses a regular expression to remove the language-specific subdomain from a Wikipedia URL
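The exact pattern the package uses is not shown in the docs; a plausible sketch strips a two- or three-letter language subdomain:

```go
package main

import (
	"fmt"
	"regexp"
)

// langSubdomain matches a language-code subdomain such as "de." or
// "fr." in front of wikipedia.org. This pattern is an illustrative
// guess, not the package's actual expression.
var langSubdomain = regexp.MustCompile(`//[a-z]{2,3}\.wikipedia\.org`)

func removeLangReference(url string) string {
	return langSubdomain.ReplaceAllString(url, "//wikipedia.org")
}

func main() {
	fmt.Println(removeLangReference("https://de.wikipedia.org/wiki/Go"))
	// https://wikipedia.org/wiki/Go
}
```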

func ValidateURL

func ValidateURL(url string) (string, error)

ValidateURL checks that the URL points to a Wikipedia page and not to any other content

func ValidateWikiURl

func ValidateWikiURl(url string) (string, error)

ValidateWikiURl checks whether the URL addresses a wiki page

Types

type PageReport

type PageReport struct {
	Parent   *PageReport
	Url      string
	Children *[]string
	Depth    int
}

func WideSearch

func WideSearch(initialUrl *string, targetUrl *string, maxConcurrency *int, maxDepth *int, ctx context.Context, cancel context.CancelFunc) (*PageReport, error)

WideSearch repeatedly requests pages until it finds one that contains the target URL or reaches the maximum depth
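The core idea is a breadth-first search. A sketch over a stubbed link graph — the real function fetches pages concurrently and tracks *PageReport values rather than plain strings:

```go
package main

import "fmt"

// wideSearch does a breadth-first walk over a precomputed link graph
// until it reaches target or exhausts maxDepth, returning the path.
func wideSearch(links map[string][]string, start, target string, maxDepth int) ([]string, bool) {
	type node struct {
		url  string
		path []string
	}
	queue := []node{{start, []string{start}}}
	visited := map[string]bool{start: true}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.url == target {
			return cur.path, true
		}
		if len(cur.path)-1 >= maxDepth {
			continue // depth limit reached; do not expand further
		}
		for _, next := range links[cur.url] {
			if !visited[next] {
				visited[next] = true
				// Copy the path so sibling branches stay independent.
				queue = append(queue, node{next, append(append([]string{}, cur.path...), next)})
			}
		}
	}
	return nil, false
}

func main() {
	links := map[string][]string{
		"/wiki/Go":     {"/wiki/Google", "/wiki/C"},
		"/wiki/Google": {"/wiki/Search"},
	}
	path, ok := wideSearch(links, "/wiki/Go", "/wiki/Search", 3)
	fmt.Println(ok, path) // true [/wiki/Go /wiki/Google /wiki/Search]
}
```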

func (*PageReport) Process

func (p *PageReport) Process(
	ctx context.Context,
	sem chan struct{},
	wg *sync.WaitGroup,
	mu *sync.Mutex,
	processedUrls map[string]bool,
	reports *[]*PageReport,
	queue *[]*PageReport,
	finalReport **PageReport,
	targetUrl *string,
	pageCount *int,
	cancel context.CancelFunc,
)

Process requests the URLs collected in the page. If targetUrl is found among them, processing stops and the final report is compiled. Otherwise, the queue is updated and parsing continues
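The parameter list suggests a common Go concurrency pattern, sketched here under that assumption: a buffered channel acts as a counting semaphore capping in-flight workers, a WaitGroup waits for completion, and a mutex guards the shared map of processed URLs.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs one goroutine per URL, bounded by maxConcurrency,
// and returns the number of unique URLs handled. The fetch/parse work
// itself is elided.
func processAll(urls []string, maxConcurrency int) int {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := make(map[string]bool)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it

			mu.Lock()
			defer mu.Unlock()
			if processed[u] {
				return // already handled elsewhere
			}
			processed[u] = true
			// ... fetch and parse u here ...
		}(u)
	}
	wg.Wait()
	return len(processed)
}

func main() {
	fmt.Println(processAll([]string{"/wiki/A", "/wiki/B", "/wiki/C", "/wiki/A"}, 2)) // 3
}
```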
