parser

package
v1.0.1
Published: May 10, 2025 License: MIT Imports: 11 Imported by: 0

Documentation

Index

Constants

View Source
const WikiPrefix string = "/wiki/"

WikiPrefix being present in a URL means that the URL points to another Wikipedia page
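As a sketch of how the constant is typically used (the helper name isWikiLink is hypothetical, not part of the package):

```go
package main

import (
	"fmt"
	"strings"
)

// WikiPrefix mirrors the package constant: an href that starts with
// "/wiki/" is a relative link to another Wikipedia page.
const WikiPrefix string = "/wiki/"

// isWikiLink is a hypothetical helper showing the usual prefix check.
func isWikiLink(href string) bool {
	return strings.HasPrefix(href, WikiPrefix)
}

func main() {
	fmt.Println(isWikiLink("/wiki/Main_Page"))      // true
	fmt.Println(isWikiLink("https://example.com/")) // false
}
```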

Variables

This section is empty.

Functions

func FindAllLinks

func FindAllLinks(n *html.Node, links *[]string)

FindAllLinks extracts all &lt;a&gt; tags from the HTML node and collects their href attribute values

func FindBodyContent

func FindBodyContent(n *html.Node) *html.Node

FindBodyContent recursively walks the HTML tree until it finds the div with id "bodyContent"

func FormatURL

func FormatURL(href string) string

FormatURL joins the provided href to the Wikipedia host
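A minimal sketch of the idea, not the package's implementation — the English host name here is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// formatURL joins a relative href onto a Wikipedia host. The base URL
// is an assumed value; the package may use a different host.
func formatURL(href string) string {
	joined, err := url.JoinPath("https://en.wikipedia.org", href)
	if err != nil {
		return href // fall back to the raw href on malformed input
	}
	return joined
}

func main() {
	fmt.Println(formatURL("/wiki/Main_Page"))
	// https://en.wikipedia.org/wiki/Main_Page
}
```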

func GetResponse

func GetResponse(url string) *http.Response

GetResponse prepares a request and gets a response from Wikipedia. If the server returns an invalid response status code or the request fails, it exits with a fatal log. Otherwise, it returns the response data

func PrintReport

func PrintReport(report *PageReport) string

PrintReport builds the full path to the target URL and formats it as a string report
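Since each PageReport carries a Parent pointer, the path can be recovered by walking back to the root. An illustrative sketch, not the package's actual formatting:

```go
package main

import (
	"fmt"
	"strings"
)

// pageReport mirrors the exported PageReport type; only the fields the
// sketch needs are shown.
type pageReport struct {
	Parent *pageReport
	Url    string
	Depth  int
}

// printReport walks the Parent chain back to the root and joins the
// URLs into a readable path. The arrow separator is an assumption.
func printReport(report *pageReport) string {
	var path []string
	for r := report; r != nil; r = r.Parent {
		path = append(path, r.Url)
	}
	// Reverse so the path reads root -> target.
	for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
		path[i], path[j] = path[j], path[i]
	}
	return strings.Join(path, " -> ")
}

func main() {
	root := &pageReport{Url: "https://en.wikipedia.org/wiki/Go"}
	child := &pageReport{Parent: root, Url: "https://en.wikipedia.org/wiki/Google", Depth: 1}
	fmt.Println(printReport(child))
	// https://en.wikipedia.org/wiki/Go -> https://en.wikipedia.org/wiki/Google
}
```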

func ProcessPage

func ProcessPage(url string) *[]string

ProcessPage gets the page data, parses the HTML to extract all links, removes duplicate URLs from the slice, and returns the resulting slice
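The duplicate-removal step can be sketched with a seen-map that preserves first-appearance order (the helper name dedupe is hypothetical):

```go
package main

import "fmt"

// dedupe removes duplicate URLs while preserving the order of first
// appearance, using a map to track what has been seen.
func dedupe(urls []string) []string {
	seen := make(map[string]bool, len(urls))
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		if !seen[u] {
			seen[u] = true
			out = append(out, u)
		}
	}
	return out
}

func main() {
	links := []string{"/wiki/Go", "/wiki/C", "/wiki/Go"}
	fmt.Println(dedupe(links)) // [/wiki/Go /wiki/C]
}
```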

func RemoveLangReference

func RemoveLangReference(url string) string

RemoveLangReference uses a regular expression to remove the language-specific subdomain from a Wikipedia URL
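The exact pattern the package uses is not shown in the docs; a plausible sketch strips a two- or three-letter language subdomain:

```go
package main

import (
	"fmt"
	"regexp"
)

// langSubdomain matches a language-code subdomain such as "de." or
// "fr." in front of wikipedia.org. This pattern is an illustrative
// guess, not the package's actual expression.
var langSubdomain = regexp.MustCompile(`//[a-z]{2,3}\.wikipedia\.org`)

func removeLangReference(url string) string {
	return langSubdomain.ReplaceAllString(url, "//wikipedia.org")
}

func main() {
	fmt.Println(removeLangReference("https://de.wikipedia.org/wiki/Go"))
	// https://wikipedia.org/wiki/Go
}
```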

func ValidateURL

func ValidateURL(url string) (string, error)

ValidateURL checks that the URL points to a Wikipedia page and not to any other content

func ValidateWikiURl

func ValidateWikiURl(url string) (string, error)

ValidateWikiURl checks whether the URL addresses a wiki page

Types

type PageReport

type PageReport struct {
	Parent   *PageReport
	Url      string
	Children *[]string
	Depth    int
}

func WideSearch

func WideSearch(initialUrl *string, targetUrl *string, maxConcurrency *int, maxDepth *int, ctx context.Context, cancel context.CancelFunc) (*PageReport, error)

WideSearch repeatedly requests pages until it finds one that contains the target URL or reaches the maximum depth
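The core idea is a breadth-first search. A sketch over a stubbed link graph — the real function fetches pages concurrently and tracks *PageReport values rather than plain strings:

```go
package main

import "fmt"

// wideSearch does a breadth-first walk over a precomputed link graph
// until it reaches target or exhausts maxDepth, returning the path.
func wideSearch(links map[string][]string, start, target string, maxDepth int) ([]string, bool) {
	type node struct {
		url  string
		path []string
	}
	queue := []node{{start, []string{start}}}
	visited := map[string]bool{start: true}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.url == target {
			return cur.path, true
		}
		if len(cur.path)-1 >= maxDepth {
			continue // depth limit reached; do not expand further
		}
		for _, next := range links[cur.url] {
			if !visited[next] {
				visited[next] = true
				// Copy the path so sibling branches stay independent.
				queue = append(queue, node{next, append(append([]string{}, cur.path...), next)})
			}
		}
	}
	return nil, false
}

func main() {
	links := map[string][]string{
		"/wiki/Go":     {"/wiki/Google", "/wiki/C"},
		"/wiki/Google": {"/wiki/Search"},
	}
	path, ok := wideSearch(links, "/wiki/Go", "/wiki/Search", 3)
	fmt.Println(ok, path) // true [/wiki/Go /wiki/Google /wiki/Search]
}
```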

func (*PageReport) Process

func (p *PageReport) Process(
	ctx context.Context,
	sem chan struct{},
	wg *sync.WaitGroup,
	mu *sync.Mutex,
	processedUrls map[string]bool,
	reports *[]*PageReport,
	queue *[]*PageReport,
	finalReport **PageReport,
	targetUrl *string,
	pageCount *int,
	cancel context.CancelFunc,
)

Process requests the URLs collected in the page. If targetUrl is found among them, processing stops and the final report is compiled. Otherwise, the queue is updated and parsing continues
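The parameter list suggests a common Go concurrency pattern, sketched here under that assumption: a buffered channel acts as a counting semaphore capping in-flight workers, a WaitGroup waits for completion, and a mutex guards the shared map of processed URLs.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs one goroutine per URL, bounded by maxConcurrency,
// and returns the number of unique URLs handled. The fetch/parse work
// itself is elided.
func processAll(urls []string, maxConcurrency int) int {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := make(map[string]bool)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it

			mu.Lock()
			defer mu.Unlock()
			if processed[u] {
				return // already handled elsewhere
			}
			processed[u] = true
			// ... fetch and parse u here ...
		}(u)
	}
	wg.Wait()
	return len(processed)
}

func main() {
	fmt.Println(processAll([]string{"/wiki/A", "/wiki/B", "/wiki/C", "/wiki/A"}, 2)) // 3
}
```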
