Documentation ¶
Index ¶
- Constants
- func FindAllLinks(n *html.Node, links *[]string)
- func FindBodyContent(n *html.Node) *html.Node
- func FormatURL(href string) string
- func GetResponse(url string) *http.Response
- func PrintReport(report *PageReport) string
- func ProcessPage(url string) *[]string
- func RemoveLangReference(url string) string
- func ValidateURL(url string) (string, error)
- func ValidateWikiURl(url string) (string, error)
- type PageReport
  - func WideSearch(initialUrl *string, targetUrl *string, maxConcurrency *int, maxDepth *int, ctx context.Context, cancel context.CancelFunc) (*PageReport, error)
  - func (p *PageReport) Process(ctx context.Context, sem chan struct{}, wg *sync.WaitGroup, mu *sync.Mutex, processedUrls map[string]bool, reports *[]*PageReport, queue *[]*PageReport, finalReport **PageReport, targetUrl *string, pageCount *int, cancel context.CancelFunc)
Constants ¶
const WikiPrefix string = "/wiki/"
The presence of WikiPrefix in a URL means that the URL points to another Wikipedia page
Variables ¶
This section is empty.
Functions ¶
func FindAllLinks ¶
func FindAllLinks(n *html.Node, links *[]string)
FindAllLinks extracts all <a> tags from the HTML node tree and collects their href attribute values into links
func FindBodyContent ¶
func FindBodyContent(n *html.Node) *html.Node
FindBodyContent recursively iterates through the HTML tree until it finds a div with ID = "bodyContent"
func GetResponse ¶
func GetResponse(url string) *http.Response
GetResponse prepares the request and gets the response from Wikipedia. If the server returns an invalid response status code or the request fails with an error, the function exits with a fatal log. Otherwise, it returns the response data
func PrintReport ¶
func PrintReport(report *PageReport) string
PrintReport builds the full path to the target URL and formats it as a string report
func ProcessPage ¶
func ProcessPage(url string) *[]string
ProcessPage gets the page data, parses the HTML to extract all links, then removes duplicate URLs and returns the resulting slice
func RemoveLangReference ¶
func RemoveLangReference(url string) string
RemoveLangReference uses a regular expression to remove the language-specific domain from a Wikipedia URL
func ValidateURL ¶
func ValidateURL(url string) (string, error)
ValidateURL checks that the URL points to a Wikipedia page and not to any other content
func ValidateWikiURl ¶
func ValidateWikiURl(url string) (string, error)
ValidateWikiURl checks whether the URL addresses a Wiki page
Types ¶
type PageReport ¶
type PageReport struct {
	Parent   *PageReport
	Url      string
	Children *[]string
	Depth    int
}
func WideSearch ¶
func WideSearch(initialUrl *string, targetUrl *string, maxConcurrency *int, maxDepth *int, ctx context.Context, cancel context.CancelFunc) (*PageReport, error)
WideSearch continually requests pages, level by level, until it finds the one that contains the target URL or reaches the maximum depth
func (*PageReport) Process ¶
func (p *PageReport) Process(
	ctx context.Context,
	sem chan struct{},
	wg *sync.WaitGroup,
	mu *sync.Mutex,
	processedUrls map[string]bool,
	reports *[]*PageReport,
	queue *[]*PageReport,
	finalReport **PageReport,
	targetUrl *string,
	pageCount *int,
	cancel context.CancelFunc,
)
Process requests the URLs collected in the page. If targetUrl is found among them, processing stops and the final report is compiled. Otherwise, the queue is updated and parsing continues