CrawlyColly

Fast Crawler

The purpose of this crawler is to collect all the page URLs of a website very quickly.

It uses a website's sitemaps to discover its pages. The drawback is that pages not listed in a sitemap will not be discovered. However, it is a very fast and efficient way to collect thousands of page URLs in seconds.
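The core idea is simply to fetch the sitemap and read every <loc> entry instead of following links page by page. The following is a minimal sketch of that technique using only the Go standard library; the sitemap path, struct names, and error handling are illustrative assumptions, not the crawler's actual code.

    // Minimal sketch of sitemap-based page discovery (standard library only).
    // Assumption: the sitemap lives at /sitemap.xml on the site root.
    package main

    import (
        "encoding/xml"
        "fmt"
        "net/http"
    )

    // urlSet mirrors the <urlset> root element of a standard sitemap.xml.
    type urlSet struct {
        URLs []struct {
            Loc string `xml:"loc"`
        } `xml:"url"`
    }

    func main() {
        // Fetch the sitemap from the assumed location.
        resp, err := http.Get("https://www.example.com/sitemap.xml")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // Decode every <url><loc> entry; each one is a discovered page URL.
        var s urlSet
        if err := xml.NewDecoder(resp.Body).Decode(&s); err != nil {
            panic(err)
        }
        for _, u := range s.URLs {
            fmt.Println(u.Loc)
        }
    }

A real sitemap may itself be a sitemap index pointing at further sitemap files, which is why sitemap-based discovery stays fast: it is one XML download per sitemap rather than one HTTP request per page.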

Installation

  • Install Golang

Use the crawler

  • Create the folder urls with the command mkdir urls
  • List the websites you want to crawl in a file such as urls_test, one URL per line
  • Compile with go build *.go
  • Run with cat urls_test | ./crawl or, if not compiled, cat urls_test | go run *.go
  • Each website's URLs will be written to urls_<www.yourwebsite.com>.csv (see the sketch after this list)
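The input and output convention above boils down to reading one URL per line from stdin and deriving a per-site output filename from the URL's host. A small sketch of that convention, with illustrative names rather than the crawler's actual code:

    // Minimal sketch: read URLs from stdin and derive urls_<host>.csv names.
    package main

    import (
        "bufio"
        "fmt"
        "net/url"
        "os"
    )

    func main() {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            line := scanner.Text()
            if line == "" {
                continue
            }
            u, err := url.Parse(line)
            if err != nil {
                fmt.Fprintln(os.Stderr, "skipping invalid URL:", line)
                continue
            }
            // Each site's results go to urls_<host>.csv,
            // e.g. urls_www.yourwebsite.com.csv.
            fmt.Println("output file:", fmt.Sprintf("urls_%s.csv", u.Host))
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }

Run it the same way as the crawler, e.g. cat urls_test | go run sketch.go, to see which CSV file each input URL would map to.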

At the end of your crawl, if you want to merge all the files into one, run the following in the urls folder (sed 1d strips each file's header row before concatenating): for filename in $(ls *.csv); do sed 1d $filename >> ../final.csv; done

Disclaimer

Please be advised that even though this crawler doesn't visit every page of a website, it can be very intensive for large websites. Feel free to open pull requests to improve the crawler.
