httpsyet

Version: v0.2.2 | Published: Jun 30, 2018 | License: MIT | Imports: 12 | Imported by: 0

Repository

github.com/GoLangsam/pipe

README ¶

`v3` - Improve the `traffic` network

Overview

It's not easy to see at first, but there is still an interesting issue hiding in the original implementation.

It may become more obvious when we imagine crawling at scale - e.g. finding a thousand new urls on each new page, and going down a couple of levels.

Look: Each site's feedback spawns a new goroutine to run queueURLs for the (e.g. 1000) urls found.

And most likely these will block!

Almost every one of these many goroutines will block most of the time, as there are still urls discovered earlier waiting to be fetched and crawled.

And each queueURLs carries the full slice of urls - no matter how many have already been sent on for processing. (The implementation uses a straightforward range and does not attempt to shrink the slice. Idiomatic Go.)
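The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the original source: only the name queueURLs comes from the text, while its body, the feedAll helper and the unbuffered queue channel are assumptions made for the sake of the example.

```go
package main

import (
	"fmt"
	"sync"
)

// queueURLs is a hypothetical stand-in for the function named in the text:
// it sends every url to the queue and blocks whenever no receiver is ready.
func queueURLs(queue chan<- string, urls []string, wg *sync.WaitGroup) {
	defer wg.Done()
	for _, u := range urls { // the full slice stays referenced until the loop ends
		queue <- u
	}
}

// feedAll (hypothetical) spawns one goroutine per batch of urls
// and returns how many urls were eventually consumed.
func feedAll(batches [][]string) int {
	queue := make(chan string) // unbuffered: each sender blocks until a receiver is ready
	var wg sync.WaitGroup
	for _, urls := range batches {
		wg.Add(1)
		go queueURLs(queue, urls, &wg) // one goroutine per batch, each holding its slice
	}
	go func() {
		wg.Wait()
		close(queue)
	}()

	n := 0
	for range queue {
		n++
	}
	return n
}

func main() {
	batches := [][]string{{"a", "b", "c"}, {"d", "e"}}
	fmt.Println(feedAll(batches)) // prints 5
}
```

With a thousand urls per page and a slow consumer, most of these goroutines would sit inside the range loop, each keeping its whole slice alive.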

We can do this differently. No need to waste many huge slices across plenty of blocked goroutines.

A battery called adjust provides a flexibly buffered pipe. So we use SitePipeAdjust in our network and no longer need to have Feed spawn the queueURLs function. We may feed synchronously now!
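To make the idea of a flexibly buffered pipe concrete, here is a minimal sketch of one common way to build such a stage. It is an assumption for illustration only: the actual adjust battery and SitePipeAdjust may well be implemented differently, and the name pipeAdjust and the use of string items are hypothetical.

```go
package main

import "fmt"

// pipeAdjust (hypothetical) forwards items from in to the returned channel
// and buffers them in a slice whenever the consumer is not ready,
// so a sender on in never waits for the consumer.
func pipeAdjust(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var buf []string
		for in != nil || len(buf) > 0 {
			var send chan string // nil disables the send case while buf is empty
			var next string
			if len(buf) > 0 {
				send = out
				next = buf[0]
			}
			select {
			case v, ok := <-in:
				if !ok {
					in = nil // input closed: stop receiving, keep draining buf
					continue
				}
				buf = append(buf, v)
			case send <- next:
				buf = buf[1:]
			}
		}
	}()
	return out
}

func main() {
	in := make(chan string)
	out := pipeAdjust(in)

	// Feed synchronously, with no consumer running yet.
	for i := 0; i < 1000; i++ {
		in <- fmt.Sprintf("url-%d", i)
	}
	close(in)

	n := 0
	for range out {
		n++
	}
	fmt.Println(n) // prints 1000
}
```

A single goroutine and a single growing and shrinking buffer take the place of many blocked goroutines, which is why the feeding side may become synchronous.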

And now we also do not need to bother Feed with registering new traffic (using t.Add(len(urls))) up front. Instead we use SitePipeEnter (a companion of SiteDoneLeave) at the entrance of our network processor.
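The division of labour between an entering and a leaving stage might look like the sketch below. It assumes the traffic is tracked by a sync.WaitGroup; the functions enter, leave and run are hypothetical and only loosely modelled on what the names SitePipeEnter and SiteDoneLeave suggest, since their real signatures are not shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// enter (hypothetical) registers each item with wg as it enters
// the network, then forwards it downstream.
func enter(in <-chan string, wg *sync.WaitGroup) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for v := range in {
			wg.Add(1) // registered here, not by the feeder
			out <- v
		}
	}()
	return out
}

// leave (hypothetical) marks each item as done once it has been handled.
// The returned channel yields the number of items seen after in is closed.
func leave(in <-chan string, wg *sync.WaitGroup) <-chan int {
	count := make(chan int, 1)
	go func() {
		n := 0
		for range in {
			n++
			wg.Done()
		}
		count <- n
	}()
	return count
}

// run feeds urls through enter and leave and reports how many passed.
func run(urls []string) int {
	var wg sync.WaitGroup
	in := make(chan string)
	count := leave(enter(in, &wg), &wg)
	for _, u := range urls {
		in <- u // plain synchronous feed, no wg.Add(len(urls)) beforehand
	}
	close(in)
	n := <-count
	wg.Wait() // returns at once: every Add has been matched by a Done
	return n
}

func main() {
	fmt.Println(run([]string{"a", "b", "c"})) // prints 3
}
```

In a real crawl, where processing one site feeds new urls back into the network, the ordering between registering an item and waiting for the traffic to finish needs more care than this linear sketch shows.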

Thus, the network becomes more flexible and more self-contained, and places less burden on its surroundings.

Pushing the type site and its related sites, traffic and result into separate sub-packages is just a little more tidying - respecting the original Crawler and its living space.

Some remarks regarding changes to source files compared with the previous version:

`traffic.go`

Simplify Feed as explained, and add two processes (SitePipeEnter and SitePipeAdjust) to the network.

`genny.go` in `traffic/`

Just add a line to use adjust.go

`site.go`

Only another package name

`crawling.go`

Just import the new sub-packages, and adjust where needed.

`crawler_test.go`

Just the import path.

Changes to `crawler.go`

No need to touch.

Documentation ¶

Overview ¶

Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Crawler ¶

type Crawler struct {
	Sites    []string                             // At least one URL.
	Out      io.Writer                            // Required. Writes one detected site per line.
	Log      *log.Logger                          // Required. Errors are reported here.
	Depth    int                                  // Optional. Limit depth. Set to >= 1.
	Parallel int                                  // Optional. Set how many sites to crawl in parallel.
	Delay    time.Duration                        // Optional. Set delay between crawls.
	Get      func(string) (*http.Response, error) // Optional. Defaults to http.Get.
	Verbose  bool                                 // Optional. If set, status updates are written to logger.
}

Crawler is used as configuration for Run. It is validated in Run().
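Since Get accepts any func(string) (*http.Response, error), a test could swap in a function that never touches the network. The following is a minimal sketch that relies only on the field's signature; stubTransport and stubGet are hypothetical names and not part of this package.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// stubTransport (hypothetical) answers every request with a fixed body
// instead of going out to the network.
type stubTransport struct{ body string }

// RoundTrip implements http.RoundTripper.
func (t stubTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	return &http.Response{
		Status:     "200 OK",
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(t.body)),
		Request:    r,
	}, nil
}

// stubGet (hypothetical) returns a function with the same signature
// as http.Get, backed by the stub transport.
func stubGet(body string) func(string) (*http.Response, error) {
	client := &http.Client{Transport: stubTransport{body: body}}
	return client.Get
}

func main() {
	get := stubGet(`<a href="http://example.com/">link</a>`)
	resp, err := get("http://example.invalid/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	page, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, len(page))
}
```

The function returned by stubGet could then be assigned to the Get field of a Crawler before calling Run.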

func (Crawler) Run ¶

func (c Crawler) Run() error

Run the crawler. Can return validation errors. All crawling errors are reported via logger. Output is written to writer. Crawls sites recursively and reports all external links that can be changed to HTTPS. Also reports broken links via error logger.

Directories ¶

Path	Synopsis
result
sites