chromedl

package module

v0.1.1 (latest) Published: May 8, 2021 License: MIT Imports: 15 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Repository

github.com/rusq/chromedl

README ¶

========================
 Chrome File Downloader
========================

.. contents::
   :depth: 2

The sole purpose of this package is to download files from the Internet with
headless Chrome, bypassing Cloudflare and possibly some other annoying browser
checks.

It does so by implementing the solutions posted in "`bypass headless chrome
detection issue`_" for chromedp_.

This library may help when other download methods, such as curl or the
standard `http.Get()`, don't work.

The implementation is based on this `chromedp example`_.

Thanks to `@ZekeLu`_ for the huge help in getting this going.

Compatibility
-------------

Tested with:

* Chrome (stable) v90.0.4430.93.
* github.com/chromedp/chromedp v0.6.12
* github.com/chromedp/cdproto v0.0.0-20210323015217-0942afbea50e

Newer versions of Chrome will require some code changes, as described in `this
issue`_. To stay compatible with the stable version of Chrome listed above, the
package uses calls that are deprecated in newer versions of the protocol.

When using the headless-shell Docker image, use the following tag::

  FROM chromedp/headless-shell:90.0.4430.93


LICENCES
--------
chromedp_: Copyright (c) 2016-2020 Kenneth Shaw


.. _`this issue`: https://github.com/chromedp/chromedp/issues/807
.. _`chromedp example`: https://github.com/chromedp/examples/tree/master/download_file
.. _`@ZekeLu`: https://github.com/ZekeLu
.. _chromedp: https://github.com/chromedp/chromedp
.. _`bypass headless chrome detection issue`: https://github.com/chromedp/chromedp/issues/396

Documentation ¶

Overview ¶

Package chromedl uses chromedp to download files. It may come in handy when one needs to get a file from a protected website that blocks regular methods such as curl or http.Get().

It is heavily based on https://github.com/chromedp/examples/tree/master/download_file with minor modifications.

Index ¶

Constants
Variables
func Download(ctx context.Context, uri string, opts ...Option) (io.Reader, error)
func Get(url string) (*http.Response, error)
type Instance
- func New(options ...Option) (*Instance, error)
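- func NewWithChromeCtx(taskCtx context.Context, options ...Option) (*Instance, error)
- func (bi *Instance) Download(ctx context.Context, uri string) (io.Reader, error)
- func (bi *Instance) Get(url string) (*http.Response, error)
- func (bi *Instance) Stop() error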
type Option
- func OptUserAgent(ua string) Option

Examples ¶

Download

Constants ¶

const DefaultUA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"

DefaultUA is the default user agent string that will be used by the browser instance. It can be changed with OptUserAgent.

Variables ¶

var ErrNoChrome = errors.New("no chrome instance in the context")

ErrNoChrome indicates that there's no chrome instance in the context.

Functions ¶

func Download ¶ added in v0.1.1

func Download(ctx context.Context, uri string, opts ...Option) (io.Reader, error)

Download downloads a file from the provided uri using chromedp. It returns a reader with the buffered file contents and an error, if any. Even when an error is returned, the reader may be non-nil if the file was downloaded and read successfully. Download stores the file in a temporary directory once the download is complete, buffers it, and then tries to clean up. No timeout is set by default; set one on the context if required. Configuration options for the downloader can be passed as optional arguments.

Example ¶

const rbnzRates = "https://www.rbnz.govt.nz/-/media/ReserveBank/Files/Statistics/tables/b1/hb1-daily.xlsx?revision=5fa61401-a877-4607-b7ae-2e060c09935d"
r, err := Download(context.Background(), rbnzRates)
if err != nil {
	log.Fatal(err)
}
data, err := ioutil.ReadAll(r)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("file size > 0: %v\n", len(data) > 0)
fmt.Printf("file signature: %s\n", string(data[0:2]))

Output:

file size > 0: true
file signature: PK
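
Because Download sets no timeout by default, a caller may want to bound each download with a context deadline. The following is a minimal sketch of that idea, not part of the package: `downloadFunc`, `downloadWithTimeout`, `slowStub` and `timedOut` are hypothetical names, and the stub stands in for the real browser so the sketch runs offline. It also discards the reader whenever an error is returned, which is a simplification.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// downloadFunc mirrors the shape of Download without the options
// (hypothetical type). In real code it could wrap the package function:
//
//	dl := func(ctx context.Context, uri string) (io.Reader, error) {
//		return chromedl.Download(ctx, uri)
//	}
type downloadFunc func(ctx context.Context, uri string) (io.Reader, error)

// downloadWithTimeout runs one download under a deadline and returns
// the file contents. The reader is ignored when an error is returned.
func downloadWithTimeout(dl downloadFunc, uri string, d time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel() // release the timer even when the download finishes early

	r, err := dl(ctx, uri)
	if err != nil {
		return nil, err
	}
	return io.ReadAll(r)
}

// slowStub is a stand-in downloader that takes longer than the deadline
// used below, so the context is cancelled first.
func slowStub(ctx context.Context, uri string) (io.Reader, error) {
	select {
	case <-time.After(5 * time.Second):
		return strings.NewReader("PK"), nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// timedOut reports whether the stubbed download hit the deadline.
func timedOut() bool {
	_, err := downloadWithTimeout(slowStub, "https://example.com/file.xlsx", 50*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", timedOut())
}
```

Passing the download function as a value keeps the timeout logic testable without starting Chrome.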

func Get ¶

func Get(url string) (*http.Response, error)

Get is a drop-in replacement for http.Get.
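
Since Get has the same signature as http.Get, calling code can accept the getter as a value and switch between the two. Below is a minimal sketch of that arrangement; `getFunc`, `fetchBody`, `stubGet` and `demoGet` are hypothetical names, and the stub replaces the real getter so the example runs offline. It assumes the response Body is populated; which other http.Response fields this package fills in is not stated on this page.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// getFunc is the signature shared by http.Get and this package's Get
// (hypothetical type name).
type getFunc func(url string) (*http.Response, error)

// fetchBody reads the whole response body through the injected getter.
func fetchBody(get getFunc, url string) (string, error) {
	resp, err := get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// stubGet is a stand-in getter that fabricates a response locally.
func stubGet(url string) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader("payload from " + url)),
	}, nil
}

// demoGet runs fetchBody against the stub.
func demoGet() (string, error) {
	return fetchBody(stubGet, "https://example.com/report.csv")
}

func main() {
	body, err := demoGet()
	fmt.Println(body, err)
}
```

In real code, `fetchBody(http.Get, url)` and `fetchBody(chromedl.Get, url)` would both type-check, which is what makes the swap a one-line change.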

Types ¶

type Instance ¶ added in v0.1.0

type Instance struct {
	// contains filtered or unexported fields
}

Instance is the browser instance that will be used for downloading files.

func New ¶ added in v0.1.0

func New(options ...Option) (*Instance, error)

New creates a new Instance and starts a headless Chrome to perform downloads. Once finished, call Stop to terminate the browser.

func NewWithChromeCtx ¶ added in v0.1.1

func NewWithChromeCtx(taskCtx context.Context, options ...Option) (*Instance, error)

NewWithChromeCtx creates a new Instance for an existing browser instance. Stop will not terminate the browser, but will cancel the event listener.

func (*Instance) Download ¶ added in v0.1.1

func (bi *Instance) Download(ctx context.Context, uri string) (io.Reader, error)

Download downloads the file returning the reader with contents.

func (*Instance) Get ¶ added in v0.1.0

func (bi *Instance) Get(url string) (*http.Response, error)

Get partly emulates http.Get and is meant to be a drop-in replacement for http.Get in the caller's code.

func (*Instance) Stop ¶ added in v0.1.0

func (bi *Instance) Stop() error
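
The Instance methods suggest a lifecycle of create, download, then Stop. The sketch below illustrates that lifecycle against a small interface rather than the real type, so it runs without Chrome. The names `downloader`, `fetchAll`, `stubInstance` and `demoLifecycle` are hypothetical, and reusing one Instance for several downloads is an assumption that this page does not confirm.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// downloader lists the two Instance methods this sketch relies on,
// with the signatures documented above (hypothetical interface).
type downloader interface {
	Download(ctx context.Context, uri string) (io.Reader, error)
	Stop() error
}

// fetchAll downloads every URI through one instance and always calls Stop
// before returning. It reports the size of each downloaded file.
func fetchAll(ctx context.Context, d downloader, uris []string) (sizes map[string]int, err error) {
	defer func() {
		// Surface a Stop error only if nothing else went wrong.
		if stopErr := d.Stop(); err == nil {
			err = stopErr
		}
	}()

	sizes = make(map[string]int, len(uris))
	for _, uri := range uris {
		r, dlErr := d.Download(ctx, uri)
		if dlErr != nil {
			return sizes, fmt.Errorf("download %s: %w", uri, dlErr)
		}
		data, readErr := io.ReadAll(r)
		if readErr != nil {
			return sizes, fmt.Errorf("read %s: %w", uri, readErr)
		}
		sizes[uri] = len(data)
	}
	return sizes, nil
}

// stubInstance is a stand-in for a browser instance; it echoes the URI
// back as the file contents and records whether Stop was called.
type stubInstance struct{ stopped bool }

func (s *stubInstance) Download(ctx context.Context, uri string) (io.Reader, error) {
	return strings.NewReader(uri), nil
}

func (s *stubInstance) Stop() error {
	s.stopped = true
	return nil
}

// demoLifecycle returns the number of files fetched and whether the
// stub was stopped afterwards.
func demoLifecycle() (int, bool) {
	stub := &stubInstance{}
	sizes, err := fetchAll(context.Background(), stub, []string{"a", "bb"})
	if err != nil {
		return 0, stub.stopped
	}
	return len(sizes), stub.stopped
}

func main() {
	n, stopped := demoLifecycle()
	fmt.Println(n, stopped)
}
```

With the real package, the value returned by `chromedl.New()` could be passed as `d`, since `*Instance` has both methods with these signatures.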

type Option ¶ added in v0.1.0

type Option func(*config)

func OptUserAgent ¶ added in v0.1.0

func OptUserAgent(ua string) Option

OptUserAgent allows setting the user agent for the browser.
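
Option follows the functional options pattern: each option is a function that mutates a configuration value. The sketch below is an illustrative re-implementation of that pattern, not the package's code. The `config` struct, its `userAgent` field, `newConfig` and `effectiveUA` are assumptions, because the real config type is unexported and its fields are not shown here.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the package's unexported config type.
type config struct {
	userAgent string
}

// Option mutates a config, matching the documented type definition.
type Option func(*config)

// OptUserAgent returns an Option that overrides the user agent.
// The body is an assumed implementation.
func OptUserAgent(ua string) Option {
	return func(c *config) { c.userAgent = ua }
}

// newConfig starts from a default and applies the options in order,
// so later options win over earlier ones.
func newConfig(defaultUA string, opts ...Option) config {
	c := config{userAgent: defaultUA}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

// effectiveUA reports the user agent that results from the given options.
func effectiveUA(opts ...Option) string {
	return newConfig("default-agent", opts...).userAgent
}

func main() {
	fmt.Println(effectiveUA())
	fmt.Println(effectiveUA(OptUserAgent("custom-agent/1.0")))
}
```

Based on the signatures documented above, a real call would look like `chromedl.Download(ctx, uri, chromedl.OptUserAgent("custom-agent/1.0"))`.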
