go-scrape

command module
v0.0.0-...-62f7b23 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 30, 2013 License: MIT Imports: 9 Imported by: 0

README

Go-scrape

A simple HTML scraper that can be used to scrape html attributes out of html and other things.

##How to use ./scrape (config) (url) Example: ./scrape configs/ahref.toml http://golang.org/doc/

Say you wanted to download all the URLs you just scraped, you'd cat the output into a file, lets call that file ouputfile, then you'd run something like wget -i outputfile to download all of them.

#####How the config files work The configs are written in toml. They're very simple and straight forward. It comes with some examples, here is an commented version of the imgsrc example included.

Tag = "img" #This is the name of the html tag, <img>, so you'll just put in img
AttributeName = "src" #This is the name of the attribute you want to scrape out of the tag you specified above
IsURL = true # This is done so that if you're scraping a URL it'll attempt to fix them up so you can download them

With IsURL: /images/image.gif gets replaced to http://example.com/images/image.gif But if you're dealing with say the width of images, you don't want something like http://exmaple.com/64 so you'd set it to false. #####Using go-deeper.py or go-deeper.sh (for the bellow, go-deeper.py should be the same as go-deeper.sh, go-deeper.py is deprecated) ./go-deeper.py (config) (file with urls) Example of usage in conjunction with scrape ./scrape configs/ahref.toml https://github.com/ | sort -u > test && ./go-deeper.py configs/imgsrc.toml test That command above will scrape content as per ahref.toml (it'll scrape the links) and then it'll remove any duplicates and then put it into a file called test. Then it'll run go-deeper.py and then for each line in test it'll run ./scrape again with the specified config (this time it'll get the images out of each of those links). the file test in this case is the file with all the urls that need to be scraped. This is useful for batch jobs as well.

Also go-deeper.py is tested in python 3, so it might not work on older versions of python.

###Tips & tricks

  • After running scrape, pipe it into sort -u to remove any duplicate lines.
  • After you've got your list of URL's to download, run wget -i download where download is the file with all the urls
  • If you want to download all the links to images on a site you can run something like ./scrape configs/ahref.toml (URL) | sort -u | grep -iE '.*?\.(jp?g|png|gif|bmp)' > download && wget -i download

##Dependancies If you want to build it on your machine, you'll need to install go, and these dependancies. They're easily installed with the following commands.

go get code.google.com/p/go.net/html
go get github.com/BurntSushi/toml

And to run the program it's as simple as go run scrape.go or if you want to make a binary, go build scrape.go and then run it like you would any other program.

##Todo

  • Add in some more tips to the tips section
  • Test for bugs

Documentation

The Go Gopher

There is no documentation for this package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL