goscholar

package module
v0.2.0 Latest
Published: Jun 18, 2016 License: MIT Imports: 13 Imported by: 1

README

goscholar

Google Scholar scraper written in Go

Install

$ go get github.com/sotetsuk/goscholar

For the command line tool:

$ go get github.com/sotetsuk/goscholar/cmd/goscholar
$ goscholar -h
Build

You can also use the build command to build the command line tool from source.

$ git clone git@github.com:sotetsuk/goscholar.git
$ goscholar/build

Options:

--dev: apply go fmt to all files and save dependencies

After the build command has run, you will find the cross-compiled binaries in the bin directory.

Features

  • API for Go
  • API for command line
  • search by keywords, title, and author
  • find by <cluster-id>
  • search the articles citing <cluster-id>
  • JSON output
  • recursive crawling is not implemented

Go API

Example
// create Query and generate URL
q := Query{Keywords: "nature 2015", Author: "y bengio", Title: "Deep learning"}
url := q.SearchUrl()

// fetch document sending the request to the URL
doc, err := Fetch(url)
if err != nil {
	log.Error(err)
	return
}

// parse articles
ch := make(chan *Article, 10)
go ParseDocument(ch, doc)
for a := range ch {
	fmt.Println("---")
	fmt.Println(a)
}

Command line API

Example
$ goscholar search --keywords "deep learning nature" --author "y bengio" --after 2015 --num 1 | jq .
[
  {
    "title": {
      "name": "Deep learning",
      "url": "http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html"
    },
    "year": "2015",
    "cluster_id": "5362332738201102290",
    "num_cite": "499",
    "num_ver": "7",
    "info_id": "0qfs6zbVakoJ",
    "link": {
      "name": "psu.edu",
      "url": "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.894&rep=rep1&type=pdf",
      "format": "PDF"
    },
    "bibtex": "@article{lecun2015deep, title={Deep learning}, author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey}, journal={Nature}, volume={521}, number={7553}, pages={436--444}, year={2015}, publisher={Nature Publishing Group}}",
    "author": [
      "LeCun, Yann",
      "Bengio, Yoshua",
      "Hinton, Geoffrey"
    ],
    "journal": "Nature",
    "booktitle": "",
    "volume": "521",
    "number": "7553",
    "pages": "436--444",
    "publisher": "Nature Publishing Group"
  }
]
$ goscholar find 15502119379559163003 | jq .
[
  {
    "title": {
      "name": "Deep learning via Hessian-free optimization",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf"
    },
    "year": "2010",
    "cluster_id": "15502119379559163003",
    "num_cite": "269",
    "num_ver": "",
    "info_id": "e6RSJHGXItcJ",
    "link": {
      "name": "wustl.edu",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf",
      "format": "PDF"
    },
    "bibtex": "@inproceedings{martens2010deep, title={Deep learning via Hessian-free optimization}, author={Martens, James}, booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)}, pages={735--742}, year={2010}}",
    "author": [
      "Martens, James"
    ],
    "journal": "",
    "booktitle": "Proceedings of the 27th International Conference on Machine Learning (ICML-10)",
    "volume": "",
    "number": "",
    "pages": "735--742",
    "publisher": ""
  }
]
$ goscholar cite 15502119379559163003 --num 1 | python -mjson.tool
[
  {
    "title": {
      "name": "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups",
      "url": "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6296526"
    },
    "year": "2012",
    "cluster_id": "3674494786452480182",
    "num_cite": "1559",
    "num_ver": "27",
    "info_id": "tmCGO4pt_jIJ",
    "link": {
      "name": "toronto.edu",
      "url": "http://www.cs.toronto.edu/~asamir/papers/SPM_DNN_12.pdf",
      "format": "PDF"
    },
    "bibtex": "@article{hinton2012deep, title={Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups}, author={Hinton, Geoffrey and Deng, Li and Yu, Dong and Dahl, George E and Mohamed, Abdel-rahman and Jaitly, Navdeep and Senior, Andrew and Vanhoucke, Vincent and Nguyen, Patrick and Sainath, Tara N and others}, journal={Signal Processing Magazine, IEEE}, volume={29}, number={6}, pages={82--97}, year={2012}, publisher={IEEE}}",
    "author": [
      "Hinton, Geoffrey",
      "Deng, Li",
      "Yu, Dong",
      "Dahl, George E",
      "Mohamed, Abdel-rahman",
      "Jaitly, Navdeep",
      "Senior, Andrew",
      "Vanhoucke, Vincent",
      "Nguyen, Patrick",
      "Sainath, Tara N",
      "others"
    ],
    "journal": "Signal Processing Magazine, IEEE",
    "booktitle": "",
    "volume": "29",
    "number": "6",
    "pages": "82--97",
    "publisher": "IEEE"
  }
]

(This article cites cluster 15502119379559163003, "Deep learning via Hessian-free optimization".)
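
Because the command line tool prints a JSON array, its output can be consumed from another Go program. The following is a minimal sketch, assuming the output has been captured as bytes. The struct definitions copy a subset of the documented Article, Title and Link types so the example compiles on its own; decodeArticles and sample are names made up for this illustration and are not part of goscholar.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Title, Link and Article mirror a subset of the documented goscholar types
// so that this sketch compiles without importing the package.
type Title struct {
	Name string `json:"name"`
	Url  string `json:"url"`
}

type Link struct {
	Name   string `json:"name"`
	Url    string `json:"url"`
	Format string `json:"format"`
}

type Article struct {
	Title     *Title   `json:"title"`
	Year      string   `json:"year"`
	ClusterId string   `json:"cluster_id"`
	NumCite   string   `json:"num_cite"`
	Link      *Link    `json:"link"`
	Author    []string `json:"author"`
}

// decodeArticles (hypothetical helper) parses the JSON array printed by the
// command line tool. Fields missing from the structs above are ignored.
func decodeArticles(data []byte) ([]Article, error) {
	var articles []Article
	if err := json.Unmarshal(data, &articles); err != nil {
		return nil, err
	}
	return articles, nil
}

// sample is an abbreviated copy of the `goscholar find` output shown above.
const sample = `[{"title": {"name": "Deep learning via Hessian-free optimization", "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf"}, "year": "2010", "cluster_id": "15502119379559163003", "num_cite": "269", "author": ["Martens, James"]}]`

func main() {
	articles, err := decodeArticles([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, a := range articles {
		fmt.Println(a.Title.Name, a.Year, a.ClusterId, a.NumCite)
	}
}
```

Note that every scalar field arrives as a JSON string, including year and num_cite, so numeric use requires an explicit conversion.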

Usage
goscholar: Google Scholar crawler and scraper written in Go

Usage:
  goscholar search [--keywords=<keywords>] [--author=<author>] [--title=<title>]
                   [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                   [--user-agent=<user-agent>]
  goscholar find <cluster-id> [--user-agent=<user-agent>]
  goscholar cite <cluster-id> [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                              [--user-agent=<user-agent>]
  goscholar -h | --help
  goscholar --version

Query-options:
  <cluster-id>
  --keywords=<keywords>
  --author=<author>
  --title=<title>

Search-options:
  --after=<year>
  --before=<year>
  --num=<num>
  --start=<start>

Others:
  -h --help
  --version

Dependencies

goscholar is inspired by scholar.py

Contribute

Contributing is more than welcome! See Issues for what is required.

License

MIT License

Documentation

Overview

Example
// create Query and generate URL
q := Query{Keywords: "nature 2015", Author: "y bengio", Title: "Deep learning"}
url := q.SearchUrl()

// fetch document sending the request to the URL
doc, err := Fetch(url)
if err != nil {
	log.Error(err)
	return
}

// parse articles
ch := make(chan *Article, 10)
go ParseDocument(ch, doc)
for a := range ch {
	fmt.Println("---")
	fmt.Println(a)
}
Constants

This section is empty.

Variables

var (
	USER_AGENT = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
)
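
The command line tool accepts a --user-agent option, and USER_AGENT holds the default value. As a rough illustration of how a user agent string is attached to a request with the standard library, here is a minimal sketch. It does not show how goscholar itself sends requests; newRequest is a hypothetical helper, and no network call is made.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgent copies the value of the documented USER_AGENT variable.
const userAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"

// newRequest is a hypothetical helper, not part of goscholar. It builds a
// GET request for rawURL and sets the User-Agent header to ua.
func newRequest(rawURL, ua string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ua)
	return req, nil
}

func main() {
	// The URL is the FindUrl example from this documentation.
	req, err := newRequest("https://scholar.google.co.jp/scholar?hl=en&cluster=5362332738201102290&num=1", userAgent)
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	// The request is only inspected here; pass it to an http.Client to send it.
	fmt.Println(req.Method, req.URL.Host)
	fmt.Println(req.UserAgent())
}
```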

Functions

func Fetch added in v0.1.0

func Fetch(url string) (doc *goquery.Document, err error)

Fetch gets a Document from a given URL. For usage, see the example in the Overview.

func ParseDocument added in v0.1.0

func ParseDocument(ch chan *Article, doc *goquery.Document)

ParseDocument sends pointers to the parsed Articles to the given channel. The channel is closed once there are no more articles to send.
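
The close-when-done behaviour is what allows the caller to drain the channel with a range loop, as the Overview example does. The following is a minimal sketch of that producer and consumer pattern using only the standard library. The item type and the produce and collect functions are hypothetical stand-ins, not goscholar's implementation.

```go
package main

import "fmt"

// item stands in for Article; it is a hypothetical type for this sketch.
type item struct{ Title string }

// produce sends one pointer per title and closes the channel when it is
// done, which lets the consumer's range loop terminate.
func produce(ch chan<- *item, titles []string) {
	defer close(ch)
	for _, t := range titles {
		ch <- &item{Title: t}
	}
}

// collect runs produce in its own goroutine and drains the channel,
// in the same way the Overview example consumes ParseDocument's output.
func collect(titles []string) []string {
	ch := make(chan *item, 10)
	go produce(ch, titles)
	var out []string
	for it := range ch {
		out = append(out, it.Title)
	}
	return out
}

func main() {
	titles := []string{"Deep learning", "Deep learning via Hessian-free optimization"}
	for _, t := range collect(titles) {
		fmt.Println(t)
	}
}
```

If the producer never closed the channel, the range loop in collect would block forever after the last item.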

Types

type Article

type Article struct {
	Title     *Title   `json:"title"`
	Year      string   `json:"year"`
	ClusterId string   `json:"cluster_id"`
	NumCite   string   `json:"num_cite"`
	NumVer    string   `json:"num_ver"`
	InfoId    string   `json:"info_id"`
	Link      *Link    `json:"link"`
	BibTeX    string   `json:"bibtex"`
	Author    []string `json:"author"`
	Journal   string   `json:"journal"`
	Booktitle string   `json:"booktitle"`
	Volume    string   `json:"volume"`
	Number    string   `json:"number"`
	Pages     string   `json:"pages"`
	Publisher string   `json:"publisher"`
}

Article stores the parsed results from Google Scholar.
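
All fields of Article other than Title, Link and Author are plain strings, and the command line examples show that some of them can be empty (for instance num_ver). Code that needs numbers therefore has to convert and handle the empty case. A minimal sketch, assuming a cut-down record type in place of Article; record, citeCount and sortByCitations are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// record keeps only the fields this sketch needs; it is a hypothetical
// stand-in for Article.
type record struct {
	Title   string
	NumCite string
}

// citeCount converts a NumCite-style string to an int. The second result
// is false when the field is empty or not a number.
func citeCount(s string) (int, bool) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, false
	}
	return n, true
}

// sortByCitations orders records by descending citation count, treating
// counts that cannot be parsed as zero.
func sortByCitations(rs []record) {
	sort.SliceStable(rs, func(i, j int) bool {
		a, _ := citeCount(rs[i].NumCite)
		b, _ := citeCount(rs[j].NumCite)
		return a > b
	})
}

func main() {
	// Titles and counts are taken from the command line examples above.
	rs := []record{
		{"Deep learning via Hessian-free optimization", "269"},
		{"Deep neural networks for acoustic modeling in speech recognition", "1559"},
		{"Deep learning", "499"},
	}
	sortByCitations(rs)
	for _, r := range rs {
		fmt.Println(r.NumCite, r.Title)
	}
}
```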

func ParseSelection added in v0.1.0

func ParseSelection(s *goquery.Selection) (a *Article, err error)

ParseSelection returns a parsed Article. If the Article is not valid (e.g., an author profile), it returns an error.

func (*Article) Json added in v0.0.2

func (a *Article) Json() string

Json returns the Article formatted as JSON.

func (*Article) String added in v0.0.2

func (a *Article) String() string

String returns a pretty-printed representation of the Article.

type Link

type Link struct {
	Name   string `json:"name"`
	Url    string `json:"url"`
	Format string `json:"format"`
}

Link is an attribute of Article.

type Query added in v0.1.0

type Query struct {
	Keywords  string
	Author    string
	Title     string
	ClusterId string
	InfoId    string
	After     string
	Before    string
	Num       string
	Start     string
}

Query issues an appropriate URL to which Fetch sends a request.

func (*Query) CitePopUpQueryUrl added in v0.1.0

func (q *Query) CitePopUpQueryUrl() (url string)

CitePopUpQueryUrl issues a URL used for getting BibTeX information. This only uses the InfoId. For example:

https://scholar.google.co.jp/scholar?q=info:XOJff8gPiHAJ:scholar.google.com/&output=cite&scirp=0&hl=en

func (*Query) CiteUrl added in v0.1.0

func (q *Query) CiteUrl() (url string)

CiteUrl issues a URL whose results are the articles citing the article identified by ClusterId. It depends on ClusterId, After, Before, Num and Start. For example:

https://scholar.google.co.jp/scholar?hl=en&cites=5362332738201102290&as_ylo=2012&as_yhi=&num=40&start=20
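
Since recursive crawling is not implemented, walking through several result pages is left to the caller. The sketch below builds one URL per page with the standard library, assuming that start is a zero-based result offset and num the page size, as the parameter names in the example URL suggest. citePageURLs is a hypothetical helper and is not how CiteUrl is implemented; in particular, url.Values.Encode orders the parameters alphabetically, so they appear in a different order than in the example above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// citePageURLs is a hypothetical helper, not part of goscholar. It returns
// one URL per result page for the articles citing clusterID, assuming that
// "start" is a zero-based offset and "num" is the number of results per page.
func citePageURLs(clusterID string, perPage, pages int) []string {
	urls := make([]string, 0, pages)
	for p := 0; p < pages; p++ {
		v := url.Values{}
		v.Set("hl", "en")
		v.Set("cites", clusterID)
		v.Set("num", strconv.Itoa(perPage))
		v.Set("start", strconv.Itoa(p*perPage))
		u := url.URL{
			Scheme:   "https",
			Host:     "scholar.google.co.jp",
			Path:     "/scholar",
			RawQuery: v.Encode(),
		}
		urls = append(urls, u.String())
	}
	return urls
}

func main() {
	for _, u := range citePageURLs("5362332738201102290", 20, 3) {
		fmt.Println(u)
	}
}
```

With the documented API, the same effect would presumably be reached by setting Query.Num and Query.Start (both strings) before each call to CiteUrl.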

func (*Query) FindUrl added in v0.1.0

func (q *Query) FindUrl() (url string)

FindUrl uses ClusterId, which identifies the desired article, and produces a URL like:

https://scholar.google.co.jp/scholar?hl=en&cluster=5362332738201102290&num=1

FindUrl depends only on ClusterId.

func (*Query) SearchUrl added in v0.1.0

func (q *Query) SearchUrl() (url string)

SearchUrl issues a URL whose search query is composed of keywords, author and title. SearchUrl uses the Keywords, Author, Title, After, Before, Num and Start fields. For example:

https://scholar.google.co.jp/scholar?hl=en&q=deep+learning+author:"y+bengio"&as_ylo=2015&as_yhi=&num=100&start=20
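
To make the structure of that URL concrete, the following sketch composes an equivalent one with net/url. It is an illustration only, not SearchUrl's implementation: buildSearchQuery, searchURL and queryParam are hypothetical helpers, and the title is left out because the example URL does not show how a title is encoded. Unlike the example, url.Values.Encode percent-encodes the colon and the quotes and sorts the parameters, which yields the same query once decoded.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildSearchQuery (hypothetical) joins the keywords with an author:"..."
// operator, following the q parameter of the example URL.
func buildSearchQuery(keywords, author string) string {
	var parts []string
	if keywords != "" {
		parts = append(parts, keywords)
	}
	if author != "" {
		parts = append(parts, `author:"`+author+`"`)
	}
	return strings.Join(parts, " ")
}

// searchURL (hypothetical) assembles the parameters seen in the example:
// hl, q, as_ylo (after), as_yhi (before, left empty here), num and start.
func searchURL(keywords, author, after, num, start string) string {
	v := url.Values{}
	v.Set("hl", "en")
	v.Set("q", buildSearchQuery(keywords, author))
	v.Set("as_ylo", after)
	v.Set("as_yhi", "")
	v.Set("num", num)
	v.Set("start", start)
	u := url.URL{
		Scheme:   "https",
		Host:     "scholar.google.co.jp",
		Path:     "/scholar",
		RawQuery: v.Encode(),
	}
	return u.String()
}

// queryParam (hypothetical) decodes one query parameter from rawURL.
func queryParam(rawURL, key string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Query().Get(key), nil
}

func main() {
	u := searchURL("deep learning", "y bengio", "2015", "100", "20")
	fmt.Println(u)
	q, err := queryParam(u, "q")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(q)
}
```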

type Title added in v0.1.0

type Title struct {
	Name string `json:"name"`
	Url  string `json:"url"`
}

Title is an attribute of Article.

Directories

Path Synopsis
cmd
