diffbot

package
v0.0.0-...-9e5a85d
Published: May 8, 2023 License: MIT Imports: 7 Imported by: 0

README

Diffbot API Go client

This package implements a Diffbot client library.

Install

go get github.com/diffbot/diffbot-go-client

The package documentation is available at godoc.org or gowalker.org.

Please report bugs to chaishushan@gmail.com.

Thanks!

Documentation

Overview

Package diffbot implements a Diffbot client library.

Using AI, computer vision, machine learning and natural language processing, Diffbot provides developers with numerous tools to understand and extract data from any web page.

Generic API

The basic API is diffbot.DiffbotServer:

package main

import (
	"fmt"
	"log"

	diffbot "github.com/diffbot/diffbot-go-client"
)

var (
	token = `0123456789abcdef0123456789abcdef` // invalid token, just an example
	url   = `http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/`
)

func main() {
	respBody, err := diffbot.DiffbotServer(diffbot.DefaultServer, "article", token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(respBody))
}

The diffbot.Diffbot function uses diffbot.DefaultServer as the server.
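
As a rough illustration of what such a call amounts to, the sketch below builds a request URL from a server, a method, a token and a page URL. The helper name buildRequestURL is hypothetical and is not part of the package; the layout server/method?token=...&url=... is an assumption based on the endpoints documented for the individual APIs.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRequestURL is a hypothetical helper, not part of the diffbot package.
// It joins server and method, then appends token and url as escaped
// query parameters.
func buildRequestURL(server, method, token, pageURL string) string {
	q := url.Values{}
	q.Set("token", token)
	q.Set("url", pageURL)
	// Encode escapes the values and sorts the keys alphabetically.
	return server + "/" + method + "?" + q.Encode()
}

func main() {
	fmt.Println(buildRequestURL("http://api.diffbot.com/v2", "article", "TOKEN", "http://example.com/a"))
}
```

The page URL is percent-encoded because it travels as a query parameter value.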

Article API

The Article API uses diffbot.Diffbot to invoke the "article" method and converts the response body to a diffbot.Article struct.

func main() {
	article, err := diffbot.ParseArticle(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(article)
}

Frontpage API

The Frontpage API uses diffbot.Diffbot to invoke the "frontpage" method and converts the response body to a diffbot.Frontpage struct.

func main() {
	page, err := diffbot.ParseFrontpage(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(page)
}

Image API

The Image API uses diffbot.Diffbot to invoke the "image" method and converts the response body to a diffbot.Image struct.

func main() {
	imgInfo, err := diffbot.ParseImage(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(imgInfo)
}

Product API

The Product API uses diffbot.Diffbot to invoke the "product" method and converts the response body to a diffbot.Product struct.

func main() {
	product, err := diffbot.ParseProduct(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(product)
}

Classification API

The Classification API uses diffbot.Diffbot to invoke the "analyze" method and converts the response body to a diffbot.Classification struct.

func main() {
	info, err := diffbot.ParseClassification(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info)
}

Options

We use `diffbot.Options` to specify the options:

func main() {
	opt := &diffbot.Options{
		Fields:   "meta,querystring,images(*)",
		Timeout:  time.Millisecond * 1000,
		Callback: "",
	}
	fmt.Println(opt.MethodParamString("article"))
	// Output:
	// &fields=meta,querystring,images(*)&timeout=1000
}
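
To make the output format above concrete, here is a minimal sketch of how such a parameter string could be assembled. The function paramString is a hypothetical stand-in for Options.MethodParamString; the real method may handle more options and may escape values differently.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// paramString is a hypothetical stand-in for Options.MethodParamString.
// It appends only the options that are set, each prefixed with "&".
func paramString(fields string, timeout time.Duration, callback string) string {
	var b strings.Builder
	if fields != "" {
		b.WriteString("&fields=" + fields)
	}
	if timeout > 0 {
		// The API expects the timeout in milliseconds.
		fmt.Fprintf(&b, "&timeout=%d", timeout.Milliseconds())
	}
	if callback != "" {
		b.WriteString("&callback=" + callback)
	}
	return b.String()
}

func main() {
	fmt.Println(paramString("meta,querystring,images(*)", time.Millisecond*1000, ""))
}
```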

You can call Diffbot with custom headers:

func main() {
	opt := &diffbot.Options{}
	opt.CustomHeader.Add("X-Forward-Cookie", "abc=123")
	respBody, err := diffbot.Diffbot("article", token, url, opt)
	...
}

Error handling

If the Diffbot server returns an error message, it is converted to a diffbot.Error:

func main() {
	respBody, err := diffbot.Diffbot("article", token, url, nil)
	if err != nil {
		if apiErr, ok := err.(*diffbot.Error); ok {
			// API error, e.g. {"error":"Not authorized API token.","errorCode":401}
			log.Fatalf("diffbot API error %d: %s", apiErr.ErrCode, apiErr.ErrMessage)
		}
		log.Fatal(err)
	}
	fmt.Println(string(respBody))
}
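
If the error has been wrapped by calling code, a direct type assertion no longer matches. The sketch below shows the errors.As alternative from the standard library. The apiError type is a local stand-in that mirrors the ErrCode and ErrMessage fields of diffbot.Error so the example runs without the package; it is not the package's own type.

```go
package main

import (
	"errors"
	"fmt"
)

// apiError is a local stand-in for diffbot.Error (an assumption made so
// that the example is self-contained).
type apiError struct {
	ErrCode    int
	ErrMessage string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("diffbot: %s (code %d)", e.ErrMessage, e.ErrCode)
}

// errorCode returns the API error code carried anywhere in err's chain,
// or 0 if the chain contains no *apiError.
func errorCode(err error) int {
	var ae *apiError
	if errors.As(err, &ae) {
		return ae.ErrCode
	}
	return 0
}

func main() {
	base := &apiError{ErrCode: 401, ErrMessage: "Not authorized API token."}
	wrapped := fmt.Errorf("fetch article: %w", base)
	fmt.Println(errorCode(wrapped))
}
```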

Other

The Diffbot API documentation is at http://diffbot.com/dev/docs/ or http://diffbot.com/products/.

Please report bugs to <chaishushan@gmail.com>.

Index

Constants

const (
	DefaultServer = `http://api.diffbot.com/v2`
)

Variables

This section is empty.

Functions

func Diffbot

func Diffbot(method, token, url string, opt *Options) (body []byte, err error)

Diffbot uses computer vision, natural language processing and machine learning to automatically recognize and structure specific page-types.

func DiffbotServer

func DiffbotServer(server, method, token, url string, opt *Options) (body []byte, err error)

DiffbotServer is like the Diffbot function, but supports a custom server.

Types

type Article

type Article struct {
	Url           string                 `json:"url"`
	ResolvedUrl   string                 `json:"resolved_url"`
	Icon          string                 `json:"icon"`
	Meta          map[string]interface{} `json:"meta,omitempty"`        // Returned with fields.
	QueryString   string                 `json:"querystring,omitempty"` // Returned with fields.
	Links         []string               `json:"links,omitempty"`       // Returned with fields.
	Type          string                 `json:"type"`
	Title         string                 `json:"title"`
	Text          string                 `json:"text"`
	Html          string                 `json:"html"`
	NumPages      string                 `json:"numPages"`
	Date          string                 `json:"date"`
	Author        string                 `json:"author"`
	Tags          []string               `json:"tags,omitempty"`          // Returned with fields.
	HumanLanguage string                 `json:"humanLanguage,omitempty"` // Returned with fields.
	Images        []struct {
		Url         string `json:"url"`
		PixelHeight int    `json:"pixelHeight"`
		PixelWidth  int    `json:"pixelWidth"`
		Caption     string `json:"caption"`
		Primary     string `json:"primary"`
	} `json:"images"`
	Videos []struct {
		Url         string `json:"url"`
		PixelHeight int    `json:"pixelHeight"`
		PixelWidth  int    `json:"pixelWidth"`
		Primary     string `json:"primary"`
	} `json:"videos"`
}

Article represents the clean text of an article.

See http://diffbot.com/dev/docs/article/

func ParseArticle

func ParseArticle(token, url string, opt *Options) (*Article, error)

ParseArticle parses the clean article text from news article web pages.

Request

To use the Article API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/article

Provide the following arguments:

+----------+-----------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                     |
+----------+-----------------------------------------------------------------+
| token    | Developer token                                                 |
| url      | Article URL to process (URL encoded).                           |
|          | If you wish to POST content, please see POSTing Content, below. |
+----------+-----------------------------------------------------------------+
| Optional arguments                                                         |
+----------+-----------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.           |
|          | See the Response section below.                                 |
| timeout  | Set a value in milliseconds to terminate the response.          |
|          | By default the Article API has a five second timeout.           |
| callback | Use for jsonp requests. Needed for cross-domain ajax.           |
+----------+-----------------------------------------------------------------+

Response

The Article API returns information about the primary article content on the submitted page.

Use the fields query parameter to limit or expand which fields are returned in the JSON response. For nested arrays, use parentheses to retrieve specific fields, or * to return all sub-fields.

http://api.diffbot.com/v2/article?...&fields=meta,querystring,images(*)

Example Response

This is a simple response:

{
  "type": "article",
  "icon": "http://www.diffbot.com/favicon.ico",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "date": "Wed, 31 Jul 2013 08:00:00 GMT",
  "videos": [
    {
      "primary": "true",
      "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"
    }
  ],
  "tags": [
    "e-commerce",
    "SaaS"
  ],
  "url": "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
  "humanLanguage": "en",
  "text": "Diffbot's human wranglers are proud today to announce the release of our newest product..."
}
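
A response of this shape can be decoded with encoding/json. The sketch below uses a trimmed-down struct, articleSummary, which is hypothetical and covers only a few of the fields listed in the Article type; the sample constant is a shortened copy of the example response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSummary is a hypothetical, reduced view of an Article API response.
type articleSummary struct {
	Type   string   `json:"type"`
	Title  string   `json:"title"`
	Author string   `json:"author"`
	Tags   []string `json:"tags,omitempty"`
	Videos []struct {
		URL     string `json:"url"`
		Primary string `json:"primary"`
	} `json:"videos"`
}

// sample is a shortened copy of the example response.
const sample = `{
  "type": "article",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "videos": [
    {"primary": "true", "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"}
  ],
  "tags": ["e-commerce", "SaaS"]
}`

// decodeArticle unmarshals a JSON response body into an articleSummary.
func decodeArticle(data []byte) (*articleSummary, error) {
	var a articleSummary
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

func main() {
	a, err := decodeArticle([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(a.Title, a.Author, a.Tags)
}
```

Fields absent from the response are left at their zero values, which is why the optional fields in the Article type carry omitempty.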

Authentication and Custom Headers

You can supply Diffbot with custom headers, or basic authentication credentials, in order to access intranet pages or other sites that require a login.

Basic Authentication
To access pages that require a login/password (using basic access authentication),
include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com.
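
A minimal sketch of building such a url parameter with net/url follows. The helper authURLParam is hypothetical. Note that url.QueryEscape also escapes the ':' and '@' characters, which the example above leaves as-is; the sketch assumes the server decodes the parameter in the usual way.

```go
package main

import (
	"fmt"
	"net/url"
)

// authURLParam is a hypothetical helper that embeds basic-auth credentials
// in the target URL and returns it as an escaped "url=" query parameter.
func authURLParam(scheme, host, user, pass string) string {
	u := url.URL{
		Scheme: scheme,
		Host:   host,
		User:   url.UserPassword(user, pass),
	}
	return "url=" + url.QueryEscape(u.String())
}

func main() {
	fmt.Println(authURLParam("http", "www.diffbot.com", "USERNAME", "PASSWORD"))
}
```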

Custom Headers
You can supply the Article API with custom values for the user-agent, referer,
or cookie values in the HTTP request. These will be used in place of the Diffbot default values.

To provide custom headers, pass in the following values in your own headers when calling the Diffbot API:

+----------------------+-----------------------------------------------------------------------+
| HEADER               | DESCRIPTION                                                           |
+----------------------+-----------------------------------------------------------------------+
| X-Forward-User-Agent | Will be used as Diffbot's User-Agent header when making your request. |
| X-Forward-Referer    | Will be used as Diffbot's Referer header when making your request.    |
| X-Forward-Cookie     | Will be used as Diffbot's Cookie header when making your request.     |
+----------------------+-----------------------------------------------------------------------+
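
For callers that talk to the endpoint directly with net/http, the headers in the table above could be set as in the following sketch. The helper newForwardingRequest is hypothetical; it only builds the request and does not send it.

```go
package main

import (
	"fmt"
	"net/http"
)

// newForwardingRequest is a hypothetical helper. It builds a GET request
// and sets only the X-Forward-* headers for which a value was given.
func newForwardingRequest(endpoint, userAgent, referer, cookie string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	if userAgent != "" {
		req.Header.Set("X-Forward-User-Agent", userAgent)
	}
	if referer != "" {
		req.Header.Set("X-Forward-Referer", referer)
	}
	if cookie != "" {
		req.Header.Set("X-Forward-Cookie", cookie)
	}
	return req, nil
}

func main() {
	req, err := newForwardingRequest("http://api.diffbot.com/v2/article?token=TOKEN", "", "", "abc=123")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("X-Forward-Cookie"))
}
```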

Posting Content

If your content is not publicly available (e.g., behind a firewall), you can POST markup directly to the Article API endpoint for analysis:

http://api.diffbot.com/v2/article?token=...&url=...

Please note that the url parameter is still required in the endpoint, and will be used to resolve any relative links contained in the markup.

Provide markup to analyze as your POST body, and specify the Content-Type header as text/html.

The following call submits a sample to the API:

curl \
    -H "Content-Type: text/html" \
    -d 'Now is the time for all good robots to come to the aid of their-- oh never mind, run!' \
    'http://api.diffbot.com/v2/article?token=...&url=http://www.diffbot.com/products/automatic/article'
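
The same POST can be prepared in Go with net/http. This is a sketch under the assumptions stated in the text above (token and url in the query string, markup in the body, Content-Type set to text/html); the helper newPostRequest is hypothetical and only constructs the request.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newPostRequest is a hypothetical helper. It builds a POST request whose
// body is the markup to analyze; token and url go in the query string.
func newPostRequest(endpoint, token, pageURL, markup string) (*http.Request, error) {
	q := url.Values{}
	q.Set("token", token)
	q.Set("url", pageURL)
	req, err := http.NewRequest(http.MethodPost, endpoint+"?"+q.Encode(), strings.NewReader(markup))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "text/html")
	return req, nil
}

func main() {
	req, err := newPostRequest(
		"http://api.diffbot.com/v2/article",
		"TOKEN", // placeholder, not a valid token
		"http://www.diffbot.com/products/automatic/article",
		"Now is the time for all good robots to come to the aid of their-- oh never mind, run!",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the request to http.DefaultClient.Do would send it.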

See http://diffbot.com/dev/docs/article/.

func (*Article) String

func (p *Article) String() string

type Classification

type Classification struct {
	Type          string `json:"type"`
	Title         string `json:"title"`
	Url           string `json:"url"`
	ResolvedUrl   string `json:"resolved_url,omitempty"` // Returned with fields.
	HumanLanguage string `json:"humanLanguage"`
	Stats         struct {
		Types struct {
			Article     float64 `json:"article"`
			Audio       float64 `json:"audio"`
			Chart       float64 `json:"chart"`
			Discussion  float64 `json:"discussion"`
			Document    float64 `json:"document"`
			Download    float64 `json:"download"`
			Error       float64 `json:"error"`
			Event       float64 `json:"event"`
			Faq         float64 `json:"faq"`
			Frontpage   float64 `json:"frontpage"`
			Game        float64 `json:"game"`
			Image       float64 `json:"image"`
			Job         float64 `json:"job"`
			Location    float64 `json:"location"`
			Other       float64 `json:"other"`
			Product     float64 `json:"product"`
			Profile     float64 `json:"profile"`
			Recipe      float64 `json:"recipe"`
			ReviewsList float64 `json:"reviewslist"`
			Serp        float64 `json:"serp"`
			Video       float64 `json:"video"`
		} `json:"types"`
	} `json:"stats"`
}

Classification represents the classification information of a page.

See http://diffbot.com/dev/docs/analyze/

func ParseClassification

func ParseClassification(token, url string, opt *Options) (*Classification, error)

ParseClassification analyzes a web page's layout, structure, markup, text and other components and classifies the page as a particular "type." It also fully extracts the page contents if the page matches an existing Diffbot extraction API.

Please note: The Page Classifier API is currently in beta.

Request

To use the Classifier API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/analyze?token=...&url=...

Provide the following arguments:

+----------+----------------------------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                                                  |
+----------+----------------------------------------------------------------------------------------------+
| token    | Developer token                                                                              |
| url      | URL to classify (URLEncoded)                                                                 |
+----------+----------------------------------------------------------------------------------------------+
| Optional arguments                                                                                      |
+----------+----------------------------------------------------------------------------------------------+
| mode     | By default the Page Classifier API will fully extract                                        |
|          | pages that match an existing Diffbot Automatic API.                                          |
|          | Set mode to a specific page-type (e.g., mode=article)                                        |
|          | to extract content only from that particular page-type.                                      |
|          | All others will simply return the page classification information.                           |
| fields   | You can choose the fields to be returned                                                     |
|          | by the Diffbot extraction API by supplying a comma-separated                                 |
|          | list of fields, e.g.:                                                                        |
|          | http://api.diffbot.com/v2/analyze?token=...&url=http://diffbot.com/company&fields=meta,tags. |
| stats    | Returns statistics on page classification and extraction,                                    |
|          | including an array of individual page-types and                                              |
|          | the Diffbot-determined score (likelihood) for each type.                                     |
+----------+----------------------------------------------------------------------------------------------+
| Basic authentication                                                                                    |
+---------------------------------------------------------------------------------------------------------+
| To access pages that require a login/password                                                           |
| (using basic access authentication), include the username and password                                  |
| in your url parameter,                                                                                  |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                                                |
+---------------------------------------------------------------------------------------------------------+

Response

The Classifier API returns, depending on parameters, the following:

+----------------+------------------------------------------------------------------+
| FIELD          | DESCRIPTION                                                      |
+----------------+------------------------------------------------------------------+
| type           | Page-type of the submitted URL (from the below enumerated list). |
|                | Always returned.                                                 |
| title          | Page title. Returned by default.                                 |
| url            | Submitted URL. Always returned.                                  |
| resolved_url   | Returned if the resolving URL is different from                  |
|                | the submitted URL (e.g., link shortening services).              |
|                | Returned by default, configurable with fields                    |
| human_language | Returns the (spoken/human) language of the submitted URL,        |
|                | using two-letter ISO 639-1 nomenclature.                         |
|                | Returned by default.                                             |
+----------------+------------------------------------------------------------------+

Example Response

This is a simple response:

{
  "type": "article",
  "stats": {
     "types": {
        "article": 0.46,
        "audio": 0.15,
        "chart": 0.01,
        "discussion": 0.03,
        "document": 0.04,
        "download": 0.01,
        "error": 0.00,
        "event": 0.00,
        "faq": 0.02,
        "frontpage": 0.12,
        "game": 0.01,
        "image": 0.02,
        "job": 0.02,
        "location": 0.08,
        "product": 0.09,
        "profile": 0.09,
        "recipe": 0.08,
        "reviewslist": 0.09,
        "serp": 0.06,
        "video": 0.01
      }
    },
  "resolved_url": "http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/",
  "url": "http://tcrn.ch/Jw7ZKw",
  "human_language": "en"
}
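
When stats are requested, a caller may want the page-type with the highest score. The sketch below decodes the types object into a map rather than into the fixed fields of the Classification struct; that choice, the classification type and the topType helper are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// classification is a hypothetical, map-based view of the response.
type classification struct {
	Type  string `json:"type"`
	Stats struct {
		Types map[string]float64 `json:"types"`
	} `json:"stats"`
}

// topType returns the page-type with the highest score. Names are visited
// in sorted order so that ties resolve deterministically.
func topType(scores map[string]float64) (string, float64) {
	names := make([]string, 0, len(scores))
	for n := range scores {
		names = append(names, n)
	}
	sort.Strings(names)
	best, bestScore := "", -1.0
	for _, n := range names {
		if scores[n] > bestScore {
			best, bestScore = n, scores[n]
		}
	}
	return best, bestScore
}

func main() {
	data := []byte(`{"type":"article","stats":{"types":{"article":0.46,"audio":0.15,"frontpage":0.12}}}`)
	var c classification
	if err := json.Unmarshal(data, &c); err != nil {
		panic(err)
	}
	name, score := topType(c.Stats.Types)
	fmt.Println(name, score)
}
```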

Page Types

Diffbot currently classifies pages into the following types. Please note this list will evolve over time to include additional page types.

+-------------+-----------------------------------------------------------------------------------+
| PAGE TYPE   | DESCRIPTION                                                                       |
+-------------+-----------------------------------------------------------------------------------+
| None        | Returned if Diffbot confidence in the page classification is low.                 |
|             | Use of the stats field will give you the individual scores for each page-type.    |
| article     | A news article, blog post or other primarily-text page.                           |
| audio       | A music or audio player.                                                          |
| chart       | A graph or chart, typically financial.                                            |
| discussion  | Specific forum, group or discussion topic.                                        |
| document    | An embedded or downloadable document or slideshow.                                |
| download    | A downloadable file.                                                              |
| error       | Error page, e.g. 404.                                                             |
| event       | A page detailing specific event information,                                      |
|             | e.g. time/date/location.                                                          |
| faq         | A page of multiple frequently asked questions, or a single FAQ entry.             |
| frontpage   | A news- or blog-style home page, with links to myriad sections and items.         |
| game        | A playable game.                                                                  |
| image       | An image or photo page.                                                           |
| job         | A job posting.                                                                    |
| location    | A page detailing location information, typically including an address and/or map. |
| other       | Returned if the result is below a certain confidence threshold.                   |
| product     | A product page, typically of a product for purchase.                              |
| profile     | A person or user profile page.                                                    |
| recipe      | Page detailing recipe instructions and ingredients.                               |
| reviewslist | A list of user reviews.                                                           |
| serp        | A Search Engine Results Page                                                      |
| video       | An individual video.                                                              |
+-------------+-----------------------------------------------------------------------------------+

func (*Classification) String

func (p *Classification) String() string

type Error

type Error struct {
	ErrCode    int    `json:"errorCode"` // Error code per the chart below
	ErrMessage string `json:"error"`     // Description of the error
	RawString  string `json:"-"`         // Raw json format error string
}

Error represents an error returned by the Diffbot APIs.

When issues arise, Diffbot APIs return the following fields in a JSON response.

Sample error response:

{
	"error": "Could not download page (404)",
	"errorCode": 404
}

Possible errors returned:

+------+-----------------------------------------------------------------------------------------------------+
| CODE | DESCRIPTION                                                                                         |
+------+-----------------------------------------------------------------------------------------------------+
| 401  | Unauthorized token                                                                                  |
| 404  | Requested page not found                                                                            |
| 429  | Your token has exceeded the allowed number of calls, or has otherwise been throttled for API abuse. |
| 500  | Error processing the page. Specific information will be returned in the JSON response.              |
+------+-----------------------------------------------------------------------------------------------------+

func (*Error) Error

func (p *Error) Error() string

func (*Error) ParseJson

func (p *Error) ParseJson(s string) error

ParseJson parses the JSON-encoded error data.
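
As an illustration of what parsing such an error involves, the sketch below decodes the sample error response with encoding/json. The apiError type and parseAPIError function are local stand-ins that mirror the documented fields; the actual behavior of ParseJson may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError is a local stand-in that mirrors the fields of diffbot.Error.
type apiError struct {
	ErrCode    int    `json:"errorCode"`
	ErrMessage string `json:"error"`
	RawString  string `json:"-"` // kept out of JSON, holds the raw input
}

// parseAPIError decodes a JSON error body and keeps the raw string.
func parseAPIError(s string) (*apiError, error) {
	e := &apiError{RawString: s}
	if err := json.Unmarshal([]byte(s), e); err != nil {
		return nil, err
	}
	return e, nil
}

func main() {
	e, err := parseAPIError(`{"error": "Could not download page (404)", "errorCode": 404}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.ErrCode, e.ErrMessage)
}
```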

type Frontpage

type Frontpage struct {
	Id        int64  `json:"id,string"`
	Title     string `json:"title"`
	SourceURL string `json:"sourceURL"`
	Icon      string `json:"icon"`
	NumItems  int    `json:"numItems"`
	Items     []struct {
		Id          int     `json:"id"`
		Title       string  `json:"title"`
		Description string  `json:"description"`
		XRoot       string  `json:"xroot"`
		PubDate     string  `json:"pubDate"`
		Link        string  `json:"link"`
		Type        string  `json:"type"` // STORY/LINK/...
		Img         string  `json:"img"`
		TextSummary string  `json:"textSummary"`
		Sp          float64 `json:"sp"`
		Sr          float64 `json:"sr"`
		Fresh       float64 `json:"fresh"`
	} `json:"items,omitempty"`
}

Frontpage represents frontpage information.

See http://diffbot.com/dev/docs/frontpage/

func ParseFrontpage

func ParseFrontpage(token, url string, opt *Options) (*Frontpage, error)

ParseFrontpage parses a multifaceted "homepage" and returns individual page elements.

Request

To use the Frontpage API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/frontpage

Provide the following arguments:

+----------+----------------------------------------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                                                              |
+----------+----------------------------------------------------------------------------------------------------------+
| token    | Developer token                                                                                          |
| url      | Frontpage URL from which to extract items                                                                |
+----------+----------------------------------------------------------------------------------------------------------+
| Optional arguments                                                                                                  |
+----------+----------------------------------------------------------------------------------------------------------+
| timeout  | Specify a value in milliseconds (e.g., &timeout=15000) to override the default API timeout of 5000ms.    |
| format   | Format the response output in xml (default) or json                                                      |
| all      | Returns all content from page, including navigation and similar links that the Diffbot visual processing |
|          | engine considers less important / non-core.                                                              |
+----------+----------------------------------------------------------------------------------------------------------+
| Basic authentication                                                                                                |
+---------------------------------------------------------------------------------------------------------------------+
| To access pages that require a login/password (using basic access authentication),                                  |
| include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com   |
+---------------------------------------------------------------------------------------------------------------------+

Alternatively, you can POST the content to analyze directly to the same endpoint. Specify the Content-Type header as either text/plain or text/html.

Response

DML (Diffbot Markup Language) is an XML format for encoding the extracted structural information from the page. A DML consists of a single info section and a list of items.

+-------------+--------+------------------------------------------------------+
| INFO FIELD  | TYPE   | DESCRIPTION                                          |
+-------------+--------+------------------------------------------------------+
| id          | long   | DMLID of the URL                                     |
| title       | string | Extracted title of the page                          |
| sourceURL   | url    | the URL this was extracted from                      |
| icon        | url    | A link to a small icon/favicon representing the page |
| numItems    | int    | The number of items in this DML document             |
+-------------+--------+------------------------------------------------------+

Some of the fields found in Items:

+-------------+--------------------------+--------------------------------------------------------------------+
| ITEM FIELD  | TYPE                     | DESCRIPTION                                                        |
+-------------+--------------------------+--------------------------------------------------------------------+
| id          | long                     | Unique hashcode/id of item                                         |
| title       | string                   | Title of item                                                      |
| description | string                   | innerHTML content of item                                          |
| xroot       | xpath                    | XPATH of where item was found on the page                          |
| pubDate     | timestamp                | Timestamp when item was detected on page                           |
| link        | URL                      | Extracted permalink (if applicable) of item                        |
| type        | {IMAGE,LINK,STORY,CHUNK} | Extracted type of the item, whether the item represents an image,  |
|             |                          | permalink, story (image+summary), or html chunk.                   |
| img         | URL                      | Extracted image from item                                          |
| textSummary | string                   | A plain-text summary of the item                                   |
| sp          | double<-[0,1]            | Spam score - the probability that the item is spam/ad              |
| sr          | double<-[1,5]            | Static rank - the quality score of the item on a 1 to 5 scale      |
| fresh       | double<-[0,1]            | Fresh score - the percentage of the item that has changed          |
|             |                          | compared to the previous crawl                                     |
+-------------+--------------------------+--------------------------------------------------------------------+
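
The sp and sr scores lend themselves to simple filtering. The sketch below keeps items with a low spam score and a sufficient static rank. The item type, the keepItems helper and the thresholds used in main are all hypothetical choices for illustration.

```go
package main

import "fmt"

// item is a hypothetical, reduced view of a frontpage item.
type item struct {
	Title string
	Sp    float64 // spam score in [0,1]
	Sr    float64 // static rank in [1,5]
}

// keepItems returns the items whose spam score is at most maxSpam and
// whose static rank is at least minRank, preserving their order.
func keepItems(items []item, maxSpam, minRank float64) []item {
	var out []item
	for _, it := range items {
		if it.Sp <= maxSpam && it.Sr >= minRank {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	items := []item{
		{Title: "Story A", Sp: 0.0, Sr: 4.0},
		{Title: "Sponsored link", Sp: 0.9, Sr: 2.0},
		{Title: "Story B", Sp: 0.1, Sr: 1.5},
	}
	for _, it := range keepItems(items, 0.2, 3.0) {
		fmt.Println(it.Title)
	}
}
```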

See http://diffbot.com/dev/docs/frontpage/.

func (*Frontpage) ParseDML

func (p *Frontpage) ParseDML(dml *FrontpageDML) error

func (*Frontpage) String

func (p *Frontpage) String() string

type FrontpageDML

type FrontpageDML struct {
	Id         int64  `json:"id,string"`
	TagName    string `json:"tagName"` // dml
	ChildNodes []struct {
		TagName          string `json:"tagName"`             // info/item/...
		ItemId           int64  `json:"id,string"`           // item.id, eg. "180194704"
		ItemSp           string `json:"sp"`                  // item.sp, eg. "0.000"
		ItemFresh        string `json:"fresh"`               // item.fresh, eg. "1.000"
		ItemSr           string `json:"sr"`                  // item.sr, eg. "4.000"
		ItemCluster      string `json:"cluster"`             // item.cluster, eg. "/HTML[1]/BODY[1]/DIV[4]/..."
		ItemCommentCount int64  `json:"commentCount,string"` // item.commentCount, eg. "34"
		ItemType         string `json:"type"`                // item.type, eg. "STORY"
		ItemXRoot        string `json:"xroot"`               // item.xroot, eg. "/HTML[1]/BODY[1]/DIV[4]/..."
		ChildNodes       []struct {
			TagName    string   `json:"tagName"` // title/sourceType/...
			ChildNodes []string `json:"childNodes"`
		} `json:"childNodes"`
	} `json:"childNodes"`
}

FrontpageDML (Diffbot Markup Language) is an XML format for encoding the extracted structural information from the page. A DML consists of a single info section and a list of items.

See http://diffbot.com/products/automatic/frontpage/

func (*FrontpageDML) ParseJson

func (p *FrontpageDML) ParseJson(data []byte) error

func (*FrontpageDML) String

func (p *FrontpageDML) String() string

type Image

type Image struct {
	Title       string                 `json:"title"`
	NextPage    string                 `json:"nextPage"`
	AlbumUrl    string                 `json:"albumUrl"`
	Url         string                 `json:"url"`
	ResolvedUrl string                 `json:"resolved_url"`
	Meta        map[string]interface{} `json:"meta,omitempty"`        // Returned with fields.
	QueryString string                 `json:"querystring,omitempty"` // Returned with fields.
	Links       []string               `json:"links,omitempty"`       // Returned with fields.
	Images      []struct {
		Url           string   `json:"url"`
		AnchorUrl     string   `json:"anchorUrl"`
		Mime          string   `json:"mime,omitempty"` // Returned with fields.
		Caption       string   `json:"caption"`
		AttrAlt       string   `json:"attrAlt,omitempty"`   // Returned with fields.
		AttrTitle     string   `json:"attrTitle,omitempty"` // Returned with fields.
		Date          string   `json:"date"`
		Size          int      `json:"size"`
		PixelHeight   int      `json:"pixelHeight"`
		PixelWidth    int      `json:"pixelWidth"`
		DisplayHeight int      `json:"displayHeight,omitempty"` // Returned with fields.
		DisplayWidth  int      `json:"displayWidth,omitempty"`  // Returned with fields.
		Meta          []string `json:"meta"`
		Faces         []string `json:"faces,omitempty"`  // Returned with fields.
		Ocr           string   `json:"ocr,omitempty"`    // Returned with fields.
		Colors        string   `json:"colors,omitempty"` // Returned with fields.
		XPath         string   `json:"xpath"`
	} `json:"images"`
}

Image represents the image information of a page.

See http://diffbot.com/dev/docs/image/

func ParseImage

func ParseImage(token, url string, opt *Options) (*Image, error)

ParseImage parses a web page and returns its primary image(s).

Request

To use the Image API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/image

Provide the following arguments:

+----------+-------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                             |
+----------+-------------------------------------------------------------------------+
| token    | Developer token                                                         |
| url      | Page URL to process (URL encoded)                                       |
+----------+-------------------------------------------------------------------------+
| Optional arguments                                                                 |
+----------+-------------------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.                   |
|          | See the Response section below.                                         |
| timeout  | Set a value in milliseconds to terminate the response.                  |
|          | By default the Image API has no timeout.                                |
| callback | Use for jsonp requests. Needed for cross-domain ajax.                   |
+----------+-------------------------------------------------------------------------+
| Basic authentication                                                               |
+------------------------------------------------------------------------------------+
| To access pages that require a login/password (using basic access authentication), |
| include the username and password in your url parameter,                           |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                           |
+------------------------------------------------------------------------------------+
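The basic-authentication note above can be sketched with the standard net/url package. The helper name withBasicAuth is hypothetical and not part of this package. Note that url.QueryEscape also percent-encodes ':' and '@', whereas the table's example leaves them raw; both forms should decode to the same URL on the server side, but that is an assumption about the service rather than something this documentation states.

```go
package main

import (
	"fmt"
	"net/url"
)

// withBasicAuth is a hypothetical helper (not part of this package).
// It embeds the credentials in pageURL and returns the result
// percent-encoded, ready to be used as the value of the url argument.
func withBasicAuth(pageURL, user, pass string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return url.QueryEscape(u.String()), nil
}

func main() {
	v, err := withBasicAuth("http://www.diffbot.com", "USERNAME", "PASSWORD")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// prints url=http%3A%2F%2FUSERNAME%3APASSWORD%40www.diffbot.com
	fmt.Println("url=" + v)
}
```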

Response

The Image API returns basic info about the page submitted, and its primary image(s) in the images array.

Use the fields query parameter to limit or expand which fields are returned in the JSON response. To control the fields returned for images, your desired fields should be contained within the 'images' parentheses:

http://api.diffbot.com/v2/image...&fields=images(mime,pixelWidth)

Response fields:

+---------------+------------------------------------------------------------------------+
| FIELD         | DESCRIPTION                                                            |
+---------------+------------------------------------------------------------------------+
| *             | Returns all fields available.                                          |
| title         | Title of the submitted page. Returned by default.                      |
| nextPage      | Link to next page (if within a gallery or paginated list of images).   |
|               | Returned by default.                                                   |
| albumUrl      | Link to containing album (if image is within an album).                |
|               | Returned by default.                                                   |
| url           | URL submitted. Returned by default.                                    |
| resolved_url  | Returned if the resolving URL is different from the submitted URL      |
|               | (e.g., link shortening services).                                      |
| meta          | Returns the full contents of page meta tags,                           |
|               | including sub-arrays for OpenGraph tags, Twitter Card metadata,        |
|               | schema.org microdata, and -- if available -- oEmbed metadata.          |
|               | Returned with fields.                                                  |
| querystring   | Returns the key/value pairs of the URL querystring, if present.        |
|               | Items without a value will be returned as "true".                      |
|               | Returned with fields.                                                  |
| links         | Returns all links (anchor tag href values) found on the page.          |
|               | Returned with fields.                                                  |
| images        | An array of image(s) contained on the page.                            |
+---------------+------------------------------------------------------------------------+
| For each item in the images array:                                                     |
+---------------+------------------------------------------------------------------------+
| url           | Direct link to image file. Returned by default.                        |
| anchorUrl     | If the image is wrapped in an anchor (a) tag, the anchor location      |
|               | as defined by the href attribute. Returned by default.                 |
| mime          | MIME type, if available, as specified by "Content-Type" of the image.  |
|               | Returned with fields.                                                  |
| caption       | The best caption for this image. Returned by default.                  |
| attrAlt       | Contents of the alt attribute, if available within the HTML IMG tag.   |
|               | Returned with fields.                                                  |
| attrTitle     | Contents of the title attribute, if available within the HTML IMG tag. |
|               | Returned with fields.                                                  |
| date          | Date of image upload or creation if available in page metadata.        |
|               | Returned by default.                                                   |
| size          | Size in bytes of image file. Returned by default.                      |
| pixelHeight   | Actual height, in pixels, of image file. Returned by default.          |
| pixelWidth    | Actual width, in pixels, of image file. Returned by default.           |
| displayHeight | Height of image as rendered on page, if different from actual          |
|               | (pixel) height. Returned with fields.                                  |
| displayWidth  | Width of image as rendered on page, if different from actual           |
|               | (pixel) width. Returned with fields.                                   |
| meta          | Comma-separated list of image-embedded metadata                        |
|               | (e.g., EXIF, XMP, ICC Profile), if available within the image file.    |
|               | Returned with fields.                                                  |
| faces         | The x, y, height, and width coordinates of human faces.                |
|               | Null, if no faces were found. Returned with fields.                    |
| ocr           | If text is identified within the image, we will attempt to recognize   |
|               | the text string. Returned with fields.                                 |
| colors        | Returns an array of hex values of the dominant colors                  |
|               | within the image. Returned with fields.                                |
| xpath         | XPath expression identifying the node containing the image.            |
|               | Returned by default.                                                   |
+---------------+------------------------------------------------------------------------+

Example Response

This is a simple response:

{
  "title": "The National Flower - Rose",
  "type": "image",
  "url": "http://www.statesymbolsusa.org/National_Symbols/National_flower.html",
  "images": [
    {
      "attrAlt": "Red rose in full bloom - click to see state flowers",
      "pixelHeight": 371,
      "pixelWidth": 300,
      "displayWidth": 300,
      "meta": [
          "[Jpeg] Compression Type - Baseline",
          "[Jpeg] Data Precision - 8 bits",
          "[Jpeg] Image Height - 371 pixels",
          "[Jpeg] Image Width - 300 pixels",
          "[Jpeg] Number of Components - 3",
          "[Jpeg] Component 1 - Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
          "[Jpeg] Component 2 - Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
          "[Jpeg] Component 3 - Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
          "[Jfif] Version - 1.2",
          "[Jfif] Resolution Units - none",
          "[Jfif] X Resolution - 100 dots",
          "[Jfif] Y Resolution - 100 dots",
          "[Adobe Jpeg] DCT Encode Version - 1",
          "[Adobe Jpeg] Flags 0 - 192",
          "[Adobe Jpeg] Flags 1 - 0",
          "[Adobe Jpeg] Color Transform - YCbCr"
          ],
      "url": "http://www.statesymbolsusa.org/IMAGES/rose_usda-web.jpg",
      "size": 12328,
      "displayHeight": 371,
      "xpath": "/HTML[1]/BODY[1]/DIV[1]/TABLE[3]/TBODY[1]/TR[2]/..."
    },
    {
      "attrAlt": "Yellow rose - click to see state flowers",
      "pixelHeight": 304,
      "pixelWidth": 380,
      "displayWidth": 380,
      "meta": [
          "[Jpeg] Compression Type - Baseline",
          "[Jpeg] Data Precision - 8 bits",
          "[Jpeg] Image Height - 304 pixels",
          "[Jpeg] Image Width - 380 pixels",
          "[Jpeg] Number of Components - 3",
          "[Jpeg] Component 1 - Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
          "[Jpeg] Component 2 - Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
          "[Jpeg] Component 3 - Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
          "[Jfif] Version - 1.2",
          "[Jfif] Resolution Units - none",
          "[Jfif] X Resolution - 100 dots",
          "[Jfif] Y Resolution - 100 dots",
          "[Adobe Jpeg] DCT Encode Version - 1",
          "[Adobe Jpeg] Flags 0 - 192",
          "[Adobe Jpeg] Flags 1 - 0",
          "[Adobe Jpeg] Color Transform - YCbCr"
          ],
      "url": "http://www.statesymbolsusa.org/IMAGES/rose_yellow-380.jpg",
      "size": 12142,
      "displayHeight": 304,
      "xpath": "/HTML[1]/BODY[1]/DIV[1]/TABLE[3]/TBODY[1]/TR[2]/..."
    }
  ]
}
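A response like this can be decoded with encoding/json. The sketch below uses a local imageInfo type that mirrors only a few fields of the Image struct, and an abridged copy of the second image from the example as input; both are assumptions made to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// imageInfo mirrors a small subset of the Image struct documented above.
type imageInfo struct {
	Title  string `json:"title"`
	Images []struct {
		Url         string   `json:"url"`
		PixelHeight int      `json:"pixelHeight"`
		PixelWidth  int      `json:"pixelWidth"`
		Size        int      `json:"size"`
		Meta        []string `json:"meta"`
	} `json:"images"`
}

// sample is an abridged copy of the second image in the example response.
const sample = `{
  "title": "The National Flower - Rose",
  "type": "image",
  "images": [
    {
      "pixelHeight": 304,
      "pixelWidth": 380,
      "url": "http://www.statesymbolsusa.org/IMAGES/rose_yellow-380.jpg",
      "size": 12142,
      "meta": ["[Jpeg] Compression Type - Baseline", "[Jfif] Version - 1.2"]
    }
  ]
}`

// decodeImageInfo unmarshals an Image API response body.
// Unknown keys such as "type" are ignored by encoding/json.
func decodeImageInfo(data []byte) (*imageInfo, error) {
	var info imageInfo
	if err := json.Unmarshal(data, &info); err != nil {
		return nil, err
	}
	return &info, nil
}

func main() {
	info, err := decodeImageInfo([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, img := range info.Images {
		fmt.Printf("%s: %dx%d, %d bytes\n", img.Url, img.PixelWidth, img.PixelHeight, img.Size)
	}
}
```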

func (*Image) String

func (p *Image) String() string

type Options

type Options struct {
	Fields                 string
	Timeout                time.Duration
	Callback               string
	FrontpageAll           string
	ClassifierMode         string
	ClassifierStats        string
	BulkNotifyEmail        string
	BulkNotifyWebHook      string
	BulkRepeat             string
	BulkMaxRounds          string
	BulkPageProcessPattern string
	CrawlMaxToCrawl        string
	CrawlMaxToProcess      string
	CrawlRestrictDomain    string
	CrawlNotifyEmail       string
	CrawlNotifyWebHook     string
	CrawlDelay             string
	CrawlRepeat            string
	CrawlOnlyProcessIfNew  string
	CrawlMaxRounds         string
	BatchMethod            string
	BatchRelativeUrl       string
	CustomHeader           http.Header
}

Options holds the optional parameters for the Diffbot client.

See http://diffbot.com/products/automatic/

func (*Options) MethodParamString

func (p *Options) MethodParamString(method string) string

MethodParamString returns the options as a URL parameter string for the given method.

If the Options is not empty, the returned string begins with a '&'.

type Product

type Product struct {
	Url         string                 `json:"url"`
	ResolvedUrl string                 `json:"resolved_url"`
	Meta        map[string]interface{} `json:"meta,omitempty"`        // Returned with fields.
	QueryString string                 `json:"querystring,omitempty"` // Returned with fields.
	Links       []string               `json:"links,omitempty"`       // Returned with fields.
	Breadcrumb  []string               `json:"breadcrumb"`
	Products    []struct {
		Title       string `json:"title"`
		Description string `json:"description"`
		Brand       string `json:"brand,omitempty"` // Returned with fields.
		Medias      []struct {
			Type    string `json:"type"`
			Link    string `json:"link"`
			Height  int    `json:"height"`
			Width   int    `json:"width"`
			Caption string `json:"caption"`
			Primary string `json:"primary"`
			XPath   string `json:"xpath"`
		} `json:"media"`
		OfferPrice     string `json:"offerPrice"`
		RegularPrice   string `json:"regularPrice"`
		SaveAmount     string `json:"saveAmount"`
		ShippingAmount string `json:"shippingAmount"`
		ProductId      string `json:"productId"`
		Upc            string `json:"upc"`
		PrefixCode     string `json:"prefixCode"`
		ProductOrigin  string `json:"productOrigin"`
		Isbn           string `json:"isbn"`
		Sku            string `json:"sku,omitempty"` // Returned with fields.
		Mpn            string `json:"mpn,omitempty"` // Returned with fields.
	} `json:"products"`
}

Product represents information about a shopping or e-commerce product.

See http://diffbot.com/dev/docs/product/

func ParseProduct

func ParseProduct(token, url string, opt *Options) (*Product, error)

ParseProduct parses a shopping or e-commerce product page and returns information on the product.

Request

To use the Product API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/product

Provide the following arguments:

+----------+-------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                             |
+----------+-------------------------------------------------------------------------+
| token    | Developer token                                                         |
| url      | Product URL to process (URL encoded)                                    |
+----------+-------------------------------------------------------------------------+
| Optional arguments                                                                 |
+----------+-------------------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.                   |
|          | See the Response section below.                                         |
| timeout  | Set a value in milliseconds to terminate the response.                  |
|          | By default the Product API has no timeout.                              |
| callback | Use for jsonp requests. Needed for cross-domain ajax.                   |
+----------+-------------------------------------------------------------------------+
| Basic authentication                                                               |
+------------------------------------------------------------------------------------+
| To access pages that require a login/password (using basic access authentication), |
| include the username and password in your url parameter,                           |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                           |
+------------------------------------------------------------------------------------+

Response

The Product API returns product details in the products array. Currently, extracted data is returned for a single product only. In the future, the API will return information for multiple products if multiple items are available on the same page.

Use the fields query parameter to limit or expand which fields are returned in the JSON response. For product-specific content your desired fields should be contained within the 'products' parentheses:

http://api.diffbot.com/v2/product...&fields=products(offerPrice,sku)

Response fields:

+----------------+-------------------------------------------------------------------+
| FIELD          | DESCRIPTION                                                       |
+----------------+-------------------------------------------------------------------+
| *              | Returns all fields available.                                     |
| url            | URL submitted. Returned by default.                               |
| resolved_url   | Returned if the resolving URL is different from the               |
|                | submitted URL (e.g., link shortening services).                   |
|                | Returned by default.                                              |
| meta           | Returns the full contents of page meta tags,                      |
|                | including sub-arrays for OpenGraph tags,                          |
|                | Twitter Card metadata, schema.org microdata,                      |
|                | and -- if available -- oEmbed metadata.                           |
|                | Returned with fields.                                             |
| querystring    | Returns the key/value pairs of the URL querystring, if present.   |
|                | Items without a value will be returned as "true".                 |
|                | Returned with fields.                                             |
| links          | Returns all links (anchor tag href values) found on the page.     |
|                | Returned with fields.                                             |
| breadcrumb     | If available, an array of link URLs and link text                 |
|                | from page breadcrumbs. Returned by default.                       |
+----------------+-------------------------------------------------------------------+
| For each item in the products array:                                               |
+----------------+-------------------------------------------------------------------+
| title          | Name of the product. Returned by default.                         |
| description    | Description, if available, of the product.                        |
|                | Returned by default.                                              |
| brand          | Experimental. Brand of the product, if available.                 |
|                | Returned with fields.                                             |
| media          | Array of media items (images or videos) of the product.           |
|  |             | Returned by default.                                              |
|  +- type       | Type of media identified (image or video).                        |
|  +- link       | Direct (fully resolved) link to image or video content.           |
|  +- height     | Image height, in pixels.                                          |
|  +- width      | Image width, in pixels.                                           |
|  +- caption    | Diffbot-determined best caption for the image.                    |
|  +- primary    | Only images. Returns "True" if image is identified                |
|  |             | as primary in terms of size or positioning.                       |
|  +- xpath      | Full document XPath to the media item.                            |
|                                                                                    |
| offerPrice     | Identified offer or actual/'final' price of the product.          |
|                | Returned by default.                                              |
| regularPrice   | Regular or original price of the product, if available.           |
|                | Returned by default.                                              |
| saveAmount     | Discount or amount saved, if available. Returned by default.      |
| shippingAmount | Shipping price, if available. Returned by default.                |
| productId      | A Diffbot-determined unique product ID.                           |
|                | If upc, isbn, mpn or sku are identified on the page,              |
|                | productId will select from these values in the above order.       |
|                | Otherwise Diffbot will attempt to derive the best unique          |
|                | value for the product. Returned by default.                       |
| upc            | Universal Product Code (UPC/EAN), if available.                   |
|                | Returned by default.                                              |
| prefixCode     | GTIN prefix code, typically the country of origin                 |
|                | as identified by UPC/ISBN. Returned by default.                   |
| productOrigin  | If available, the two-character ISO country code where            |
|                | the product was produced. Returned by default.                    |
| isbn           | International Standard Book Number (ISBN), if available.          |
|                | Returned by default.                                              |
| sku            | Stock Keeping Unit -- store/vendor inventory                      |
|                | number -- if available. Returned with fields.                     |
| mpn            | Manufacturer's Product Number, if available.                      |
|                | Returned with fields.                                             |
+----------------+-------------------------------------------------------------------+
| The following fields are in an early beta stage:                                   |
+----------------+-------------------------------------------------------------------+
| availability   | Item's availability, either true or false. Returned by default.   |
| brand          | The item brand, if identified. Returned with fields.              |
| quantityPrices | If a product page includes discounts for quantity purchases,      |
|                | quantityPrices will return an array of quantity and price values. |
|                | Returned with fields.                                             |
+----------------+-------------------------------------------------------------------+
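The productId row above describes a precedence among identifiers: upc, isbn, mpn, then sku. The sketch below only illustrates that ordering on the client side. The type ids and the function pickProductID are hypothetical, and the fallback where Diffbot derives its own value is not modeled.

```go
package main

import "fmt"

// ids holds the identifier fields of one product entry.
type ids struct {
	Upc, Isbn, Mpn, Sku string
}

// pickProductID returns the first non-empty identifier in the documented
// order (upc, isbn, mpn, sku), or "" when none is present.
func pickProductID(p ids) string {
	for _, v := range []string{p.Upc, p.Isbn, p.Mpn, p.Sku} {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Made-up identifier values, for illustration only.
	fmt.Println(pickProductID(ids{Isbn: "isbn-value", Sku: "sku-value"}))
}
```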

Example Response

This is a simple response:

{
  "type": "product",
  "products": [
    {
      "title": "Before I Go To Sleep",
      "description": "Memories define us...",
      "offerPrice": "$7.99",
      "regularPrice": "$9.99",
      "saveAmount": "$2.00",
      "media": [
        {
          "height": 480,
          "width": 340,
          "link": "http://cdn.shopify.com/s/files/1/0184/6296/products/BeforeIGoToSleep_large.png?946",
          "type": "image",
          "xpath": "/HTML[@class='no-js']/BODY[@id='page-product']..."
        }
      ]
    }
  ],
  "url": "http://store.livrada.com/collections/all/products/before-i-go-to-sleep"
}

func (*Product) String

func (p *Product) String() string

Directories

Path Synopsis
Diffbot Client
Diffbot Client
