Documentation ¶
Overview ¶
Package diffbot implements a Diffbot client library.
Using AI, computer vision, machine learning and natural language processing, Diffbot provides developers with numerous tools to understand and extract data from any web page.
Generic API ¶
The basic API is diffbot.DiffbotServer:
package main

import (
	"fmt"
	"log"

	"github.com/diffbot/diffbot-go-client"
)

var (
	token = `0123456789abcdef0123456789abcdef` // invalid token, just an example
	url   = `http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/`
)

func main() {
	respBody, err := diffbot.DiffbotServer(diffbot.DefaultServer, "article", token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(respBody))
}
The diffbot.Diffbot API uses diffbot.DefaultServer as the server.
Article API ¶
The Article API uses diffbot.Diffbot to invoke the "article" method, and converts the response body to a diffbot.Article struct.
func main() {
	article, err := diffbot.ParseArticle(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(article)
}
Frontpage API ¶
The Frontpage API uses diffbot.Diffbot to invoke the "frontpage" method, and converts the response body to a diffbot.Frontpage struct.
func main() {
	page, err := diffbot.ParseFrontpage(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(page)
}
Image API ¶
The Image API uses diffbot.Diffbot to invoke the "image" method, and converts the response body to a diffbot.Image struct.
func main() {
	imgInfo, err := diffbot.ParseImage(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(imgInfo)
}
Product API ¶
The Product API uses diffbot.Diffbot to invoke the "product" method, and converts the response body to a diffbot.Product struct.
func main() {
	product, err := diffbot.ParseProduct(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(product)
}
Classification API ¶
The Classification API uses diffbot.Diffbot to invoke the "analyze" method, and converts the response body to a diffbot.Classification struct.
func main() {
	info, err := diffbot.ParseClassification(token, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info)
}
Options ¶
Use `diffbot.Options` to specify the options:
func main() {
	opt := &diffbot.Options{
		Fields:   "meta,querystring,images(*)",
		Timeout:  time.Millisecond * 1000,
		Callback: "",
	}
	fmt.Println(opt.MethodParamString("article"))
	// Output:
	// &fields=meta,querystring,images(*)&timeout=1000
}
You can call Diffbot with custom headers:
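func main() {
	// CustomHeader is an http.Header (a map), so it must be initialized before Add is called.
	opt := &diffbot.Options{CustomHeader: http.Header{}}
	opt.CustomHeader.Add("X-Forward-Cookie", "abc=123")
	// The method name ("article" here) is passed first, as with DiffbotServer.
	respBody, err := diffbot.Diffbot("article", token, url, opt)
	...
}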
Error handling ¶
If the Diffbot server returns an error message, it is converted to a `diffbot.Error`:
func main() {
	respBody, err := diffbot.Diffbot("article", token, url, nil)
	if err != nil {
		if apiErr, ok := err.(*diffbot.Error); ok {
			// API error, e.g. {"error":"Not authorized API token.","errorCode":401}
			log.Fatalf("diffbot error %d: %s", apiErr.ErrCode, apiErr.ErrMessage)
		}
		log.Fatal(err)
	}
	fmt.Println(string(respBody))
}
Other ¶
The Diffbot API documentation is at http://diffbot.com/dev/docs/ or http://diffbot.com/products/.
Please report bugs to <chaishushan@gmail.com>.
Constants ¶
const (
DefaultServer = `http://api.diffbot.com/v2`
)
Functions ¶
Types ¶
type Article ¶
type Article struct {
	Url           string                 `json:"url"`
	ResolvedUrl   string                 `json:"resolved_url"`
	Icon          string                 `json:"icon"`
	Meta          map[string]interface{} `json:"meta,omitempty"`          // Returned with fields.
	QueryString   string                 `json:"querystring,omitempty"`   // Returned with fields.
	Links         []string               `json:"links,omitempty"`         // Returned with fields.
	Type          string                 `json:"type"`
	Title         string                 `json:"title"`
	Text          string                 `json:"text"`
	Html          string                 `json:"html"`
	NumPages      string                 `json:"numPages"`
	Date          string                 `json:"date"`
	Author        string                 `json:"author"`
	Tags          []string               `json:"tags,omitempty"`          // Returned with fields.
	HumanLanguage string                 `json:"humanLanguage,omitempty"` // Returned with fields.
	Images        []struct {
		Url         string `json:"url"`
		PixelHeight int    `json:"pixelHeight"`
		PixelWidth  int    `json:"pixelWidth"`
		Caption     string `json:"caption"`
		Primary     string `json:"primary"`
	} `json:"images"`
	Videos []struct {
		Url         string `json:"url"`
		PixelHeight int    `json:"pixelHeight"`
		PixelWidth  int    `json:"pixelWidth"`
		Primary     string `json:"primary"`
	} `json:"videos"`
}
Article represents the clean text of an article.
See http://diffbot.com/dev/docs/article/
func ParseArticle ¶
ParseArticle parses the clean article text from news article web pages.
Request ¶
To use the Article API, perform an HTTP GET request on the following endpoint:
http://api.diffbot.com/v2/article
Provide the following arguments:
+----------+-----------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                     |
+----------+-----------------------------------------------------------------+
| token    | Developer token                                                 |
| url      | Article URL to process (URL encoded).                           |
|          | If you wish to POST content, please see POSTing Content, below. |
+----------+-----------------------------------------------------------------+
| Optional arguments                                                         |
+----------+-----------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.           |
|          | See the Response section below.                                 |
| timeout  | Set a value in milliseconds to terminate the response.          |
|          | By default the Article API has a five second timeout.           |
| callback | Use for jsonp requests. Needed for cross-domain ajax.           |
+----------+-----------------------------------------------------------------+
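The arguments above map directly onto a query string. The following is a minimal sketch, using only the Go standard library, of how a request URL for the Article API could be assembled. The helper buildArticleURL is hypothetical and is not part of the diffbot package; it only illustrates the encoding step.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// articleEndpoint is the Article API endpoint given in the text above.
const articleEndpoint = "http://api.diffbot.com/v2/article"

// buildArticleURL is a hypothetical helper (not part of the diffbot package).
// It percent-encodes the arguments and appends them to the endpoint.
// An empty fields string or a non-positive timeout is left out.
func buildArticleURL(token, pageURL, fields string, timeoutMs int) string {
	q := url.Values{}
	q.Set("token", token)
	q.Set("url", pageURL)
	if fields != "" {
		q.Set("fields", fields)
	}
	if timeoutMs > 0 {
		q.Set("timeout", strconv.Itoa(timeoutMs))
	}
	// Values.Encode sorts the keys and percent-encodes each value.
	return articleEndpoint + "?" + q.Encode()
}

func main() {
	fmt.Println(buildArticleURL("TOKEN", "http://blog.diffbot.com/", "meta,tags", 1000))
}
```

Values.Encode takes care of the "URL encoded" requirement on the url argument, at the cost of also encoding the commas in fields.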
Response ¶
The Article API returns information about the primary article content on the submitted page.
Use the fields query parameter to limit or expand which fields are returned in the JSON response. For nested arrays, use parentheses to retrieve specific fields, or * to return all sub-fields.
http://api.diffbot.com/v2/article?...&fields=meta,querystring,images(*)
Example Response ¶
This is a simple response:
{
  "type": "article",
  "icon": "http://www.diffbot.com/favicon.ico",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "date": "Wed, 31 Jul 2013 08:00:00 GMT",
  "videos": [
    {
      "primary": "true",
      "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"
    }
  ],
  "tags": [
    "e-commerce",
    "SaaS"
  ],
  "url": "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
  "humanLanguage": "en",
  "text": "Diffbot's human wranglers are proud today to announce the release of our newest product..."
}
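A response of this shape can be decoded with encoding/json. The sketch below uses a hypothetical, trimmed-down struct (articleSummary) rather than diffbot.Article, and a shortened copy of the example response, so that it runs without the diffbot package or a network call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSummary is a hypothetical subset of the fields shown in the
// example response; it is not the package's Article type.
type articleSummary struct {
	Type   string   `json:"type"`
	Title  string   `json:"title"`
	Author string   `json:"author"`
	Tags   []string `json:"tags,omitempty"`
}

// decodeArticle unmarshals a JSON response body into an articleSummary.
// Fields present in the body but absent from the struct are ignored.
func decodeArticle(body []byte) (*articleSummary, error) {
	var a articleSummary
	if err := json.Unmarshal(body, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

// sampleArticle is a shortened copy of the example response above.
const sampleArticle = `{
  "type": "article",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "tags": ["e-commerce", "SaaS"],
  "humanLanguage": "en"
}`

func main() {
	a, err := decodeArticle([]byte(sampleArticle))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(a.Title, "by", a.Author, a.Tags)
}
```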
Authentication and Custom Headers ¶
You can supply Diffbot with custom headers, or basic authentication credentials, in order to access intranet pages or other sites that require a login.
Basic Authentication
To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com.
Custom Headers
You can supply the Article API with custom values for the user-agent, referer, or cookie values in the HTTP request. These will be used in place of the Diffbot default values.
To provide custom headers, pass in the following values in your own headers when calling the Diffbot API:
+----------------------+-----------------------------------------------------------------------+
| HEADER               | DESCRIPTION                                                           |
+----------------------+-----------------------------------------------------------------------+
| X-Forward-User-Agent | Will be used as Diffbot's User-Agent header when making your request. |
| X-Forward-Referer    | Will be used as Diffbot's Referer header when making your request.    |
| X-Forward-Cookie     | Will be used as Diffbot's Cookie header when making your request.     |
+----------------------+-----------------------------------------------------------------------+
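At the HTTP level, the headers in the table are ordinary request headers. The following sketch builds such a request with net/http without sending it. newForwardingRequest is a hypothetical helper, not part of the diffbot package, and the API URL in main is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
)

// newForwardingRequest is a hypothetical helper (not part of the diffbot
// package). It builds a GET request for apiURL and attaches the X-Forward-*
// headers from the table above. Empty values are skipped.
func newForwardingRequest(apiURL, userAgent, referer, cookie string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, apiURL, nil)
	if err != nil {
		return nil, err
	}
	if userAgent != "" {
		req.Header.Set("X-Forward-User-Agent", userAgent)
	}
	if referer != "" {
		req.Header.Set("X-Forward-Referer", referer)
	}
	if cookie != "" {
		req.Header.Set("X-Forward-Cookie", cookie)
	}
	return req, nil
}

func main() {
	req, err := newForwardingRequest(
		"http://api.diffbot.com/v2/article?token=TOKEN&url=http%3A%2F%2Fwww.diffbot.com",
		"", "", "abc=123")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Header.Get("X-Forward-Cookie"))
}
```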
Posting Content ¶
If your content is not publicly available (e.g., behind a firewall), you can POST markup directly to the Article API endpoint for analysis:
http://api.diffbot.com/v2/article?token=...&url=...
Please note that the url parameter is still required in the endpoint, and will be used to resolve any relative links contained in the markup.
Provide markup to analyze as your POST body, and specify the Content-Type header as text/html.
The following call submits a sample to the API:
curl -H "Content-Type:text/html" -d 'Now is the time for all good robots to come to the aid of their-- oh never mind, run!' 'http://api.diffbot.com/v2/article?token=...&url=http://www.diffbot.com/products/automatic/article'
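The same POST can be expressed in Go with net/http. This is a sketch under the assumptions stated in the text (markup in the body, Content-Type text/html, token and url still in the query string). newPostRequest is a hypothetical helper, and the request is only constructed here, not sent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newPostRequest is a hypothetical helper (not part of the diffbot package).
// It builds a POST request that carries markup as the body; pageURL is still
// passed in the query string, as the text above requires.
func newPostRequest(token, pageURL, markup string) (*http.Request, error) {
	q := url.Values{}
	q.Set("token", token)
	q.Set("url", pageURL)
	endpoint := "http://api.diffbot.com/v2/article?" + q.Encode()

	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(markup))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "text/html")
	return req, nil
}

func main() {
	req, err := newPostRequest("TOKEN",
		"http://www.diffbot.com/products/automatic/article",
		"Now is the time for all good robots to come to the aid of their-- oh never mind, run!")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Content-Type"))
}
```

To actually submit the request, pass it to http.DefaultClient.Do and read the response body.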
type Classification ¶
type Classification struct {
	Type          string `json:"type"`
	Title         string `json:"title"`
	Url           string `json:"url"`
	ResolvedUrl   string `json:"resolved_url,omitempty"` // Returned with fields.
	HumanLanguage string `json:"humanLanguage"`
	Stats         struct {
		Types struct {
			Article     float64 `json:"article"`
			Audio       float64 `json:"audio"`
			Chart       float64 `json:"chart"`
			Discussion  float64 `json:"discussion"`
			Document    float64 `json:"document"`
			Download    float64 `json:"download"`
			Error       float64 `json:"error"`
			Event       float64 `json:"event"`
			Faq         float64 `json:"faq"`
			Frontpage   float64 `json:"frontpage"`
			Game        float64 `json:"game"`
			Image       float64 `json:"image"`
			Job         float64 `json:"job"`
			Location    float64 `json:"location"`
			Other       float64 `json:"other"`
			Product     float64 `json:"product"`
			Profile     float64 `json:"profile"`
			Recipe      float64 `json:"recipe"`
			ReviewsList float64 `json:"reviewslist"`
			Serp        float64 `json:"serp"`
			Video       float64 `json:"video"`
		} `json:"types"`
	} `json:"stats"`
}
Classification represents the classification result for a page.
See http://diffbot.com/dev/docs/analyze/
func ParseClassification ¶
func ParseClassification(token, url string, opt *Options) (*Classification, error)
ParseClassification analyzes a web page's layout, structure, markup, text and other components and classifies the page as a particular "type." It also fully extracts the page contents if the page matches an existing Diffbot extraction API.
Please note: The Page Classifier API is currently in beta.
Request ¶
To use the Classifier API, perform an HTTP GET request on the following endpoint:
http://api.diffbot.com/v2/analyze?token=...&url=...
Provide the following arguments:
+----------+----------------------------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                                                  |
+----------+----------------------------------------------------------------------------------------------+
| token    | Developer token                                                                              |
| url      | URL to classify (URLEncoded)                                                                 |
+----------+----------------------------------------------------------------------------------------------+
| Optional arguments                                                                                      |
+----------+----------------------------------------------------------------------------------------------+
| mode     | By default the Page Classifier API will fully extract                                        |
|          | pages that match an existing Diffbot Automatic API.                                          |
|          | Set mode to a specific page-type (e.g., mode=article)                                        |
|          | to extract content only from that particular page-type.                                      |
|          | All others will simply return the page classification information.                           |
| fields   | You can choose the fields to be returned                                                     |
|          | by the Diffbot extraction API by supplying a comma-separated                                 |
|          | list of fields, e.g.:                                                                        |
|          | http://api.diffbot.com/v2/analyze?token=...&url=http://diffbot.com/company&fields=meta,tags. |
| stats    | Returns statistics on page classification and extraction,                                    |
|          | including an array of individual page-types and                                              |
|          | the Diffbot-determined score (likelihood) for each type.                                     |
+----------+----------------------------------------------------------------------------------------------+
| Basic authentication                                                                                    |
+---------------------------------------------------------------------------------------------------------+
| To access pages that require a login/password                                                           |
| (using basic access authentication), include the username and password                                  |
| in your url parameter,                                                                                  |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                                                |
+---------------------------------------------------------------------------------------------------------+
Response ¶
The Classifier API returns, depending on parameters, the following:
+----------------+------------------------------------------------------------------+
| FIELD          | DESCRIPTION                                                      |
+----------------+------------------------------------------------------------------+
| type           | Page-type of the submitted URL (from the below enumerated list). |
|                | Always returned.                                                 |
| title          | Page title. Returned by default.                                 |
| url            | Submitted URL. Always returned.                                  |
| resolved_url   | Returned if the resolving URL is different from                  |
|                | the submitted URL (e.g., link shortening services).              |
|                | Returned by default, configurable with fields                    |
| human_language | Returns the (spoken/human) language of the submitted URL,        |
|                | using two-letter ISO 639-1 nomenclature.                         |
|                | Returned by default.                                             |
+----------------+------------------------------------------------------------------+
Example Response ¶
This is a simple response:
{
  "type": "article",
  "stats": {
    "types": {
      "article": 0.46,
      "audio": 0.15,
      "chart": 0.01,
      "discussion": 0.03,
      "document": 0.04,
      "download": 0.01,
      "error": 0.00,
      "event": 0.00,
      "faq": 0.02,
      "frontpage": 0.12,
      "game": 0.01,
      "image": 0.02,
      "job": 0.02,
      "location": 0.08,
      "product": 0.09,
      "profile": 0.09,
      "recipe": 0.08,
      "reviewslist": 0.09,
      "serp": 0.06,
      "video": 0.01
    }
  },
  "resolved_url": "http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/",
  "url": "http://tcrn.ch/Jw7ZKw",
  "human_language": "en"
}
Page Types ¶
Diffbot currently classifies pages into the following types. Please note this list will evolve over time to include additional page types.
+-------------+-----------------------------------------------------------------------------------+
| PAGE TYPE   | DESCRIPTION                                                                       |
+-------------+-----------------------------------------------------------------------------------+
| None        | Returned if Diffbot confidence in the page classification is low.                 |
|             | Use of the stats field will give you the individual scores for each page-type.    |
| article     | A news article, blog post or other primarily-text page.                           |
| audio       | A music or audio player.                                                          |
| chart       | A graph or chart, typically financial.                                            |
| discussion  | Specific forum, group or discussion topic.                                        |
| document    | An embedded or downloadable document or slideshow.                                |
| download    | A downloadable file.                                                              |
| error       | Error page, e.g. 404.                                                             |
| event       | A page detailing specific event information,                                      |
|             | e.g. time/date/location.                                                          |
| faq         | A page of multiple frequently asked questions, or a single FAQ entry.             |
| frontpage   | A news- or blog-style home page, with links to myriad sections and items.         |
| game        | A playable game.                                                                  |
| image       | An image or photo page.                                                           |
| job         | A job posting.                                                                    |
| location    | A page detailing location information, typically including an address and/or map. |
| other       | Returned if the result is below a certain confidence threshold.                   |
| product     | A product page, typically of a product for purchase.                              |
| profile     | A person or user profile page.                                                    |
| recipe      | Page detailing recipe instructions and ingredients.                               |
| reviewslist | A list of user reviews.                                                           |
| serp        | A Search Engine Results Page                                                      |
| video       | An individual video.                                                              |
+-------------+-----------------------------------------------------------------------------------+
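When the stats argument is set, the per-type scores can be inspected directly, for example to see which page type came closest when the overall type is None. The sketch below decodes the stats into a map (an assumption made for brevity; diffbot.Classification uses a struct with one field per type) and returns the highest-scoring type. topType is a hypothetical helper, and the sample body is a shortened copy of the example response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classifierStats is a hypothetical view of the response that keeps the
// per-type scores in a map instead of one struct field per page type.
type classifierStats struct {
	Type  string `json:"type"`
	Stats struct {
		Types map[string]float64 `json:"types"`
	} `json:"stats"`
}

// topType returns the page type with the highest score and that score.
// Ties are broken by name so the result is deterministic. If the body
// contains no scores, it returns "" and -1.
func topType(body []byte) (string, float64, error) {
	var c classifierStats
	if err := json.Unmarshal(body, &c); err != nil {
		return "", 0, err
	}
	best, bestScore := "", -1.0
	for name, score := range c.Stats.Types {
		if score > bestScore || (score == bestScore && name < best) {
			best, bestScore = name, score
		}
	}
	return best, bestScore, nil
}

func main() {
	body := []byte(`{"type":"article","stats":{"types":{"article":0.46,"audio":0.15,"frontpage":0.12}}}`)
	name, score, err := topType(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(name, score)
}
```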
func (*Classification) String ¶
func (p *Classification) String() string
type Error ¶
type Error struct {
	ErrCode    int    `json:"errorCode"` // Error code per the chart below
	ErrMessage string `json:"error"`     // Description of the error
	RawString  string `json:"-"`         // Raw json format error string
}
Error represents an error returned by the Diffbot APIs.
When issues arise, Diffbot APIs return the following fields in a JSON response.
Sample error response:
{
  "error": "Could not download page (404)",
  "errorCode": 404
}
Possible errors returned:
+------+-----------------------------------------------------------------------------------------------------+
| CODE | DESCRIPTION                                                                                         |
+------+-----------------------------------------------------------------------------------------------------+
| 401  | Unauthorized token                                                                                  |
| 404  | Requested page not found                                                                            |
| 429  | Your token has exceeded the allowed number of calls, or has otherwise been throttled for API abuse. |
| 500  | Error processing the page. Specific information will be returned in the JSON response.              |
+------+-----------------------------------------------------------------------------------------------------+
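One way to detect such a response is to decode the body and check for a non-zero errorCode. The sketch below does this with a hypothetical local type (apiError) that mirrors the two JSON fields; it is not the package's Error type, and it assumes that successful responses are JSON objects without a top-level errorCode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError is a hypothetical local type mirroring the "error" and
// "errorCode" fields of the sample response. It implements error.
type apiError struct {
	Code    int    `json:"errorCode"`
	Message string `json:"error"`
}

func (e *apiError) Error() string {
	return fmt.Sprintf("diffbot: %s (code %d)", e.Message, e.Code)
}

// checkBody returns an *apiError if body carries a non-zero errorCode,
// a decoding error if body is not a JSON object, and nil otherwise.
func checkBody(body []byte) error {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil {
		return err
	}
	if e.Code != 0 {
		return &e
	}
	return nil
}

func main() {
	err := checkBody([]byte(`{"error":"Could not download page (404)","errorCode":404}`))
	fmt.Println(err)
}
```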
type Frontpage ¶
type Frontpage struct {
	Id        int64  `json:"id,string"`
	Title     string `json:"title"`
	SourceURL string `json:"sourceURL"`
	Icon      string `json:"icon"`
	NumItems  int    `json:"numItems"`
	Items     []struct {
		Id          int     `json:"id"`
		Title       string  `json:"title"`
		Description string  `json:"description"`
		XRoot       string  `json:"xroot"`
		PubDate     string  `json:"pubDate"`
		Link        string  `json:"link"`
		Type        string  `json:"type"` // STORY/LINK/...
		Img         string  `json:"img"`
		TextSummary string  `json:"textSummary"`
		Sp          float64 `json:"sp"`
		Sr          float64 `json:"sr"`
		Fresh       float64 `json:"fresh"`
	} `json:"items,omitempty"`
}
Frontpage represents the information extracted from a front page.
See http://diffbot.com/dev/docs/frontpage/
func ParseFrontpage ¶
ParseFrontpage parses a multifaceted "homepage" and returns individual page elements.
Request ¶
To use the Frontpage API, perform an HTTP GET request on the following endpoint:
http://api.diffbot.com/v2/frontpage
Provide the following arguments:
+----------+----------------------------------------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                                                              |
+----------+----------------------------------------------------------------------------------------------------------+
| token    | Developer token                                                                                          |
| url      | Frontpage URL from which to extract items                                                                |
+----------+----------------------------------------------------------------------------------------------------------+
| Optional arguments                                                                                                  |
+----------+----------------------------------------------------------------------------------------------------------+
| timeout  | Specify a value in milliseconds (e.g., &timeout=15000) to override the default API timeout of 5000ms.    |
| format   | Format the response output in xml (default) or json                                                      |
| all      | Returns all content from page, including navigation and similar links that the Diffbot visual processing |
|          | engine considers less important / non-core.                                                              |
+----------+----------------------------------------------------------------------------------------------------------+
| Basic authentication                                                                                                |
+---------------------------------------------------------------------------------------------------------------------+
| To access pages that require a login/password (using basic access authentication),                                  |
| include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com   |
+---------------------------------------------------------------------------------------------------------------------+
Alternatively, you can POST the content to analyze directly to the same endpoint. Specify the Content-Type header as either text/plain or text/html.
Response ¶
DML (Diffbot Markup Language) is an XML format for encoding the extracted structural information from the page. A DML consists of a single info section and a list of items.
+-------------+--------+------------------------------------------------------+
| INFO FIELD  | TYPE   | DESCRIPTION                                          |
+-------------+--------+------------------------------------------------------+
| id          | long   | DMLID of the URL                                     |
| title       | string | Extracted title of the page                          |
| sourceURL   | url    | The URL this was extracted from                      |
| icon        | url    | A link to a small icon/favicon representing the page |
| numItems    | int    | The number of items in this DML document             |
+-------------+--------+------------------------------------------------------+
Some of the fields found in Items:
+-------------+--------------------------+--------------------------------------------------------------------+
| ITEM FIELD  | TYPE                     | DESCRIPTION                                                        |
+-------------+--------------------------+--------------------------------------------------------------------+
| id          | long                     | Unique hashcode/id of item                                         |
| title       | string                   | Title of item                                                      |
| description | string                   | innerHTML content of item                                          |
| xroot       | xpath                    | XPATH of where item was found on the page                          |
| pubDate     | timestamp                | Timestamp when item was detected on page                           |
| link        | URL                      | Extracted permalink (if applicable) of item                        |
| type        | {IMAGE,LINK,STORY,CHUNK} | Extracted type of the item, whether the item represents an image,  |
|             |                          | permalink, story (image+summary), or html chunk.                   |
| img         | URL                      | Extracted image from item                                          |
| textSummary | string                   | A plain-text summary of the item                                   |
| sp          | double<-[0,1]            | Spam score - the probability that the item is spam/ad              |
| sr          | double<-[1,5]            | Static rank - the quality score of the item on a 1 to 5 scale      |
| fresh       | double<-[0,1]            | Fresh score - the percentage of the item that has changed          |
|             |                          | compared to the previous crawl                                     |
+-------------+--------------------------+--------------------------------------------------------------------+
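The type, sp and sr fields lend themselves to simple filtering. The sketch below keeps only STORY items whose spam score and static rank pass caller-chosen thresholds. The item type is a hypothetical subset of the fields in the table, the thresholds are arbitrary, and the sample items are made up for illustration.

```go
package main

import "fmt"

// item is a hypothetical subset of the item fields from the table above.
type item struct {
	Title string
	Type  string  // IMAGE, LINK, STORY or CHUNK
	Sp    float64 // spam score in [0,1]
	Sr    float64 // static rank in [1,5]
}

// keepStories returns the STORY items whose spam score is at most maxSpam
// and whose static rank is at least minRank, in their original order.
func keepStories(items []item, maxSpam, minRank float64) []item {
	var out []item
	for _, it := range items {
		if it.Type == "STORY" && it.Sp <= maxSpam && it.Sr >= minRank {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	// Made-up sample items, not real API output.
	items := []item{
		{"Launch announcement", "STORY", 0.05, 4.0},
		{"Sponsored link", "LINK", 0.90, 1.0},
		{"Low-ranked story", "STORY", 0.10, 2.0},
	}
	for _, it := range keepStories(items, 0.5, 3.0) {
		fmt.Println(it.Title)
	}
}
```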
func (*Frontpage) ParseDML ¶
func (p *Frontpage) ParseDML(dml *FrontpageDML) error
type FrontpageDML ¶
type FrontpageDML struct {
	Id         int64  `json:"id,string"`
	TagName    string `json:"tagName"` // dml
	ChildNodes []struct {
		TagName          string `json:"tagName"`             // info/item/...
		ItemId           int64  `json:"id,string"`           // item.id, eg. "180194704"
		ItemSp           string `json:"sp"`                  // item.sp, eg. "0.000"
		ItemFresh        string `json:"fresh"`               // item.fresh, eg. "1.000"
		ItemSr           string `json:"sr"`                  // item.sr, eg. "4.000"
		ItemCluster      string `json:"cluster"`             // item.cluster, eg. "/HTML[1]/BODY[1]/DIV[4]/..."
		ItemCommentCount int64  `json:"commentCount,string"` // item.commentCount, eg. "34"
		ItemType         string `json:"type"`                // item.type, eg. "STORY"
		ItemXRoot        string `json:"xroot"`               // item.xroot, eg. "/HTML[1]/BODY[1]/DIV[4]/..."
		ChildNodes       []struct {
			TagName    string   `json:"tagName"` // title/sourceType/...
			ChildNodes []string `json:"childNodes"`
		} `json:"childNodes"`
	} `json:"childNodes"`
}
FrontpageDML (Diffbot Markup Language) is an XML format for encoding the extracted structural information from the page. A DML consists of a single info section and a list of items.
See http://diffbot.com/products/automatic/frontpage/
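FrontpageDML relies on the `,string` option of encoding/json for ids that arrive as quoted numbers, while scores such as sp stay strings. The sketch below shows both cases on a hypothetical two-field struct (dmlItem), using the example values from the struct comments: the id is decoded by the tag option, and the score is converted with strconv.ParseFloat.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// dmlItem is a hypothetical two-field subset of a FrontpageDML child node.
type dmlItem struct {
	ID int64  `json:"id,string"` // quoted integer, decoded into an int64
	Sp string `json:"sp"`        // kept as a string, e.g. "0.000"
}

// parseItem decodes one item and converts its spam score to a float64.
func parseItem(data []byte) (id int64, sp float64, err error) {
	var it dmlItem
	if err = json.Unmarshal(data, &it); err != nil {
		return 0, 0, err
	}
	sp, err = strconv.ParseFloat(it.Sp, 64)
	if err != nil {
		return 0, 0, err
	}
	return it.ID, sp, nil
}

func main() {
	id, sp, err := parseItem([]byte(`{"id":"180194704","sp":"0.000"}`))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(id, sp)
}
```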
func (*FrontpageDML) ParseJson ¶
func (p *FrontpageDML) ParseJson(data []byte) error
func (*FrontpageDML) String ¶
func (p *FrontpageDML) String() string
type Image ¶
type Image struct {
	Title       string                 `json:"title"`
	NextPage    string                 `json:"nextPage"`
	AlbumUrl    string                 `json:"albumUrl"`
	Url         string                 `json:"url"`
	ResolvedUrl string                 `json:"resolved_url"`
	Meta        map[string]interface{} `json:"meta,omitempty"`        // Returned with fields.
	QueryString string                 `json:"querystring,omitempty"` // Returned with fields.
	Links       []string               `json:"links,omitempty"`       // Returned with fields.
	Images      []struct {
		Url           string   `json:"url"`
		AnchorUrl     string   `json:"anchorUrl"`
		Mime          string   `json:"mime,omitempty"` // Returned with fields.
		Caption       string   `json:"caption"`
		AttrAlt       string   `json:"attrAlt,omitempty"`   // Returned with fields.
		AttrTitle     string   `json:"attrTitle,omitempty"` // Returned with fields.
		Date          string   `json:"date"`
		Size          int      `json:"size"`
		PixelHeight   int      `json:"pixelHeight"`
		PixelWidth    int      `json:"pixelWidth"`
		DisplayHeight int      `json:"displayHeight,omitempty"` // Returned with fields.
		DisplayWidth  int      `json:"displayWidth,omitempty"`  // Returned with fields.
		Meta          []string `json:"meta"`
		Faces         []string `json:"faces,omitempty"`  // Returned with fields.
		Ocr           string   `json:"ocr,omitempty"`    // Returned with fields.
		Colors        string   `json:"colors,omitempty"` // Returned with fields.
		XPath         string   `json:"xpath"`
	} `json:"images"`
}
Image represents the image information extracted from a page.
See http://diffbot.com/dev/docs/image/
func ParseImage ¶
ParseImage parses a web page and returns its primary image(s).
Request ¶
To use the Image API, perform an HTTP GET request on the following endpoint:
http://api.diffbot.com/v2/image
Provide the following arguments:
+----------+-------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                             |
+----------+-------------------------------------------------------------------------+
| token    | Developer token                                                         |
| url      | Product URL to process (URL encoded)                                    |
+----------+-------------------------------------------------------------------------+
| Optional arguments                                                                 |
+----------+-------------------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.                   |
|          | See the Response section below.                                         |
| timeout  | Set a value in milliseconds to terminate the response.                  |
|          | By default the Product API has no timeout.                              |
| callback | Use for jsonp requests. Needed for cross-domain ajax.                   |
+----------+-------------------------------------------------------------------------+
| Basic authentication                                                               |
+------------------------------------------------------------------------------------+
| To access pages that require a login/password (using basic access authentication), |
| include the username and password in your url parameter,                           |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                           |
+------------------------------------------------------------------------------------+
Response ¶
The Image API returns basic info about the page submitted, and its primary image(s) in the images array.
Use the fields query parameter to limit or expand which fields are returned in the JSON response. To control the fields returned for images, your desired fields should be contained within the 'images' parentheses:
http://api.diffbot.com/v2/image...&fields=images(mime,pixelWidth)
Response fields:
+---------------+------------------------------------------------------------------------+
| FIELD         | DESCRIPTION                                                            |
+---------------+------------------------------------------------------------------------+
| *             | Returns all fields available.                                          |
| title         | Title of the submitted page. Returned by default.                      |
| nextPage      | Link to next page (if within a gallery or paginated list of images).   |
|               | Returned by default.                                                   |
| albumUrl      | Link to containing album (if image is within an album).                |
|               | Returned by default.                                                   |
| url           | URL submitted. Returned by default.                                    |
| resolved_url  | Returned if the resolving URL is different from the submitted URL      |
|               | (e.g., link shortening services).                                      |
| meta          | Returns the full contents of page meta tags,                           |
|               | including sub-arrays for OpenGraph tags, Twitter Card metadata,        |
|               | schema.org microdata, and -- if available -- oEmbed metadata.          |
|               | Returned with fields.                                                  |
| querystring   | Returns the key/value pairs of the URL querystring, if present.        |
|               | Items without a value will be returned as "true".                      |
|               | Returned with fields.                                                  |
| links         | Returns all links (anchor tag href values) found on the page.          |
|               | Returned with fields.                                                  |
| images        | An array of image(s) contained on the page.                            |
+---------------+------------------------------------------------------------------------+
| For each item in the images array:                                                     |
+---------------+------------------------------------------------------------------------+
| url           | Direct link to image file. Returned by default.                        |
| anchorUrl     | If the image is wrapped by an anchor a tag, the anchor location        |
|               | as defined by the href attribute. Returned by default.                 |
| mime          | MIME type, if available, as specified by "Content-Type" of the image.  |
|               | Returned with fields.                                                  |
| caption       | The best caption for this image. Returned by default.                  |
| attrAlt       | Contents of the alt attribute, if available within the HTML IMG tag.   |
|               | Returned with fields.                                                  |
| attrTitle     | Contents of the title attribute, if available within the HTML IMG tag. |
|               | Returned with fields.                                                  |
| date          | Date of image upload or creation if available in page metadata.        |
|               | Returned by default.                                                   |
| size          | Size in bytes of image file. Returned by default.                      |
| pixelHeight   | Actual height, in pixels, of image file. Returned by default.          |
| pixelWidth    | Actual width, in pixels, of image file. Returned by default.           |
| displayHeight | Height of image as rendered on page, if different from actual          |
|               | (pixel) height. Returned with fields.                                  |
| displayWidth  | Width of image as rendered on page, if different from actual           |
|               | (pixel) width. Returned with fields.                                   |
| meta          | Comma-separated list of image-embedded metadata                        |
|               | (e.g., EXIF, XMP, ICC Profile), if available within the image file.    |
|               | Returned with fields.                                                  |
| faces         | The x, y, height, width of coordinates of human faces.                 |
|               | Null, if no faces were found. Returned with fields.                    |
| ocr           | If text is identified within the image, we will attempt to recognize   |
|               | the text string. Returned with fields.                                 |
| colors        | Returns an array of hex values of the dominant colors                  |
|               | within the image. Returned with fields.                                |
| xpath         | XPath expression identifying the node containing the image.            |
|               | Returned by default.                                                   |
+---------------+------------------------------------------------------------------------+
Example Response ¶
This is a simple response:
{
  "title": "The National Flower - Rose",
  "type": "image",
  "url": "http://www.statesymbolsusa.org/National_Symbols/National_flower.html",
  "images": [
    {
      "attrAlt": "Red rose in full bloom - click to see state flowers",
      "height": 371,
      "width": 300,
      "displayWidth": 300,
      "meta": [
        "[Jpeg] Compression Type - Baseline",
        "[Jpeg] Data Precision - 8 bits",
        "[Jpeg] Image Height - 371 pixels",
        "[Jpeg] Image Width - 300 pixels",
        "[Jpeg] Number of Components - 3",
        "[Jpeg] Component 1 - Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
        "[Jpeg] Component 2 - Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
        "[Jpeg] Component 3 - Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
        "[Jfif] Version - 1.2",
        "[Jfif] Resolution Units - none",
        "[Jfif] X Resolution - 100 dots",
        "[Jfif] Y Resolution - 100 dots",
        "[Adobe Jpeg] DCT Encode Version - 1",
        "[Adobe Jpeg] Flags 0 - 192",
        "[Adobe Jpeg] Flags 1 - 0",
        "[Adobe Jpeg] Color Transform - YCbCr"
      ],
      "url": "http://www.statesymbolsusa.org/IMAGES/rose_usda-web.jpg",
      "size": 12328,
      "displayHeight": 371,
      "xpath": "/HTML[1]/BODY[1]/DIV[1]/TABLE[3]/TBODY[1]/TR[2]/..."
    },
    {
      "attrAlt": "Yellow rose - click to see state flowers",
      "pixelHeight": 304,
      "pixelWidth": 380,
      "displayWidth": 380,
      "meta": [
        "[Jpeg] Compression Type - Baseline",
        "[Jpeg] Data Precision - 8 bits",
        "[Jpeg] Image Height - 304 pixels",
        "[Jpeg] Image Width - 380 pixels",
        "[Jpeg] Number of Components - 3",
        "[Jpeg] Component 1 - Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
        "[Jpeg] Component 2 - Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
        "[Jpeg] Component 3 - Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
        "[Jfif] Version - 1.2",
        "[Jfif] Resolution Units - none",
        "[Jfif] X Resolution - 100 dots",
        "[Jfif] Y Resolution - 100 dots",
        "[Adobe Jpeg] DCT Encode Version - 1",
        "[Adobe Jpeg] Flags 0 - 192",
        "[Adobe Jpeg] Flags 1 - 0",
        "[Adobe Jpeg] Color Transform - YCbCr"
      ],
      "url": "http://www.statesymbolsusa.org/IMAGES/rose_yellow-380.jpg",
      "size": 12142,
      "displayHeight": 304,
      "xpath": "/HTML[1]/BODY[1]/DIV[1]/TABLE[3]/TBODY[1]/TR[2]/..."
    }
  ]
}
type Options ¶
type Options struct {
	Fields                 string
	Timeout                time.Duration
	Callback               string
	FrontpageAll           string
	ClassifierMode         string
	ClassifierStats        string
	BulkNotifyEmail        string
	BulkNotifyWebHook      string
	BulkRepeat             string
	BulkMaxRounds          string
	BulkPageProcessPattern string
	CrawlMaxToCrawl        string
	CrawlMaxToProcess      string
	CrawlRestrictDomain    string
	CrawlNotifyEmail       string
	CrawlNotifyWebHook     string
	CrawlDelay             string
	CrawlRepeat            string
	CrawlOnlyProcessIfNew  string
	CrawlMaxRounds         string
	BatchMethod            string
	BatchRelativeUrl       string
	CustomHeader           http.Header
}
Options holds the optional parameters for the Diffbot client.
See http://diffbot.com/products/automatic/
func (*Options) MethodParamString ¶
MethodParamString returns the options encoded as URL query parameters for the given method.
If the Options value is not empty, the returned string begins with a '&'.
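To make the leading '&' convention concrete, here is a sketch of how such a parameter string could be built for just two options. paramString is a hypothetical stand-in written for this illustration; it is not the package's implementation, and it ignores the method argument and every option other than fields and timeout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// paramString is a hypothetical stand-in for MethodParamString that handles
// only fields and timeout. Each parameter that is set is written as
// "&name=value", so a non-empty result always begins with '&'.
// Values are written as-is, without percent-encoding.
func paramString(fields string, timeout time.Duration) string {
	var b strings.Builder
	if fields != "" {
		b.WriteString("&fields=" + fields)
	}
	if timeout > 0 {
		// The API expects the timeout in milliseconds.
		ms := int64(timeout / time.Millisecond)
		b.WriteString("&timeout=" + strconv.FormatInt(ms, 10))
	}
	return b.String()
}

func main() {
	fmt.Println(paramString("meta,querystring,images(*)", 1000*time.Millisecond))
}
```

Because every parameter carries its own '&', the result can be appended directly after the token and url parameters of a request URL.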
type Product ¶
type Product struct {
	Url         string                 `json:"url"`
	ResolvedUrl string                 `json:"resolved_url"`
	Meta        map[string]interface{} `json:"meta,omitempty"`        // Returned with fields.
	QueryString string                 `json:"querystring,omitempty"` // Returned with fields.
	Links       []string               `json:"links,omitempty"`       // Returned with fields.
	Breadcrumb  []string               `json:"breadcrumb"`
	Products    []struct {
		Title       string `json:"title"`
		Description string `json:"description"`
		Brand       string `json:"brand,omitempty"` // Returned with fields.
		Medias      []struct {
			Type    string `json:"type"`
			Link    string `json:"link"`
			Height  int    `json:"height"`
			Width   int    `json:"width"`
			Caption string `json:"caption"`
			Primary string `json:"primary"`
			XPath   string `json:"xpath"`
		} `json:"media"`
		OfferPrice     string `json:"offerPrice"`
		RegularPrice   string `json:"regularPrice"`
		SaveAmount     string `json:"saveAmount"`
		ShippingAmount string `json:"shippingAmount"`
		ProductId      string `json:"productId"`
		Upc            string `json:"upc"`
		PrefixCode     string `json:"prefixCode"`
		ProductOrigin  string `json:"productOrigin"`
		Isbn           string `json:"isbn"`
		Sku            string `json:"sku,omitempty"` // Returned with fields.
		Mpn            string `json:"mpn,omitempty"` // Returned with fields.
	} `json:"products"`
}
Product represents information about a shopping or e-commerce product.
See http://diffbot.com/dev/docs/product/
func ParseProduct ¶
ParseProduct parses a shopping or e-commerce product page and returns information on the product.
Request ¶
To use the Product API, perform an HTTP GET request on the following endpoint:
http://api.diffbot.com/v2/product
Provide the following arguments:
+----------+-------------------------------------------------------------------------+
| ARGUMENT | DESCRIPTION                                                             |
+----------+-------------------------------------------------------------------------+
| token    | Developer token                                                         |
| url      | Product URL to process (URL encoded)                                    |
+----------+-------------------------------------------------------------------------+
| Optional arguments                                                                 |
+----------+-------------------------------------------------------------------------+
| fields   | Used to control which fields are returned by the API.                   |
|          | See the Response section below.                                         |
| timeout  | Set a value in milliseconds to terminate the response.                  |
|          | By default the Product API has no timeout.                              |
| callback | Use for jsonp requests. Needed for cross-domain ajax.                   |
+----------+-------------------------------------------------------------------------+
| Basic authentication                                                               |
| To access pages that require a login/password (using basic access authentication), |
| include the username and password in your url parameter,                           |
| e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com                           |
+------------------------------------------------------------------------------------+
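The basic authentication note above can be sketched with the standard net/url package. The helper protectedProductURL is hypothetical and not part of this package; it embeds the credentials in the page URL and then URL-encodes the result as the url argument. The token is the invalid placeholder used elsewhere in this documentation.

```go
package main

import (
	"fmt"
	"net/url"
)

// productEndpoint is the endpoint given in the Request section.
const productEndpoint = "http://api.diffbot.com/v2/product"

// protectedProductURL is a hypothetical helper, not part of this package.
// It embeds the credentials in the page URL and then URL-encodes the
// result as the url argument of a Product API request.
func protectedProductURL(token, pageURL, user, password string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, password)

	q := url.Values{}
	q.Set("token", token)
	q.Set("url", u.String())
	return productEndpoint + "?" + q.Encode(), nil
}

func main() {
	reqURL, err := protectedProductURL(
		"0123456789abcdef0123456789abcdef", // invalid token, just an example
		"http://www.diffbot.com",
		"USERNAME", "PASSWORD",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reqURL)
}
```

Note that url.Values.Encode also percent-encodes the ':' and '@' characters, unlike the example in the table; the value decodes to the same URL.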
Response ¶
The Product API returns product details in the products array. Currently, extracted data is returned for a single product only. In the future the API will return information from multiple products, if multiple items are available on the same page.
Use the fields query parameter to limit or expand which fields are returned in the JSON response. For product-specific content, wrap the desired fields in parentheses after 'products':
http://api.diffbot.com/v2/product...&fields=products(offerPrice,sku)
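A fields expression of this shape can be assembled from a list of names. The helper productFields below is a hypothetical illustration and not part of this package; its result is the kind of string that could be supplied as the fields parameter.

```go
package main

import (
	"fmt"
	"strings"
)

// productFields is a hypothetical helper, not part of this package.
// It wraps the given product-level field names in the "products(...)"
// form used by the fields query parameter.
func productFields(names ...string) string {
	return "products(" + strings.Join(names, ",") + ")"
}

func main() {
	fmt.Println("fields=" + productFields("offerPrice", "sku"))
	// prints "fields=products(offerPrice,sku)"
}
```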
Response fields:
+----------------+-------------------------------------------------------------------+
| FIELD          | DESCRIPTION                                                       |
+----------------+-------------------------------------------------------------------+
| *              | Returns all fields available.                                     |
| url            | URL submitted. Returned by default.                               |
| resolved_url   | Returned if the resolving URL is different from the               |
|                | submitted URL (e.g., link shortening services).                   |
|                | Returned by default.                                              |
| meta           | Returns the full contents of page meta tags,                      |
|                | including sub-arrays for OpenGraph tags,                          |
|                | Twitter Card metadata, schema.org microdata,                      |
|                | and -- if available -- oEmbed metadata.                           |
|                | Returned with fields.                                             |
| querystring    | Returns the key/value pairs of the URL querystring, if present.   |
|                | Items without a value will be returned as "true."                 |
|                | Returned with fields.                                             |
| links          | Returns all links (anchor tag href values) found on the page.     |
|                | Returned with fields.                                             |
| breadcrumb     | If available, an array of link URLs and link text                 |
|                | from page breadcrumbs. Returned by default.                       |
+----------------+-------------------------------------------------------------------+
| For each item in the products array:                                               |
+----------------+-------------------------------------------------------------------+
| title          | Name of the product. Returned by default.                         |
| description    | Description, if available, of the product.                        |
|                | Returned by default.                                              |
| brand          | Experimental Brand, if available, of the product.                 |
|                | Returned with fields.                                             |
| media          | Array of media items (images or videos) of the product.           |
| |              | Returned by default.                                              |
| +- type        | Type of media identified (image or video).                        |
| +- link        | Direct (fully resolved) link to image or video content.           |
| +- height      | Image height, in pixels.                                          |
| +- width       | Image width, in pixels.                                           |
| +- caption     | Diffbot-determined best caption for the image.                    |
| +- primary     | Only images. Returns "True" if image is identified                |
| |              | as primary in terms of size or positioning.                       |
| +- xpath       | Full document Xpath to the media item.                            |
| offerPrice     | Identified offer or actual/'final' price of the product.          |
|                | Returned by default.                                              |
| regularPrice   | Regular or original price of the product, if available.           |
|                | Returned by default.                                              |
| saveAmount     | Discount or amount saved, if available. Returned by default.      |
| shippingAmount | Shipping price, if available. Returned by default.                |
| productId      | A Diffbot-determined unique product ID.                           |
|                | If upc, isbn, mpn or sku are identified on the page,              |
|                | productId will select from these values in the above order.       |
|                | Otherwise Diffbot will attempt to derive the best unique          |
|                | value for the product. Returned by default.                       |
| upc            | Universal Product Code (UPC/EAN), if available.                   |
|                | Returned by default.                                              |
| prefixCode     | GTIN prefix code, typically the country of origin                 |
|                | as identified by UPC/ISBN. Returned by default.                   |
| productOrigin  | If available, the two-character ISO country code where            |
|                | the product was produced. Returned by default.                    |
| isbn           | International Standard Book Number (ISBN), if available.          |
|                | Returned by default.                                              |
| sku            | Stock Keeping Unit -- store/vendor inventory                      |
|                | number -- if available. Returned with fields.                     |
| mpn            | Manufacturer's Product Number, if available.                      |
|                | Returned with fields.                                             |
+----------------+-------------------------------------------------------------------+
| The following fields are in an early beta stage:                                   |
+----------------+-------------------------------------------------------------------+
| availability   | Item's availability, either true or false. Returned by default.   |
| brand          | The item brand, if identified. Returned with fields.              |
| quantityPrices | If a product page includes discounts for quantity purchases,      |
|                | quantityPrices will return an array of quantity and price values. |
|                | Returned with fields.                                             |
+----------------+-------------------------------------------------------------------+
Example Response ¶
This is a simple response:
{
  "type": "product",
  "products": [
    {
      "title": "Before I Go To Sleep",
      "description": "Memories define us...",
      "offerPrice": "$7.99",
      "regularPrice": "$9.99",
      "saveAmount": "$2.00",
      "media": [
        {
          "height": 480,
          "width": 340,
          "link": "http://cdn.shopify.com/s/files/1/0184/6296/products/BeforeIGoToSleep_large.png?946",
          "type": "image",
          "xpath": "/HTML[@class='no-js']/BODY[@id='page-product']..."
        }
      ]
    }
  ],
  "url": "http://store.livrada.com/collections/all/products/before-i-go-to-sleep"
}
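A response like the one above can be decoded with encoding/json. The sketch below is self-contained and uses a hypothetical productResponse type that mirrors only the fields present in the example; it is not the package's Product struct, and decodeProduct is likewise an illustrative helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// productResponse is a hypothetical, reduced mirror of the Product struct
// that covers only fields present in the example response.
type productResponse struct {
	Type     string `json:"type"`
	URL      string `json:"url"`
	Products []struct {
		Title        string `json:"title"`
		OfferPrice   string `json:"offerPrice"`
		RegularPrice string `json:"regularPrice"`
		SaveAmount   string `json:"saveAmount"`
		Media        []struct {
			Type   string `json:"type"`
			Link   string `json:"link"`
			Height int    `json:"height"`
			Width  int    `json:"width"`
		} `json:"media"`
	} `json:"products"`
}

// exampleJSON is an abridged copy of the example response.
const exampleJSON = `{
  "type": "product",
  "products": [
    {
      "title": "Before I Go To Sleep",
      "offerPrice": "$7.99",
      "regularPrice": "$9.99",
      "saveAmount": "$2.00",
      "media": [
        {"height": 480, "width": 340, "type": "image"}
      ]
    }
  ],
  "url": "http://store.livrada.com/collections/all/products/before-i-go-to-sleep"
}`

// decodeProduct unmarshals a Product API response body.
func decodeProduct(data []byte) (*productResponse, error) {
	var resp productResponse
	if err := json.Unmarshal(data, &resp); err != nil {
		return nil, err
	}
	return &resp, nil
}

func main() {
	resp, err := decodeProduct([]byte(exampleJSON))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, p := range resp.Products {
		fmt.Printf("%s: %s (was %s)\n", p.Title, p.OfferPrice, p.RegularPrice)
	}
}
```

Prices arrive as formatted strings such as "$7.99", so any arithmetic on them would need a separate parsing step.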