Documentation ¶
Overview ¶
Package zyte provides primitives to interact with the Zyte HTTP API.
Code generated by github.com/oapi-codegen/oapi-codegen/v2 version v2.6.0 DO NOT EDIT.
Index ¶
Constants ¶
const (
// ConfigEnvPrefix is the prefix applied to environment variables for configuring Zyte.
ConfigEnvPrefix = "ZYTE_"
)
Variables ¶
var ErrNotFound = errors.New("not found")
Functions ¶
This section is empty.
Types ¶
type Article ¶
type Article struct {
// ArticleBody is the clean text of the article, including sub-headings, with newline separators.
ArticleBody *string `json:"articleBody"`
// ArticleBodyHtml is the Simplified and standardized HTML of the article body, including sub-headings, image captions and embedded content (videos, tweets, etc.).
ArticleBodyHtml *string `json:"articleBodyHtml"`
// Audios is all audio of the item.
Audios []RemoteMedia `json:"audios,omitempty"`
// Authors is a list of authors of the article.
Authors []Author `json:"authors,omitempty"`
// Breadcrumbs is a list of breadcrumbs (a specific navigation element) with optional name and url.
Breadcrumbs []Breadcrumb `json:"breadcrumbs,omitempty"`
// CanonicalURL is the canonical URL of the article, if available.
CanonicalURL *string `json:"canonicalURL,omitempty" validate:"omitempty,url"`
// DateModified is the date when the article was most recently modified. ISO-formatted with 'T' separator, may contain a timezone.
DateModified *string `json:"dateModified"`
// DateModifiedRaw is the same date as "dateModified", but before parsing/normalization, i.e. as it appears on the website.
DateModifiedRaw *string `json:"dateModifiedRaw"`
// DatePublished is the publication date. ISO-formatted with 'T' separator, may contain a timezone. If the actual publication date is not found, "dateModified" value is taken.
DatePublished *string `json:"datePublished"`
// DatePublishedRaw is the same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.
DatePublishedRaw *string `json:"datePublishedRaw"`
// Description is a short summary of the article. It can be either human-provided (if available), or auto-generated.
Description *string `json:"description,omitempty"`
// Headline is an article headline or title.
Headline *string `json:"headline,omitempty"`
// Images is all images of the item (may include the main image).
Images []RemoteMedia `json:"images,omitempty"`
// InLanguage is the language of the article, as an ISO 639-1 language code. Example: "en". Sometimes the article language differs from the overall web page language; to get the detected web page languages, see "webPageInfo".
InLanguage *string `json:"inLanguage"`
MainImage *RemoteMedia `json:"mainImage"`
// Metadata is extracted item metadata for single-item data types.
Metadata Metadata `json:"metadata"`
// URL is the URL of a page where this article was extracted.
URL string `json:"url" validate:"required,url"`
// Videos is all video of the item.
Videos []RemoteMedia `json:"videos,omitempty"`
}
Article is an extracted article.
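The date fields above are described as ISO-formatted with a 'T' separator and an optional timezone. A minimal sketch of parsing such a value with the standard library follows. parseArticleDate is a hypothetical helper, not part of this package, and the fallback layout without a timezone is an assumption about what the API may return.

```go
package main

import (
	"fmt"
	"time"
)

// parseArticleDate is a hypothetical helper (not part of the zyte package).
// It accepts an ISO-formatted date with or without a timezone.
func parseArticleDate(s string) (time.Time, error) {
	// Try the form that carries a timezone offset or "Z" first.
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		return t, nil
	}
	// Fall back to the same layout without a timezone; Go treats it as UTC.
	return time.Parse("2006-01-02T15:04:05", s)
}

func main() {
	for _, raw := range []string{"2024-05-01T10:30:00+02:00", "2024-05-01T10:30:00"} {
		t, err := parseArticleDate(raw)
		if err != nil {
			fmt.Println("unparseable date:", err)
			continue
		}
		fmt.Println(t.UTC().Format(time.RFC3339))
	}
}
```

Because DatePublished and DateModified are pointers, a caller would check them for nil before parsing.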
func ExtractArticle ¶
ExtractArticle attempts to extract an article from the given URL.
type Author ¶
type Author struct {
// Name is the full name of the author, e.g. "Alice".
Name string `json:"name"`
// NameRaw is the text from which this author name was extracted, e.g. "Alice and Bob".
NameRaw *string `json:"nameRaw,omitempty"`
}
Author is an author attribution.
type Breadcrumb ¶
type Breadcrumb struct {
// Name is the text of the breadcrumb, as it appears on the website.
Name *string `json:"name,omitempty"`
// URL is the absolute URL of the breadcrumb.
URL *string `json:"url,omitempty" validate:"omitempty,url"`
}
Breadcrumb is a breadcrumb found on the object.
type Config ¶
type Config struct {
// APIKey is the API key used to authorize requests to the Zyte API.
APIKey string `koanf:"apikey" validate:"required"`
}
Config is the configuration for Zyte.
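The koanf tag on APIKey and the ConfigEnvPrefix constant suggest that the key is read from a prefixed environment variable. A minimal standard-library sketch follows. It assumes the variable is named ZYTE_APIKEY (the prefix plus the upper-cased koanf key), which the documentation does not confirm, and loadConfig is a hypothetical helper rather than part of this package.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// ConfigEnvPrefix mirrors the documented constant.
const ConfigEnvPrefix = "ZYTE_"

// Config mirrors the documented struct (struct tags omitted in this sketch).
type Config struct {
	APIKey string
}

// loadConfig is a hypothetical helper. The variable name ZYTE_APIKEY is an
// assumption derived from the prefix and the "apikey" koanf key.
func loadConfig(getenv func(string) string) (Config, error) {
	name := ConfigEnvPrefix + strings.ToUpper("apikey")
	key := strings.TrimSpace(getenv(name))
	if key == "" {
		// Mirrors the validate:"required" tag on APIKey.
		return Config{}, errors.New(name + " is required")
	}
	return Config{APIKey: key}, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("API key loaded, length:", len(cfg.APIKey))
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the process environment.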
type ExtractFrom ¶
type ExtractFrom string
ExtractFrom is the extraction source.
httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.
browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.
browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.
If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.
const (
ExtractFromBrowserHtml ExtractFrom = "browserHtml"
ExtractFromBrowserHtmlOnly ExtractFrom = "browserHtmlOnly"
ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)
Defines values for ExtractFrom.
func (ExtractFrom) Valid ¶
func (e ExtractFrom) Valid() bool
Valid indicates whether the value is a known member of the ExtractFrom enum.
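As an illustration of how such an enum check typically behaves, here is a self-contained sketch. The type and constants mirror the documented ones, but the body of Valid is an assumed implementation, not the generated code itself.

```go
package main

import "fmt"

// ExtractFrom and its constants mirror the documented declarations.
type ExtractFrom string

const (
	ExtractFromBrowserHtml      ExtractFrom = "browserHtml"
	ExtractFromBrowserHtmlOnly  ExtractFrom = "browserHtmlOnly"
	ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)

// Valid is an assumed implementation: it reports whether e is one of the
// three known values. Matching is case-sensitive.
func (e ExtractFrom) Valid() bool {
	switch e {
	case ExtractFromBrowserHtml, ExtractFromBrowserHtmlOnly, ExtractFromHttpResponseBody:
		return true
	default:
		return false
	}
}

func main() {
	for _, v := range []ExtractFrom{"httpResponseBody", "browserHTML", ""} {
		fmt.Printf("%q valid: %t\n", v, v.Valid())
	}
}
```

Checking Valid before sending a request catches misspelled sources such as "browserHTML" early.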
type ExtractOptions ¶
type ExtractOptions struct {
// ExtractFrom is the extraction source.
// httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.
// browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.
// browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.
// If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.
ExtractFrom *ExtractFrom `json:"extractFrom" validate:"omitempty,oneof=httpResponseBody browserHtml browserHtmlOnly"`
}
ExtractOptions are options for controlling article extraction.
type Metadata ¶
type Metadata struct {
// DateDownloaded is the timestamp at which the data was downloaded. Timezone: UTC. Format: ISO 8601, "YYYY-MM-DDThh:mm:ssZ".
DateDownloaded string `json:"dateDownloaded" validate:"required"`
// Probability is the probability that the extracted item is of the requested data type. It is closer to 0 when the page does not contain the requested data type. For example, when single product extraction is requested with "product: true" but the page does not contain a product, the probability is close to 0. If an item of the requested type can be extracted from the page, the probability is closer to 1. The recommended probability threshold is 0.5, but extracted data is returned even if the probability is very low.
Probability float32 `json:"probability" validate:"required,gte=0,lte=1"`
}
Metadata is extracted item metadata for single-item data types.
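Since extracted data is returned even at very low probability, callers may want to apply the recommended 0.5 threshold themselves. The sketch below shows one way to do that. likelyRequestedType and recommendedThreshold are hypothetical names introduced here, and treating a probability of exactly 0.5 as acceptable is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Metadata mirrors the documented struct.
type Metadata struct {
	DateDownloaded string  `json:"dateDownloaded"`
	Probability    float32 `json:"probability"`
}

// recommendedThreshold is the threshold recommended in the field comment.
const recommendedThreshold = 0.5

// likelyRequestedType is a hypothetical helper: it reports whether the
// probability meets the recommended threshold (inclusive, by assumption).
func likelyRequestedType(m Metadata) bool {
	return m.Probability >= recommendedThreshold
}

func main() {
	m := Metadata{DateDownloaded: "2024-05-01T10:30:00Z", Probability: 0.12}

	// The documented format "YYYY-MM-DDThh:mm:ssZ" parses as RFC 3339.
	downloaded, err := time.Parse(time.RFC3339, m.DateDownloaded)
	if err != nil {
		fmt.Println("bad dateDownloaded:", err)
		return
	}
	fmt.Println("downloaded at:", downloaded.UTC())
	fmt.Println("likely an item of the requested type:", likelyRequestedType(m))
}
```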
type RemoteMedia ¶
type RemoteMedia struct {
// URL is a URL to the remote media.
URL string `json:"url" validate:"required,url"`
}
RemoteMedia represents a piece of remote media (i.e., image, video, audio).
type Request ¶
type Request struct {
// Article, when set to true, returns article data in the article response field.
Article *bool `json:"article,omitempty,omitzero"`
// ArticleList, when set to true, returns article list data in the articleList response field.
ArticleList *bool `json:"articleList,omitzero"`
ArticleOptions *ExtractOptions `json:"articleOptions,omitzero"`
// BrowserHtml, when set to true, returns the browser HTML in the browserHtml response field.
BrowserHtml *bool `json:"browserHtml,omitzero" validate:"omitempty,excluded_unless=httpResponseBody false"`
// FollowRedirect indicates whether to follow HTTP redirection or not.
FollowRedirect *bool `json:"followRedirect,omitzero"`
// HttpRequestMethod is the request method.
HttpRequestMethod *RequestMethod `json:"httpRequestMethod,omitempty" validate:"omitempty,oneof=GET POST PUT DELETE OPTIONS TRACE PATCH HEAD"`
// HttpResponseHeaders, when set to true, returns the HTTP response headers in the httpResponseHeaders response field.
HttpResponseHeaders *bool `json:"httpResponseHeaders,omitzero"`
// HttpResposeBody, when set to true, returns the HTTP response body in the httpResponseBody response field.
HttpResposeBody *bool `json:"httpResponseBody,omitzero" validate:"omitempty,excluded_unless=browserHtml false"`
// PageContent, when set to true, returns page content data in the pageContent response field.
PageContent *bool `json:"pageContent,omitzero"`
PageContentOptions *ExtractOptions `json:"pageContentOptions,omitzero"`
// Tags assigns arbitrary key-value pairs to the request that you can use for filtering in the Stats API.
Tags map[string]string `json:"tags,omitempty,omitzero"`
// URL is the URL to process with the API
URL string `json:"url" validate:"required,url"`
}
Request is a request to the Zyte API.
func NewRequest ¶
func NewRequest(url string, options ...RequestOption) *Request
NewRequest creates a new Zyte API request with the given options.
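NewRequest follows the functional options pattern: each RequestOption is a function that mutates the Request under construction. The sketch below reimplements a trimmed-down version locally to show the mechanics. The bodies of NewRequest, WithBrowserHTML and WithTag are assumptions, the Request struct is reduced to three fields, and encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a reduced copy of the documented struct, for illustration only.
type Request struct {
	URL         string            `json:"url"`
	BrowserHtml *bool             `json:"browserHtml,omitempty"`
	Tags        map[string]string `json:"tags,omitempty"`
}

// RequestOption mirrors the documented type.
type RequestOption func(*Request)

// NewRequest is an assumed implementation: it applies options in order.
func NewRequest(url string, options ...RequestOption) *Request {
	r := &Request{URL: url}
	for _, opt := range options {
		opt(r)
	}
	return r
}

// WithBrowserHTML is an assumed implementation of the documented option.
func WithBrowserHTML(value bool) RequestOption {
	return func(r *Request) { r.BrowserHtml = &value }
}

// WithTag is an assumed implementation of the documented option.
func WithTag(key, value string) RequestOption {
	return func(r *Request) {
		if r.Tags == nil {
			r.Tags = map[string]string{}
		}
		r.Tags[key] = value
	}
}

// encode is a hypothetical helper that renders the request body as JSON.
func encode(r *Request) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	req := NewRequest("https://example.com/post", WithBrowserHTML(true), WithTag("job", "demo"))
	body, err := encode(req)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(body)
}
```

Unset pointer fields stay nil and are omitted from the JSON, so only the options a caller passes end up in the request body.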
type RequestMethod ¶
type RequestMethod string
RequestMethod is the request method.
const (
RequestMethodDELETE RequestMethod = "DELETE"
RequestMethodGET RequestMethod = "GET"
RequestMethodHEAD RequestMethod = "HEAD"
RequestMethodOPTIONS RequestMethod = "OPTIONS"
RequestMethodPATCH RequestMethod = "PATCH"
RequestMethodPOST RequestMethod = "POST"
RequestMethodPUT RequestMethod = "PUT"
RequestMethodTRACE RequestMethod = "TRACE"
)
Defines values for RequestMethod.
func (RequestMethod) Valid ¶
func (e RequestMethod) Valid() bool
Valid indicates whether the value is a known member of the RequestMethod enum.
type RequestOption ¶
type RequestOption func(*Request)
func WithBrowserHTML ¶ added in v0.148.0
func WithBrowserHTML(value bool) RequestOption
WithBrowserHTML option specifies whether to use a browser to get the page HTML.
func WithFollowRedirects ¶
func WithFollowRedirects(value bool) RequestOption
WithFollowRedirects option specifies whether any redirects should be followed.
func WithRequestMethod ¶
func WithRequestMethod(value RequestMethod) RequestOption
WithRequestMethod option specifies which request method to use. If not specified, this defaults to GET.
func WithResponseBody ¶
func WithResponseBody(value bool) RequestOption
WithResponseBody option specifies whether to include the raw response body in the response.
func WithTag ¶ added in v0.148.0
func WithTag(key, value string) RequestOption
WithTag adds the given tag to the request.
type Response ¶
type Response struct {
// Article is an extracted article.
Article *Article `json:"article,omitempty"`
// BrowserHtml is the HTML representation of the Document Object Model (DOM) of a webpage after it has been rendered in a browser.
BrowserHtml *string `json:"browserHtml"`
// HttpResponseBody is the Base64-encoded HTTP response body. To get this response field, set the httpResponseBody request field to true.
HttpResponseBody *string `json:"httpResponseBody" validate:"omitempty,base64"`
StatusCode *StatusCode `json:"statusCode"`
// URL is the URL to process with the API
URL string `json:"url" validate:"required,url"`
}
Response is a response from the Zyte API.
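Because HttpResponseBody is Base64-encoded and optional, a caller has to check for nil and decode it before use. A minimal sketch follows. decodeBody is a hypothetical helper, the Response struct is reduced to the one relevant field, and standard (padded) Base64 is assumed.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// Response is a reduced copy of the documented struct, for illustration only.
type Response struct {
	HttpResponseBody *string `json:"httpResponseBody"`
}

// decodeBody is a hypothetical helper: it returns the raw bytes of the HTTP
// response body, assuming standard Base64 encoding with padding.
func decodeBody(r *Response) ([]byte, error) {
	if r == nil || r.HttpResponseBody == nil {
		return nil, errors.New("httpResponseBody is not present in the response")
	}
	return base64.StdEncoding.DecodeString(*r.HttpResponseBody)
}

func main() {
	encoded := base64.StdEncoding.EncodeToString([]byte("<html><body>hi</body></html>"))
	raw, err := decodeBody(&Response{HttpResponseBody: &encoded})
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(string(raw))
}
```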
func (*Response) ExtractArticle ¶ added in v0.148.0
ExtractArticle extracts an Article from the Response using the readability package. It extracts the content from either the httpResponseBody or browserHtml field of the response.
func (*Response) GetBrowserResponse ¶ added in v0.148.0
GetBrowserResponse retrieves the response body from a browser request (created by the browserHtml Request option).
func (*Response) GetHTMLResponse ¶ added in v0.148.0
GetHTMLResponse retrieves the response body from an HTML request (created by the httpResponseBody Request option).
type ResponseError ¶
type ResponseError struct {
Detail string `json:"detail" validate:"required"`
Status StatusCode `json:"status" validate:"required"`
Title string `json:"title" validate:"required"`
Type string `json:"type" validate:"required"`
}
ResponseError is an error response from the Zyte API.
func (*ResponseError) Error ¶
func (e *ResponseError) Error() string
func (*ResponseError) HTTPStatus ¶
func (e *ResponseError) HTTPStatus() int
HTTPStatus returns the status code of the API error.
func (*ResponseError) Unwrap ¶
func (e *ResponseError) Unwrap() error
func (*ResponseError) WriteLog ¶
func (e *ResponseError) WriteLog(ctx context.Context)
WriteLog writes the ResponseError to the log at the appropriate level.
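Since ResponseError implements error through a pointer receiver, callers can recover it from a wrapped error with errors.As. The sketch below uses a local copy of the type. The Error message format is an assumption, and statusOf and wrap are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponseError is a local copy of the documented struct (StatusCode = int).
type ResponseError struct {
	Detail string
	Status int
	Title  string
	Type   string
}

// Error uses an assumed message format; the real one is not documented.
func (e *ResponseError) Error() string {
	return fmt.Sprintf("%d %s: %s", e.Status, e.Title, e.Detail)
}

// HTTPStatus returns the status code of the API error, as documented.
func (e *ResponseError) HTTPStatus() int { return e.Status }

// wrap is a hypothetical helper that adds context while keeping the chain.
func wrap(err error) error {
	return fmt.Errorf("fetch article: %w", err)
}

// statusOf is a hypothetical helper: it reports the API status code if a
// *ResponseError is found anywhere in the error chain.
func statusOf(err error) (int, bool) {
	var apiErr *ResponseError
	if errors.As(err, &apiErr) {
		return apiErr.HTTPStatus(), true
	}
	return 0, false
}

func main() {
	err := wrap(&ResponseError{Status: 429, Title: "Too Many Requests", Detail: "rate limit reached"})
	if code, ok := statusOf(err); ok {
		fmt.Println("API error with status", code)
	}
}
```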
type StatusCode ¶
type StatusCode = int
StatusCode is the HTTP status code retrieved from the target page. If redirection is followed, this is the status code of the response after redirection.