zyte

package
Version: v0.171.0
Published: Jun 26, 2026 License: AGPL-3.0 Imports: 16 Imported by: 0

Documentation

Overview

Package zyte provides primitives to interact with the openapi HTTP API.

Code generated by github.com/oapi-codegen/oapi-codegen/v2 version v2.6.0 DO NOT EDIT.

Constants

const (
	// ConfigEnvPrefix is the prefix applied to environment variables for configuring Zyte.
	ConfigEnvPrefix = "ZYTE_"
)

Variables

var ErrNotFound = errors.New("not found")

Functions

This section is empty.

Types

type Article

type Article struct {
	// ArticleBody is the clean text of the article, including sub-headings, with newline separators.
	ArticleBody *string `json:"articleBody"`

	// ArticleBodyHtml is the simplified and standardized HTML of the article body, including sub-headings, image captions and embedded content (videos, tweets, etc.).
	ArticleBodyHtml *string `json:"articleBodyHtml"`

	// Audios is all audio of the item.
	Audios []RemoteMedia `json:"audios,omitempty"`

	// Authors is a list of authors of the article.
	Authors []Author `json:"authors,omitempty"`

	// Breadcrumbs is a list of breadcrumbs (a specific navigation element) with optional name and url.
	Breadcrumbs []Breadcrumb `json:"breadcrumbs,omitempty"`

	// CanonicalURL is the canonical URL of the article, if available.
	CanonicalURL *string `json:"canonicalURL,omitempty" validate:"omitempty,url"`

	// DateModified is the date when the article was most recently modified. ISO-formatted with 'T' separator, may contain a timezone.
	DateModified *string `json:"dateModified"`

	// DateModifiedRaw is the same date as "dateModified", but before parsing/normalization, i.e. as it appears on the website.
	DateModifiedRaw *string `json:"dateModifiedRaw"`

	// DatePublished is the publication date. ISO-formatted with 'T' separator, may contain a timezone. If the actual publication date is not found, "dateModified" value is taken.
	DatePublished *string `json:"datePublished"`

	// DatePublishedRaw is the same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.
	DatePublishedRaw *string `json:"datePublishedRaw"`

	// Description is a short summary of the article. It can be either human-provided (if available), or auto-generated.
	Description *string `json:"description,omitempty"`

	// Headline is an article headline or title.
	Headline *string `json:"headline,omitempty"`

	// Images is all images of the item (may include the main image).
	Images []RemoteMedia `json:"images,omitempty"`

	// InLanguage is the language of the article, as an ISO 639-1 language code. Example: "en". The article language is sometimes not the same as the overall web page language; to get the detected web page languages, see "webPageInfo".
	InLanguage *string      `json:"inLanguage"`
	MainImage  *RemoteMedia `json:"mainImage"`

	// Metadata is extracted item metadata for single-item data types.
	Metadata Metadata `json:"metadata"`

	// URL is the URL of a page where this article was extracted.
	URL string `json:"url" validate:"required,url"`

	// Videos is all video of the item.
	Videos []RemoteMedia `json:"videos,omitempty"`
}

Article is an extracted article.

func ExtractArticle

func ExtractArticle(ctx context.Context, rawURL string, options ...RequestOption) (*Article, error)

ExtractArticle attempts to extract an article from the given URL.

func (*Article) GetHTML added in v0.148.0

func (a *Article) GetHTML() string

func (*Article) GetText added in v0.148.0

func (a *Article) GetText() string
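
The date fields on Article are described as ISO-formatted with a 'T' separator and an optional timezone. The following is a minimal sketch of how a caller might parse such a value with the standard library. The helper name parseArticleDate and the choice to treat zone-less values as UTC are assumptions made for illustration; they are not part of this package.

```go
package main

import (
	"fmt"
	"time"
)

// parseArticleDate is a hypothetical helper, not part of the zyte package.
// It tries RFC 3339 first (timezone present) and then a zone-less layout.
func parseArticleDate(s string) (time.Time, error) {
	layouts := []string{
		time.RFC3339,          // e.g. 2024-05-01T10:30:00+02:00
		"2006-01-02T15:04:05", // e.g. 2024-05-01T10:30:00 (interpreted as UTC)
	}
	var lastErr error
	for _, layout := range layouts {
		parsed, err := time.Parse(layout, s)
		if err == nil {
			return parsed, nil
		}
		lastErr = err
	}
	return time.Time{}, lastErr
}

func main() {
	// Sample values only; real values come from DatePublished or DateModified,
	// which are *string and must be checked for nil before dereferencing.
	for _, s := range []string{"2024-05-01T10:30:00+02:00", "2024-05-01T10:30:00"} {
		parsed, err := parseArticleDate(s)
		if err != nil {
			fmt.Println("unparseable date:", err)
			continue
		}
		fmt.Println(parsed.UTC().Format(time.RFC3339))
	}
}
```

When normalization fails, the matching Raw field (for example DatePublishedRaw) still carries the text as it appeared on the website.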

type Author

type Author struct {
	// Name is the full name of the author, e.g. "Alice".
	Name string `json:"name"`

	// NameRaw is the text from which this author name was extracted, e.g. "Alice and Bob".
	NameRaw *string `json:"nameRaw,omitempty"`
}

Author is an author attribution.

type Breadcrumb

type Breadcrumb struct {
	// Name is the text of the breadcrumb, as it appears on the website.
	Name *string `json:"name,omitempty"`

	// URL is the absolute URL of the breadcrumb.
	URL *string `json:"url,omitempty" validate:"omitempty,url"`
}

Breadcrumb is a breadcrumb found on the object.

type Config

type Config struct {
	// APIKey is the API key used to authorize requests with the Zyte API.
	APIKey string `koanf:"apikey" validate:"required"`
}

Config is the configuration for Zyte.
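
As a rough illustration of how ConfigEnvPrefix and the required APIKey field might fit together, the sketch below reads the key from the environment using only the standard library. The variable name ZYTE_APIKEY is an assumption derived from the "ZYTE_" prefix and the "apikey" tag; the loadConfig helper is hypothetical, and the package's own loading mechanism may differ.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ConfigEnvPrefix mirrors the constant documented above.
const ConfigEnvPrefix = "ZYTE_"

// Config is a local stand-in for the Config type documented above.
type Config struct {
	APIKey string
}

// loadConfig is a hypothetical loader. It assumes the key lives in
// ConfigEnvPrefix + "APIKEY"; that variable name is a guess, not a
// documented contract. The lookup function is injected so the logic
// can be exercised without touching the real environment.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	key, ok := lookup(ConfigEnvPrefix + "APIKEY")
	if !ok || key == "" {
		return Config{}, errors.New("zyte: API key is required")
	}
	return Config{APIKey: key}, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("API key loaded, length:", len(cfg.APIKey))
}
```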

type ExtractFrom

type ExtractFrom string

ExtractFrom is the extraction source.

httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.

browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.

browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.

If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.

const (
	ExtractFromBrowserHtml      ExtractFrom = "browserHtml"
	ExtractFromBrowserHtmlOnly  ExtractFrom = "browserHtmlOnly"
	ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)

Defines values for ExtractFrom.

func (ExtractFrom) Valid

func (e ExtractFrom) Valid() bool

Valid indicates whether the value is a known member of the ExtractFrom enum.
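
A membership check of this kind is commonly written as a switch over the declared constants. The sketch below redeclares the type locally so that it compiles on its own; the body of Valid is an assumption about how such a check could look, not a copy of the generated code.

```go
package main

import "fmt"

// ExtractFrom mirrors the string enum documented above.
type ExtractFrom string

const (
	ExtractFromBrowserHtml      ExtractFrom = "browserHtml"
	ExtractFromBrowserHtmlOnly  ExtractFrom = "browserHtmlOnly"
	ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)

// Valid reports whether e is one of the three declared values.
// This implementation is illustrative only.
func (e ExtractFrom) Valid() bool {
	switch e {
	case ExtractFromBrowserHtml, ExtractFromBrowserHtmlOnly, ExtractFromHttpResponseBody:
		return true
	default:
		return false
	}
}

func main() {
	// "screenshot" and the empty string are deliberately unknown values.
	for _, v := range []ExtractFrom{"browserHtml", "screenshot", ""} {
		fmt.Printf("%q valid: %t\n", string(v), v.Valid())
	}
}
```

Checking Valid before sending a request lets a caller reject a mistyped source locally rather than waiting for the API to refuse it.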

type ExtractOptions

type ExtractOptions struct {
	// ExtractFrom is the extraction source.
	// httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.
	// browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.
	// browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.
	// If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.
	ExtractFrom *ExtractFrom `json:"extractFrom" validate:"omitempty,oneof=httpResponseBody browserHtml browserHtmlOnly"`
}

ExtractOptions are options for controlling article extraction.

type Metadata

type Metadata struct {
	// DateDownloaded is the timestamp at which the data was downloaded. Timezone: UTC. Format: ISO 8601, "YYYY-MM-DDThh:mm:ssZ".
	DateDownloaded string `json:"dateDownloaded" validate:"required"`

	// Probability is the probability that the extracted item is of the requested data type. It is closer to 0 when the page does not contain the requested data type. For example, when single product extraction is requested with "product: true" but the page does not contain a product, the probability would be close to 0. If an item of the requested type can be extracted from the page, the probability is closer to 1. The recommended probability threshold is 0.5, but extracted data is returned even if the probability is very low.
	Probability float32 `json:"probability" validate:"required,gte=0,lte=1"`
}

Metadata is extracted item metadata for single-item data types.
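
Because extracted data is returned even when the probability is very low, callers may want to apply the recommended 0.5 threshold themselves. The sketch below shows one way to do that. The likelyMatch helper, the recommendedThreshold constant, and the choice of an inclusive comparison are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata is a local stand-in for the Metadata type documented above.
type Metadata struct {
	DateDownloaded string  `json:"dateDownloaded"`
	Probability    float32 `json:"probability"`
}

// recommendedThreshold is the 0.5 value mentioned in the field documentation.
const recommendedThreshold = 0.5

// likelyMatch is a hypothetical helper. Treating exactly 0.5 as a match
// is an assumption; the documentation does not say which side is inclusive.
func likelyMatch(m Metadata) bool {
	return m.Probability >= recommendedThreshold
}

func main() {
	// Sample payload invented for this example.
	payload := []byte(`{"dateDownloaded":"2024-01-02T03:04:05Z","probability":0.93}`)

	var m Metadata
	if err := json.Unmarshal(payload, &m); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("likely an item of the requested type:", likelyMatch(m))
}
```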

type RemoteMedia

type RemoteMedia struct {
	// URL is a URL to the remote media.
	URL string `json:"url" validate:"required,url"`
}

RemoteMedia represents a piece of remote media (i.e., image, video, audio).

type Request

type Request struct {
	// Article, when true, requests article data in the article response field.
	Article *bool `json:"article,omitempty,omitzero"`

	// ArticleList, when true, requests article list data in the articleList response field.
	ArticleList    *bool           `json:"articleList,omitzero"`
	ArticleOptions *ExtractOptions `json:"articleOptions,omitzero"`

	// BrowserHtml, when true, requests the browser HTML in the browserHtml response field.
	BrowserHtml *bool `json:"browserHtml,omitzero" validate:"omitempty,excluded_unless=httpResponseBody false"`

	// FollowRedirect indicates whether to follow HTTP redirection or not.
	FollowRedirect *bool `json:"followRedirect,omitzero"`

	// HttpRequestMethod is the request method.
	HttpRequestMethod *RequestMethod `json:"httpRequestMethod,omitempty" validate:"omitempty,oneof=GET POST PUT DELETE OPTIONS TRACE PATCH HEAD"`

	// HttpResponseHeaders, when true, requests the HTTP response headers in the httpResponseHeaders response field.
	HttpResponseHeaders *bool `json:"httpResponseHeaders,omitzero"`

	// HttpResposeBody, when true, requests the HTTP response body in the httpResponseBody response field.
	HttpResposeBody *bool `json:"httpResponseBody,omitzero" validate:"omitempty,excluded_unless=browserHtml false"`

	// PageContent, when true, requests page content data in the pageContent response field.
	PageContent        *bool           `json:"pageContent,omitzero"`
	PageContentOptions *ExtractOptions `json:"pageContentOptions,omitzero"`

	// Tags assigns arbitrary key-value pairs to the request that you can use for filtering in the Stats API.
	Tags map[string]string `json:"tags,omitempty,omitzero"`

	// URL is the URL to process with the API
	URL string `json:"url" validate:"required,url"`
}

Request is a request to the Zyte API

func NewRequest

func NewRequest(url string, options ...RequestOption) *Request

NewRequest creates a new Zyte API request with the given options.

type RequestMethod

type RequestMethod string

RequestMethod is the request method.

const (
	RequestMethodDELETE  RequestMethod = "DELETE"
	RequestMethodGET     RequestMethod = "GET"
	RequestMethodHEAD    RequestMethod = "HEAD"
	RequestMethodOPTIONS RequestMethod = "OPTIONS"
	RequestMethodPATCH   RequestMethod = "PATCH"
	RequestMethodPOST    RequestMethod = "POST"
	RequestMethodPUT     RequestMethod = "PUT"
	RequestMethodTRACE   RequestMethod = "TRACE"
)

Defines values for RequestMethod.

func (RequestMethod) Valid

func (e RequestMethod) Valid() bool

Valid indicates whether the value is a known member of the RequestMethod enum.

type RequestOption

type RequestOption func(*Request)

func WithBrowserHTML added in v0.148.0

func WithBrowserHTML(value bool) RequestOption

WithBrowserHTML option specifies whether to use a browser to get the page HTML.

func WithFollowRedirects

func WithFollowRedirects(value bool) RequestOption

WithFollowRedirects option specifies whether any redirects should be followed.

func WithRequestMethod

func WithRequestMethod(value RequestMethod) RequestOption

WithRequestMethod option specifies which request method to use. If not specified, this defaults to GET.

func WithResponseBody

func WithResponseBody(value bool) RequestOption

WithResponseBody option specifies whether to include the raw response body in the response.

func WithTag added in v0.148.0

func WithTag(key, value string) RequestOption

WithTag adds the given tag to the request.
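
RequestOption follows the functional options pattern: each option is a function that mutates a Request, and NewRequest applies them in order. The sketch below reproduces the pattern with a trimmed-down local Request so that it runs on its own. The option bodies are assumptions about how such options could be written, not the package's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a trimmed-down stand-in for the Request type documented above.
type Request struct {
	FollowRedirect *bool             `json:"followRedirect,omitempty"`
	Tags           map[string]string `json:"tags,omitempty"`
	URL            string            `json:"url"`
}

// RequestOption mutates a Request while it is being built.
type RequestOption func(*Request)

// WithFollowRedirects sets the followRedirect field (illustrative body).
func WithFollowRedirects(value bool) RequestOption {
	return func(r *Request) { r.FollowRedirect = &value }
}

// WithTag adds one key-value pair, allocating the map on first use
// (illustrative body).
func WithTag(key, value string) RequestOption {
	return func(r *Request) {
		if r.Tags == nil {
			r.Tags = make(map[string]string)
		}
		r.Tags[key] = value
	}
}

// NewRequest applies the options in the order given.
func NewRequest(url string, options ...RequestOption) *Request {
	r := &Request{URL: url}
	for _, opt := range options {
		opt(r)
	}
	return r
}

func main() {
	req := NewRequest("https://example.com/post",
		WithFollowRedirects(true),
		WithTag("job", "demo"),
	)
	body, err := json.Marshal(req)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(body))
}
```

Options left unset stay nil and are omitted from the encoded request, which leaves the API free to apply its own defaults.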

type Response

type Response struct {
	// Article is an extracted article.
	Article *Article `json:"article,omitempty"`

	// BrowserHtml is the HTML representation of the Document Object Model (DOM) of a webpage after it has been rendered in a browser.
	BrowserHtml *string `json:"browserHtml"`

	// HttpResponseBody is the Base64-encoded HTTP response body. To get this response field, set the httpResponseBody request field to true.
	HttpResponseBody *string     `json:"httpResponseBody" validate:"omitempty,base64"`
	StatusCode       *StatusCode `json:"statusCode"`

	// URL is the URL to process with the API
	URL string `json:"url" validate:"required,url"`
}

Response is a response from the Zyte API.

func Proxy

func Proxy(ctx context.Context, rawURL string, options ...RequestOption) (*Response, error)

Proxy will reverse proxy the given URL through Zyte.

func (*Response) ExtractArticle added in v0.148.0

func (r *Response) ExtractArticle() (*Article, error)

ExtractArticle extracts an Article from the Response using the readability package. It extracts the content from either the httpResponseBody or browserHtml field of the response.

func (*Response) GetBrowserResponse added in v0.148.0

func (r *Response) GetBrowserResponse() ([]byte, error)

GetBrowserResponse retrieves the response body from a browser request (created by the browserHtml Request option).

func (*Response) GetHTMLResponse added in v0.148.0

func (r *Response) GetHTMLResponse() ([]byte, error)

GetHTMLResponse retrieves the response body from an HTML request (created by the httpResponseBody Request option).
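
Since the httpResponseBody field is Base64-encoded and optional, retrieving the raw bytes involves a nil check followed by a decode. The sketch below shows that step in isolation with a local Response stand-in. The decodeBody helper and the errNoBody error are hypothetical names, and the package's own method may behave differently.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// Response is a trimmed-down stand-in for the Response type documented above.
type Response struct {
	HttpResponseBody *string `json:"httpResponseBody"`
}

// errNoBody is a hypothetical sentinel for a response without a body.
var errNoBody = errors.New("httpResponseBody not present in response")

// decodeBody is a hypothetical helper that returns the decoded bytes of the
// Base64-encoded body, or errNoBody when the field was not returned.
func decodeBody(r *Response) ([]byte, error) {
	if r == nil || r.HttpResponseBody == nil {
		return nil, errNoBody
	}
	return base64.StdEncoding.DecodeString(*r.HttpResponseBody)
}

func main() {
	// Sample body invented for this example.
	encoded := base64.StdEncoding.EncodeToString([]byte("<html><body>hi</body></html>"))
	resp := &Response{HttpResponseBody: &encoded}

	body, err := decodeBody(resp)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(string(body))
}
```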

func (*Response) GetURL added in v0.148.0

func (r *Response) GetURL() (*url.URL, error)

type ResponseError

type ResponseError struct {
	Detail string     `json:"detail" validate:"required"`
	Status StatusCode `json:"status" validate:"required"`
	Title  string     `json:"title" validate:"required"`
	Type   string     `json:"type" validate:"required"`
}

ResponseError is an error response from the Zyte API.

func (*ResponseError) Error

func (e *ResponseError) Error() string

func (*ResponseError) HTTPStatus

func (e *ResponseError) HTTPStatus() int

HTTPStatus returns the status code of the API error.

func (*ResponseError) Unwrap

func (e *ResponseError) Unwrap() error

func (*ResponseError) WriteLog

func (e *ResponseError) WriteLog(ctx context.Context)

WriteLog writes the ResponseError to the log at the appropriate level.
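
Because ResponseError implements the error interface, callers can recover it from a wrapped error chain with errors.As and branch on its status. The sketch below redeclares a minimal ResponseError locally to show the idea. The Error format, the wrap and statusOf helpers, and the sample field values are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponseError is a local stand-in for the type documented above.
type ResponseError struct {
	Detail string `json:"detail"`
	Status int    `json:"status"`
	Title  string `json:"title"`
	Type   string `json:"type"`
}

// Error formats the error; the exact format here is an assumption.
func (e *ResponseError) Error() string {
	return fmt.Sprintf("%s (%d): %s", e.Title, e.Status, e.Detail)
}

// HTTPStatus returns the status code of the API error.
func (e *ResponseError) HTTPStatus() int { return e.Status }

// wrap is a hypothetical helper that adds calling context to an error.
func wrap(err error) error {
	return fmt.Errorf("proxy request: %w", err)
}

// statusOf is a hypothetical helper that extracts the API status code
// from anywhere in an error chain.
func statusOf(err error) (int, bool) {
	var apiErr *ResponseError
	if errors.As(err, &apiErr) {
		return apiErr.HTTPStatus(), true
	}
	return 0, false
}

func main() {
	// Sample values invented for this example.
	err := wrap(&ResponseError{Title: "Too Many Requests", Status: 429, Detail: "slow down", Type: "example"})

	if status, ok := statusOf(err); ok && status == 429 {
		fmt.Println("rate limited, retry later:", err)
		return
	}
	fmt.Println("unexpected error:", err)
}
```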

type StatusCode

type StatusCode = int

StatusCode is the HTTP status code retrieved from the target page. If redirection is followed, this is the status code of the response after redirection.
