zyte

package

v0.171.0 Published: Jun 26, 2026 License: AGPL-3.0 Imports: 16 Imported by: 0

Repository

github.com/immanent-tech/foragd

Documentation ¶

Overview ¶

Package zyte provides primitives to interact with the Zyte HTTP API.

Code generated by github.com/oapi-codegen/oapi-codegen/v2 version v2.6.0 DO NOT EDIT.

Index ¶

Constants
Variables
type Article
- func ExtractArticle(ctx context.Context, rawURL string, options ...RequestOption) (*Article, error)
- func (a *Article) GetHTML() string
- func (a *Article) GetText() string
type Author
type Breadcrumb
type Config
type ExtractFrom
- func (e ExtractFrom) Valid() bool
type ExtractOptions
type Metadata
type RemoteMedia
type Request
- func NewRequest(url string, options ...RequestOption) *Request
type RequestMethod
- func (e RequestMethod) Valid() bool
type RequestOption
- func WithBrowserHTML(value bool) RequestOption
- func WithFollowRedirects(value bool) RequestOption
- func WithRequestMethod(value RequestMethod) RequestOption
- func WithResponseBody(value bool) RequestOption
- func WithTag(key, value string) RequestOption
type Response
- func Proxy(ctx context.Context, rawURL string, options ...RequestOption) (*Response, error)
- func (r *Response) ExtractArticle() (*Article, error)
- func (r *Response) GetBrowserResponse() ([]byte, error)
- func (r *Response) GetHTMLResponse() ([]byte, error)
- func (r *Response) GetURL() (*url.URL, error)
type ResponseError
- func (e *ResponseError) Error() string
- func (e *ResponseError) HTTPStatus() int
- func (e *ResponseError) Unwrap() error
- func (e *ResponseError) WriteLog(ctx context.Context)
type StatusCode

Constants ¶

const (
	// ConfigEnvPrefix is the prefix applied to environment variables for configuring Zyte.
	ConfigEnvPrefix = "ZYTE_"
)
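
The prefix suggests that settings are read from environment variables whose names start with ZYTE_. The documentation does not list the variable names or show how Config is populated, so the following is only a sketch of prefix-based lookup. The helper prefixedEnv and any key it finds are hypothetical, not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ConfigEnvPrefix mirrors the documented constant.
const ConfigEnvPrefix = "ZYTE_"

// prefixedEnv is a hypothetical helper. It takes entries in "KEY=value" form
// and returns those whose key starts with prefix, keyed by the remainder of
// the name. Entries without "=" are skipped.
func prefixedEnv(environ []string, prefix string) map[string]string {
	out := make(map[string]string)
	for _, kv := range environ {
		key, value, ok := strings.Cut(kv, "=")
		if !ok || !strings.HasPrefix(key, prefix) {
			continue
		}
		out[strings.TrimPrefix(key, prefix)] = value
	}
	return out
}

func main() {
	settings := prefixedEnv(os.Environ(), ConfigEnvPrefix)
	// Only the count is printed, so secret values never reach the terminal.
	fmt.Printf("found %d %s* settings\n", len(settings), ConfigEnvPrefix)
}
```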

Variables ¶

var ErrNotFound = errors.New("not found")

Functions ¶

This section is empty.

Types ¶

func ExtractArticle ¶

func ExtractArticle(ctx context.Context, rawURL string, options ...RequestOption) (*Article, error)

ExtractArticle attempts to extract an article from the given URL.

func (*Article) GetHTML ¶ added in v0.148.0

func (a *Article) GetHTML() string

func (*Article) GetText ¶ added in v0.148.0

func (a *Article) GetText() string

type Author ¶

type Author struct {
	// Name is the full name of the author, e.g. "Alice".
	Name string `json:"name"`

	// NameRaw is the text from which this author name was extracted, e.g. "Alice and Bob".
	NameRaw *string `json:"nameRaw,omitempty"`
}

Author is an author attribution.

type Breadcrumb ¶

type Breadcrumb struct {
	// Name is the text of the breadcrumb, as it appears on the website.
	Name *string `json:"name,omitempty"`

	// URL is the absolute URL of the breadcrumb.
	URL *string `json:"url,omitempty" validate:"omitempty,url"`
}

Breadcrumb is a breadcrumb found on the object.

type ExtractFrom ¶

type ExtractFrom string

ExtractFrom is the extraction source.

- httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.
- browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.
- browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.

If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.

const (
	ExtractFromBrowserHtml      ExtractFrom = "browserHtml"
	ExtractFromBrowserHtmlOnly  ExtractFrom = "browserHtmlOnly"
	ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)

Defines values for ExtractFrom.

func (ExtractFrom) Valid ¶

func (e ExtractFrom) Valid() bool

Valid indicates whether the value is a known member of the ExtractFrom enum.
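
As an illustration of what such a membership check usually looks like, here is a minimal self-contained sketch. It redeclares the type and constants locally, and the switch-based body is an assumption about the generated code, not a copy of it.

```go
package main

import "fmt"

// ExtractFrom mirrors the documented string enum.
type ExtractFrom string

const (
	ExtractFromBrowserHtml      ExtractFrom = "browserHtml"
	ExtractFromBrowserHtmlOnly  ExtractFrom = "browserHtmlOnly"
	ExtractFromHttpResponseBody ExtractFrom = "httpResponseBody"
)

// Valid reports whether e is one of the three documented values.
// Assumption: the real method is a plain switch like this one.
func (e ExtractFrom) Valid() bool {
	switch e {
	case ExtractFromBrowserHtml, ExtractFromBrowserHtmlOnly, ExtractFromHttpResponseBody:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(ExtractFromBrowserHtmlOnly.Valid())
	// "screenshot" is a made-up value used only to show the negative case.
	fmt.Println(ExtractFrom("screenshot").Valid())
}
```

Checking Valid before building a request catches typos in values that arrive as plain strings, for example from configuration.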

type ExtractOptions ¶

type ExtractOptions struct {
	// ExtractFrom is the extraction source.
	// httpResponseBody extracts from httpResponseBody. It is usually faster and cheaper.
	// browserHtmlOnly extracts from browserHtml. It typically improves quality over httpResponseBody on JavaScript-heavy web pages.
	// browserHtml extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over browserHtmlOnly, but is not as robust in case of rendering issues.
	// If not specified, browserHtml is currently used by default for AI extraction, while httpResponseBody is used by default for non-AI extraction. In the future, the default value may depend on the target website.
	ExtractFrom *ExtractFrom `json:"extractFrom" validate:"omitempty,oneof=httpResponseBody browserHtml browserHtmlOnly"`
}

ExtractOptions are options for controlling article extraction.

type Metadata ¶

type Metadata struct {
	// DateDownloaded is the timestamp at which the data was downloaded, in UTC, formatted as ISO 8601: "YYYY-MM-DDThh:mm:ssZ".
	DateDownloaded string `json:"dateDownloaded" validate:"required"`

	// Probability is the probability that the extracted item is of the requested data type.
	// It is close to 0 when the page does not contain the requested data type; for example,
	// when single product extraction is requested with "product: true" but the page does not
	// contain a product. It is close to 1 when an item of the requested type can be extracted.
	// The recommended threshold is 0.5, but extracted data is returned even if the probability is very low.
	Probability float32 `json:"probability" validate:"required,gte=0,lte=1"`
}

Metadata is extracted item metadata for single-item data types.
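
The two fields lend themselves to a short sketch: applying the recommended 0.5 threshold and parsing the timestamp. The struct below is a local stand-in without tags, and the helper names isLikelyMatch and downloadedAt are hypothetical. The documented timestamp layout matches Go's time.RFC3339.

```go
package main

import (
	"fmt"
	"time"
)

// Metadata is a local stand-in for the documented struct.
type Metadata struct {
	DateDownloaded string
	Probability    float32
}

// probabilityThreshold is the threshold recommended in the field comment.
const probabilityThreshold = 0.5

// isLikelyMatch (hypothetical) reports whether the item meets the threshold.
func isLikelyMatch(m Metadata) bool {
	return m.Probability >= probabilityThreshold
}

// downloadedAt (hypothetical) parses the "YYYY-MM-DDThh:mm:ssZ" timestamp.
func downloadedAt(m Metadata) (time.Time, error) {
	return time.Parse(time.RFC3339, m.DateDownloaded)
}

func main() {
	// The values below are made up for illustration.
	m := Metadata{DateDownloaded: "2026-06-26T10:30:00Z", Probability: 0.93}

	ts, err := downloadedAt(m)
	if err != nil {
		fmt.Println("bad timestamp:", err)
		return
	}
	fmt.Println(isLikelyMatch(m), ts.UTC().Format(time.RFC3339))
}
```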

type RemoteMedia ¶

type RemoteMedia struct {
	// URL is a URL to the remote media.
	URL string `json:"url" validate:"required,url"`
}

RemoteMedia represents a piece of remote media (i.e., image, video, audio).

type Request ¶

type Request struct {
	// Article, when true, requests article data in the article response field.
	Article *bool `json:"article,omitempty,omitzero"`

	// ArticleList, when true, requests article list data in the articleList response field.
	ArticleList    *bool           `json:"articleList,omitzero"`
	ArticleOptions *ExtractOptions `json:"articleOptions,omitzero"`

	// BrowserHtml, when true, requests the browser HTML in the browserHtml response field.
	BrowserHtml *bool `json:"browserHtml,omitzero" validate:"omitempty,excluded_unless=httpResponseBody false"`

	// FollowRedirect indicates whether to follow HTTP redirection or not.
	FollowRedirect *bool `json:"followRedirect,omitzero"`

	// HttpRequestMethod is the request method.
	HttpRequestMethod *RequestMethod `json:"httpRequestMethod,omitempty" validate:"omitempty,oneof=GET POST PUT DELETE OPTIONS TRACE PATCH HEAD"`

	// HttpResponseHeaders, when true, requests the HTTP response headers in the httpResponseHeaders response field.
	HttpResponseHeaders *bool `json:"httpResponseHeaders,omitzero"`

	// HttpResposeBody, when true, requests the HTTP response body in the httpResponseBody response field.
	HttpResposeBody *bool `json:"httpResponseBody,omitzero" validate:"omitempty,excluded_unless=browserHtml false"`

	// PageContent, when true, requests page content data in the pageContent response field.
	PageContent        *bool           `json:"pageContent,omitzero"`
	PageContentOptions *ExtractOptions `json:"pageContentOptions,omitzero"`

	// Tags are arbitrary key-value pairs assigned to the request that can be used for filtering in the Stats API.
	Tags map[string]string `json:"tags,omitempty,omitzero"`

	// URL is the URL to process with the API.
	URL string `json:"url" validate:"required,url"`
}

Request is a request to the Zyte API.

func NewRequest ¶

func NewRequest(url string, options ...RequestOption) *Request

NewRequest creates a new Zyte API request with the given options.

type RequestMethod ¶

type RequestMethod string

RequestMethod is the request method.

const (
	RequestMethodDELETE  RequestMethod = "DELETE"
	RequestMethodGET     RequestMethod = "GET"
	RequestMethodHEAD    RequestMethod = "HEAD"
	RequestMethodOPTIONS RequestMethod = "OPTIONS"
	RequestMethodPATCH   RequestMethod = "PATCH"
	RequestMethodPOST    RequestMethod = "POST"
	RequestMethodPUT     RequestMethod = "PUT"
	RequestMethodTRACE   RequestMethod = "TRACE"
)

Defines values for RequestMethod.

func (RequestMethod) Valid ¶

func (e RequestMethod) Valid() bool

Valid indicates whether the value is a known member of the RequestMethod enum.

type RequestOption ¶

type RequestOption func(*Request)

func WithBrowserHTML ¶ added in v0.148.0

func WithBrowserHTML(value bool) RequestOption

WithBrowserHTML option specifies whether to use a browser to get the page HTML.

func WithFollowRedirects ¶

func WithFollowRedirects(value bool) RequestOption

WithFollowRedirects option specifies whether any redirects should be followed.

func WithRequestMethod ¶

func WithRequestMethod(value RequestMethod) RequestOption

WithRequestMethod option specifies which request method to use. If not specified, this defaults to GET.

func WithResponseBody ¶

func WithResponseBody(value bool) RequestOption

WithResponseBody option specifies whether to include the raw response body in the response.

func WithTag ¶ added in v0.148.0

func WithTag(key, value string) RequestOption

WithTag adds the given tag to the request.
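
The With... functions and NewRequest follow Go's functional options pattern: each option is a closure that mutates the Request, and the constructor applies them in order. The following self-contained sketch shows the mechanics on a reduced stand-in for Request. The function bodies are assumptions that mirror the documented signatures, not the package's actual code.

```go
package main

import "fmt"

// Request is a reduced stand-in with only the fields used below.
type Request struct {
	URL            string
	BrowserHtml    *bool
	FollowRedirect *bool
	Tags           map[string]string
}

// RequestOption mirrors the documented type.
type RequestOption func(*Request)

// WithBrowserHTML sets the browserHtml flag (assumed implementation).
func WithBrowserHTML(value bool) RequestOption {
	return func(r *Request) { r.BrowserHtml = &value }
}

// WithFollowRedirects sets the followRedirect flag (assumed implementation).
func WithFollowRedirects(value bool) RequestOption {
	return func(r *Request) { r.FollowRedirect = &value }
}

// WithTag adds one tag, creating the map on first use (assumed implementation).
func WithTag(key, value string) RequestOption {
	return func(r *Request) {
		if r.Tags == nil {
			r.Tags = make(map[string]string)
		}
		r.Tags[key] = value
	}
}

// NewRequest applies each option, in order, to a request for url.
func NewRequest(url string, options ...RequestOption) *Request {
	r := &Request{URL: url}
	for _, opt := range options {
		opt(r)
	}
	return r
}

func main() {
	// The URL and tag values are placeholders.
	req := NewRequest("https://example.com/post",
		WithBrowserHTML(true),
		WithTag("job", "nightly"),
	)
	fmt.Println(req.URL, *req.BrowserHtml, req.Tags["job"])
}
```

Because unset options leave their pointer fields nil, the request can distinguish "not specified" from an explicit false.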

type Response ¶

type Response struct {
	// Article is an extracted article.
	Article *Article `json:"article,omitempty"`

	// BrowserHtml is the HTML representation of the Document Object Model (DOM) of a webpage after it has been rendered in a browser.
	BrowserHtml *string `json:"browserHtml"`

	// HttpResponseBody is the Base64-encoded HTTP response body. To get this response field, set the httpResponseBody request field to true.
	HttpResponseBody *string     `json:"httpResponseBody" validate:"omitempty,base64"`
	StatusCode       *StatusCode `json:"statusCode"`

	// URL is the URL to process with the API
	URL string `json:"url" validate:"required,url"`
}

Response is a response from the Zyte API.

func Proxy ¶

func Proxy(ctx context.Context, rawURL string, options ...RequestOption) (*Response, error)

Proxy will reverse proxy the given URL through Zyte.

func (*Response) ExtractArticle ¶ added in v0.148.0

func (r *Response) ExtractArticle() (*Article, error)

ExtractArticle extracts an Article from the Response using the readability package. It extracts the content from either the httpResponseBody or browserHtml field of the response.

func (*Response) GetBrowserResponse ¶ added in v0.148.0

func (r *Response) GetBrowserResponse() ([]byte, error)

GetBrowserResponse retrieves the response body from a browser request (created by the browserHtml Request option).

func (*Response) GetHTMLResponse ¶ added in v0.148.0

func (r *Response) GetHTMLResponse() ([]byte, error)

GetHTMLResponse retrieves the response body from an HTML request (created by the httpResponseBody Request option).

func (*Response) GetURL ¶ added in v0.148.0

func (r *Response) GetURL() (*url.URL, error)

type ResponseError ¶

type ResponseError struct {
	Detail string     `json:"detail" validate:"required"`
	Status StatusCode `json:"status" validate:"required"`
	Title  string     `json:"title" validate:"required"`
	Type   string     `json:"type" validate:"required"`
}

ResponseError is an error response from the Zyte API.

func (*ResponseError) Error ¶

func (e *ResponseError) Error() string

func (*ResponseError) HTTPStatus ¶

func (e *ResponseError) HTTPStatus() int

HTTPStatus returns the status code of the API error.

func (*ResponseError) Unwrap ¶

func (e *ResponseError) Unwrap() error

func (*ResponseError) WriteLog ¶

func (e *ResponseError) WriteLog(ctx context.Context)

WriteLog writes the ResponseError to the log at the appropriate level.

type StatusCode ¶

type StatusCode = int

StatusCode is the HTTP status code retrieved from the target page. If redirection is followed, this is the status code of the response after redirection.
