Documentation ¶
Overview ¶
Package vision provides vision-related neural network layers.
Stability: beta
Index ¶
- type CLIPEncoder
- func (e *CLIPEncoder[T]) Attributes() map[string]interface{}
- func (e *CLIPEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], ...) ([]*tensor.TensorNumeric[T], error)
- func (e *CLIPEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (e *CLIPEncoder[T]) OpType() string
- func (e *CLIPEncoder[T]) OutputShape() []int
- func (e *CLIPEncoder[T]) Parameters() []*graph.Parameter[T]
- type CLIPEncoderConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CLIPEncoder ¶
CLIPEncoder implements a CLIP ViT (Vision Transformer) encoder.
Architecture:
PatchEmbed -> [CLS] + PatchEmbeddings + PosEmbed -> [LN -> SelfAttn -> Add -> LN -> FFN(QuickGELU) -> Add] x N -> LN
Input shape: [batch, channels, height, width] (pixel values normalized to [-1, 1])
Output shape: [batch, numPatches+1, hiddenDim]
func NewCLIPEncoder ¶
func NewCLIPEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg CLIPEncoderConfig,
) (*CLIPEncoder[T], error)
NewCLIPEncoder creates a new CLIP ViT encoder.
func (*CLIPEncoder[T]) Attributes ¶
func (e *CLIPEncoder[T]) Attributes() map[string]interface{}
func (*CLIPEncoder[T]) Backward ¶
func (e *CLIPEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
func (*CLIPEncoder[T]) Forward ¶
func (e *CLIPEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward runs the CLIP vision encoder.
Input: [batch, channels, height, width] pixel values.
Output: [batch, numPatches+1, hiddenDim] vision embeddings.
func (*CLIPEncoder[T]) OpType ¶
func (e *CLIPEncoder[T]) OpType() string
func (*CLIPEncoder[T]) OutputShape ¶
func (e *CLIPEncoder[T]) OutputShape() []int
func (*CLIPEncoder[T]) Parameters ¶
func (e *CLIPEncoder[T]) Parameters() []*graph.Parameter[T]
Parameters returns all trainable parameters from the CLIP encoder.
type CLIPEncoderConfig ¶
type CLIPEncoderConfig struct {
ImageSize int // Input image size (square, e.g. 224).
PatchSize int // Patch size for patch embedding (e.g. 14).
HiddenDim int // Hidden dimension throughout the encoder.
NumHeads int // Number of attention heads per transformer block.
NumLayers int // Number of transformer encoder blocks.
NumChannels int // Number of input channels (default 3 for RGB).
}
CLIPEncoderConfig holds configuration for a CLIP vision encoder.
func (CLIPEncoderConfig) NumPatches ¶
func (c CLIPEncoderConfig) NumPatches() int
NumPatches returns the number of patches (excluding class token).
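A minimal local replica of this calculation; the struct is copied from the fields documented above, and the method body is an assumption inferred from the standard ViT tiling, not the package source:

```go
package main

import "fmt"

// CLIPEncoderConfig mirrors the documented fields, for illustration only.
type CLIPEncoderConfig struct {
	ImageSize, PatchSize, HiddenDim, NumHeads, NumLayers, NumChannels int
}

// NumPatches returns the number of patches (excluding the class token),
// assuming standard ViT tiling: (ImageSize / PatchSize) squared.
func (c CLIPEncoderConfig) NumPatches() int {
	n := c.ImageSize / c.PatchSize
	return n * n
}

func main() {
	cfg := CLIPEncoderConfig{ImageSize: 224, PatchSize: 14}
	fmt.Println(cfg.NumPatches()) // (224/14)^2 = 256
}
```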