vision

package
v1.27.0
Warning

This package is not in the latest version of its module.

Published: Mar 27, 2026 | License: Apache-2.0 | Imports: 10 | Imported by: 0

Documentation

Overview

Package vision provides vision-related neural network layers.


Stability: beta


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CLIPEncoder

type CLIPEncoder[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

CLIPEncoder implements a CLIP ViT (Vision Transformer) encoder.

Architecture:

PatchEmbed -> [CLS] + PatchEmbeddings + PosEmbed -> [LN -> SelfAttn -> Add -> LN -> FFN(QuickGELU) -> Add] x N -> LN

Input shape: [batch, channels, height, width] (pixel values normalized to [-1, 1])

Output shape: [batch, numPatches+1, hiddenDim]
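
The FFN in each block uses QuickGELU. As a rough illustration, QuickGELU is commonly defined as x * sigmoid(1.702 * x); the sketch below assumes that definition. The function name quickGELU is hypothetical and is not part of this package, and the package's own implementation may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// quickGELU is a hypothetical helper, not part of package vision.
// It assumes the common definition x * sigmoid(1.702 * x),
// written here as x / (1 + exp(-1.702 * x)).
func quickGELU(x float64) float64 {
	return x / (1 + math.Exp(-1.702*x))
}

func main() {
	// Large negative inputs are pushed toward 0, large positive
	// inputs pass through almost unchanged.
	for _, x := range []float64{-4, -1, 0, 1, 4} {
		fmt.Printf("quickGELU(%v) = %.4f\n", x, quickGELU(x))
	}
}
```

The sigmoid form avoids the error function used by exact GELU, which is presumably why it appears in the block diagram above as a distinct activation.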

func NewCLIPEncoder

func NewCLIPEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg CLIPEncoderConfig,
) (*CLIPEncoder[T], error)

NewCLIPEncoder creates a new CLIP ViT encoder.

func (*CLIPEncoder[T]) Attributes

func (e *CLIPEncoder[T]) Attributes() map[string]interface{}

func (*CLIPEncoder[T]) Backward

func (*CLIPEncoder[T]) Forward

func (e *CLIPEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)

Forward runs the CLIP vision encoder. Input: [batch, channels, height, width] pixel values. Output: [batch, numPatches+1, hiddenDim] vision embeddings.
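
Forward expects channel-first pixel data. The sketch below shows one possible way to turn an interleaved 8-bit image into a channel-first float32 buffer scaled to [-1, 1]. The helper toNCHW and the x/127.5 - 1 scaling are assumptions made for illustration; this package does not document a preprocessing helper, and a given checkpoint may expect different normalization.

```go
package main

import "fmt"

// toNCHW is a hypothetical helper, not part of package vision.
// It converts one interleaved 8-bit image (height, width, channel order)
// into a flat channel-first slice with values scaled to [-1, 1].
func toNCHW(pixels []uint8, height, width, channels int) ([]float32, error) {
	if len(pixels) != height*width*channels {
		return nil, fmt.Errorf("got %d bytes, want %d", len(pixels), height*width*channels)
	}
	out := make([]float32, len(pixels))
	for y := 0; y < height; y++ {
		for x := 0; x < width; x++ {
			for c := 0; c < channels; c++ {
				src := (y*width+x)*channels + c // interleaved index
				dst := (c*height+y)*width + x   // channel-first index
				// Assumed scaling: 0 maps to -1 and 255 maps to 1.
				out[dst] = float32(pixels[src])/127.5 - 1
			}
		}
	}
	return out, nil
}

func main() {
	// One 1x2 RGB image: a black pixel followed by a white pixel.
	data, err := toNCHW([]uint8{0, 0, 0, 255, 255, 255}, 1, 2, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(data)
}
```

For a batch, the per-image buffers would be concatenated in order to form the [batch, channels, height, width] input.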

func (*CLIPEncoder[T]) OpType

func (e *CLIPEncoder[T]) OpType() string

func (*CLIPEncoder[T]) OutputShape

func (e *CLIPEncoder[T]) OutputShape() []int

func (*CLIPEncoder[T]) Parameters

func (e *CLIPEncoder[T]) Parameters() []*graph.Parameter[T]

Parameters returns all trainable parameters from the CLIP encoder.

type CLIPEncoderConfig

type CLIPEncoderConfig struct {
	ImageSize   int // Input image size (square, e.g. 224).
	PatchSize   int // Patch size for patch embedding (e.g. 14).
	HiddenDim   int // Hidden dimension throughout the encoder.
	NumHeads    int // Number of attention heads per transformer block.
	NumLayers   int // Number of transformer encoder blocks.
	NumChannels int // Number of input channels (default 3 for RGB).
}

CLIPEncoderConfig holds configuration for a CLIP vision encoder.
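
The fields constrain one another in the usual ViT way: the image is cut into whole patches, and the hidden dimension is split evenly across heads. The sketch below checks those relationships on a local stand-in struct. The type clipConfig and the function validate are hypothetical; whether NewCLIPEncoder performs the same checks is not documented here. The example values are illustrative (they match the commonly published ViT-L/14 layout).

```go
package main

import (
	"errors"
	"fmt"
)

// clipConfig is a local stand-in mirroring the documented fields of
// CLIPEncoderConfig so that this sketch compiles without the module.
type clipConfig struct {
	ImageSize   int
	PatchSize   int
	HiddenDim   int
	NumHeads    int
	NumLayers   int
	NumChannels int
}

// validate is a hypothetical consistency check, not part of package vision.
func validate(c clipConfig) error {
	switch {
	case c.ImageSize <= 0 || c.PatchSize <= 0 || c.HiddenDim <= 0 ||
		c.NumHeads <= 0 || c.NumLayers <= 0:
		return errors.New("ImageSize, PatchSize, HiddenDim, NumHeads and NumLayers must be positive")
	case c.NumChannels < 0:
		return errors.New("NumChannels must not be negative")
	case c.ImageSize%c.PatchSize != 0:
		return fmt.Errorf("ImageSize %d is not a multiple of PatchSize %d", c.ImageSize, c.PatchSize)
	case c.HiddenDim%c.NumHeads != 0:
		return fmt.Errorf("HiddenDim %d is not divisible by NumHeads %d", c.HiddenDim, c.NumHeads)
	}
	return nil
}

func main() {
	cfg := clipConfig{
		ImageSize: 224, PatchSize: 14,
		HiddenDim: 1024, NumHeads: 16,
		NumLayers: 24, NumChannels: 3,
	}
	if err := validate(cfg); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config looks consistent")
}
```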

func (CLIPEncoderConfig) NumPatches

func (c CLIPEncoderConfig) NumPatches() int

NumPatches returns the number of patches (excluding class token).
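
For a square image split into non-overlapping square patches, the patch count is normally (ImageSize / PatchSize) squared. The sketch below assumes that formula; numPatches is a hypothetical stand-alone function, and the method's actual implementation is not shown in this documentation.

```go
package main

import "fmt"

// numPatches is a hypothetical stand-in for CLIPEncoderConfig.NumPatches.
// It assumes a square image divided into non-overlapping square patches.
func numPatches(imageSize, patchSize int) int {
	side := imageSize / patchSize // patches along one edge
	return side * side
}

func main() {
	patches := numPatches(224, 14) // 16 patches per edge
	seqLen := patches + 1          // plus the class token
	fmt.Println(patches, seqLen)   // prints "256 257"
}
```

Under this assumption, a 224x224 input with 14x14 patches yields an encoder output of shape [batch, 257, hiddenDim].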
