Documentation ¶
Overview ¶
Package vision provides vision-related neural network layers.
Stability: beta
Index ¶
- type CLIPEncoder
- func (e *CLIPEncoder[T]) Attributes() map[string]interface{}
- func (e *CLIPEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], ...) ([]*tensor.TensorNumeric[T], error)
- func (e *CLIPEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (e *CLIPEncoder[T]) OpType() string
- func (e *CLIPEncoder[T]) OutputShape() []int
- func (e *CLIPEncoder[T]) Parameters() []*graph.Parameter[T]
- type CLIPEncoderConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CLIPEncoder ¶
CLIPEncoder implements a CLIP ViT (Vision Transformer) encoder.
Architecture:
PatchEmbed -> [CLS] + PatchEmbeddings + PosEmbed -> [LN -> SelfAttn -> Add -> LN -> FFN(QuickGELU) -> Add] x N -> LN
Input shape: [batch, channels, height, width] (pixel values normalized to [-1, 1])
Output shape: [batch, numPatches+1, hiddenDim]
func NewCLIPEncoder ¶
func NewCLIPEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg CLIPEncoderConfig,
) (*CLIPEncoder[T], error)
NewCLIPEncoder creates a new CLIP ViT encoder.
func (*CLIPEncoder[T]) Attributes ¶
func (e *CLIPEncoder[T]) Attributes() map[string]interface{}
func (*CLIPEncoder[T]) Backward ¶
func (e *CLIPEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
func (*CLIPEncoder[T]) Forward ¶
func (e *CLIPEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward runs the CLIP vision encoder.
Input: [batch, channels, height, width] pixel values.
Output: [batch, numPatches+1, hiddenDim] vision embeddings.
func (*CLIPEncoder[T]) OpType ¶
func (e *CLIPEncoder[T]) OpType() string
func (*CLIPEncoder[T]) OutputShape ¶
func (e *CLIPEncoder[T]) OutputShape() []int
func (*CLIPEncoder[T]) Parameters ¶
func (e *CLIPEncoder[T]) Parameters() []*graph.Parameter[T]
Parameters returns all trainable parameters from the CLIP encoder.
type CLIPEncoderConfig ¶
type CLIPEncoderConfig struct {
ImageSize int // Input image size (square, e.g. 224).
PatchSize int // Patch size for patch embedding (e.g. 14).
HiddenDim int // Hidden dimension throughout the encoder.
NumHeads int // Number of attention heads per transformer block.
NumLayers int // Number of transformer encoder blocks.
NumChannels int // Number of input channels (default 3 for RGB).
}
CLIPEncoderConfig holds configuration for a CLIP vision encoder.
func (CLIPEncoderConfig) NumPatches ¶
func (c CLIPEncoderConfig) NumPatches() int
NumPatches returns the number of patches (excluding class token).
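A minimal local replica of this calculation; the struct is copied from the fields documented above, and the method body is an assumption inferred from the standard ViT tiling, not the package source:

```go
package main

import "fmt"

// CLIPEncoderConfig mirrors the documented fields, for illustration only.
type CLIPEncoderConfig struct {
	ImageSize, PatchSize, HiddenDim, NumHeads, NumLayers, NumChannels int
}

// NumPatches returns the number of patches (excluding the class token),
// assuming standard ViT tiling: (ImageSize / PatchSize) squared.
func (c CLIPEncoderConfig) NumPatches() int {
	n := c.ImageSize / c.PatchSize
	return n * n
}

func main() {
	cfg := CLIPEncoderConfig{ImageSize: 224, PatchSize: 14}
	fmt.Println(cfg.NumPatches()) // (224/14)^2 = 256
}
```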