multimodal

package
v1.31.0

This package is not in the latest version of its module.
Published: Mar 28, 2026 License: Apache-2.0 Imports: 17 Imported by: 0

Documentation

Overview

Package multimodal provides vision, audio, and multimodal inference support: image preprocessing and embedding merge for vision-language model inference, audio preprocessing for audio-language model inference, and GGUF metadata loading for vision and multimodal models.

Stability: alpha

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HannWindow

func HannWindow(size int) []float32

HannWindow returns Hann window coefficients of the given size.

	w[n] = 0.5 * (1 - cos(2*pi*n / (size-1)))

func MelFilterbank

func MelFilterbank(numMels, fftSize, sampleRate int, fMin, fMax float32) [][]float32

MelFilterbank computes a bank of triangular mel-scale filters. Returns a matrix of shape [numMels, fftSize/2+1] where each row is a triangular filter spanning from its lower to upper mel-band edge.

func NormalizeAudio

func NormalizeAudio(samples []float32) []float32

NormalizeAudio normalizes audio samples to the range [-1, 1] by dividing by the maximum absolute value. Returns a new slice without modifying the input.
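
The documented behavior can be sketched as follows. This is a standalone illustration; returning all-zero input unchanged is an assumption, since the doc does not define that case.

```go
package main

import "fmt"

// normalizeAudio mirrors the documented behavior: scale so the peak
// absolute value is 1, returning a new slice without modifying the input.
func normalizeAudio(samples []float32) []float32 {
	var peak float32
	for _, s := range samples {
		if s < 0 {
			s = -s
		}
		if s > peak {
			peak = s
		}
	}
	out := make([]float32, len(samples))
	if peak == 0 {
		copy(out, samples) // all-zero input returned unchanged (assumption)
		return out
	}
	for i, s := range samples {
		out[i] = s / peak
	}
	return out
}

func main() {
	fmt.Println(normalizeAudio([]float32{0.5, -2, 1})) // peak 2 -> [0.25 -1 0.5]
}
```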

func NumAudioTokens

func NumAudioTokens(tokenIDs []int, audioTokenID int) int

NumAudioTokens counts how many entries in tokenIDs equal audioTokenID.

func NumImageTokens

func NumImageTokens(tokenIDs []int, imageTokenID int) int

NumImageTokens counts how many entries in tokenIDs equal imageTokenID.
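
Both counters (NumAudioTokens and NumImageTokens) reduce to a simple scan, sketched here as a standalone illustration; countToken is not part of the package.

```go
package main

import "fmt"

// countToken counts entries in tokenIDs equal to the placeholder token ID,
// which is what NumAudioTokens and NumImageTokens document.
func countToken(tokenIDs []int, tokenID int) int {
	n := 0
	for _, id := range tokenIDs {
		if id == tokenID {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countToken([]int{5, 7, 7, 2}, 7)) // 2
}
```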

func NumPatches

func NumPatches(cfg PatchConfig) int

NumPatches returns the number of patches the image is divided into.

func PatchDim

func PatchDim(cfg PatchConfig) int

PatchDim returns the dimensionality of each patch embedding (PatchSize*PatchSize*3).
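
The arithmetic behind these two functions can be illustrated standalone. PatchDim is documented as PatchSize*PatchSize*3; the (ImageSize/PatchSize)^2 patch count is an assumption based on the usual ViT-style square grid.

```go
package main

import "fmt"

// numPatches assumes a ViT-style grid: (ImageSize/PatchSize)^2 patches.
func numPatches(imageSize, patchSize int) int {
	side := imageSize / patchSize
	return side * side
}

// patchDim follows the documented formula PatchSize*PatchSize*3.
func patchDim(patchSize int) int {
	return patchSize * patchSize * 3
}

func main() {
	// A 224x224 image with 14x14 patches.
	fmt.Println(numPatches(224, 14), patchDim(14)) // 256 588
}
```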

func PreprocessImage

func PreprocessImage(data []byte, format ImageFormat, cfg PatchConfig) ([]float32, error)

PreprocessImage decodes an image from raw bytes, resizes it to cfg.ImageSize x cfg.ImageSize using bilinear interpolation, normalizes pixel values per channel, and returns the result as flattened patch embeddings of shape [num_patches, patch_dim].
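
The per-channel normalization step presumably uses NormMean and NormStd from PatchConfig in the conventional (value - mean) / std form; this sketch illustrates that convention. normalizePixel is an illustrative helper, and the exact formula inside PreprocessImage is an assumption.

```go
package main

import "fmt"

// normalizePixel applies per-channel normalization (value - mean) / std,
// the conventional use of NormMean/NormStd; whether PreprocessImage does
// exactly this is an assumption.
func normalizePixel(v float32, c int, mean, std [3]float32) float32 {
	return (v - mean[c]) / std[c]
}

func main() {
	mean := [3]float32{0.5, 0.5, 0.5}
	std := [3]float32{0.5, 0.5, 0.5}
	// With mean=std=0.5, a pixel of 1.0 maps to +1 and 0.0 maps to -1.
	fmt.Println(normalizePixel(1, 0, mean, std), normalizePixel(0, 1, mean, std))
}
```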

Types

type AudioConfig

type AudioConfig struct {
	SampleRate int     // Audio sample rate in Hz (default 16000).
	NumMels    int     // Number of mel filterbank channels (default 80).
	FFTSize    int     // FFT window size in samples (default 400).
	HopLength  int     // Hop length between frames in samples (default 160).
	FMin       float32 // Minimum frequency for mel filterbank (default 0).
	FMax       float32 // Maximum frequency for mel filterbank (default 8000).
}

AudioConfig specifies parameters for mel-spectrogram extraction.

func DefaultAudioConfig

func DefaultAudioConfig() AudioConfig

DefaultAudioConfig returns an AudioConfig with standard Whisper-style defaults.

type AudioSessionConfig

type AudioSessionConfig struct {
	// AudioCfg controls mel-spectrogram extraction parameters.
	AudioCfg AudioConfig

	// AudioTokenID is the token ID used as a placeholder for audio frames
	// in the text token sequence.
	AudioTokenID int

	// MaxAudioTokens is the maximum number of audio tokens allowed in a
	// single sequence. Zero means no limit.
	MaxAudioTokens int

	// EmbedDim is the embedding dimension of the language model.
	EmbedDim int
}

AudioSessionConfig holds configuration for an audio+text inference session.

type AudioTextInput

type AudioTextInput struct {
	// PCMAudio is raw PCM audio samples (mono, float32, at AudioCfg.SampleRate).
	PCMAudio []float32

	// TokenIDs is the tokenized text prompt, with AudioTokenID placeholders
	// where audio embeddings should be inserted.
	TokenIDs []int

	// TextEmbeddings is the text token embedding matrix of shape
	// [len(TokenIDs), EmbedDim]. Positions with AudioTokenID will be
	// replaced by projected audio embeddings.
	TextEmbeddings []float32
}

AudioTextInput holds the inputs for a single audio+text inference call.

type AudioTextOutput

type AudioTextOutput struct {
	// MergedEmbeddings is the merged [SeqLen, EmbedDim] embedding sequence
	// with audio embeddings replacing AudioTokenID positions.
	MergedEmbeddings []float32

	// SeqLen is the sequence length of the merged output.
	SeqLen int

	// EmbedDim is the embedding dimension of the merged output.
	EmbedDim int

	// AudioFrames is the number of audio frames (downsampled) produced by
	// the encoder.
	AudioFrames int
}

AudioTextOutput holds the result of an audio+text inference session.

type AudioTextSession

type AudioTextSession[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

AudioTextSession orchestrates audio+text inference: it accepts raw PCM audio and a text prompt, runs mel-spectrogram extraction, encodes audio through a Whisper-style encoder, projects audio embeddings into text space, merges them with text embeddings, and produces a unified embedding sequence ready for language model decoding.

func NewAudioTextSession

func NewAudioTextSession[T tensor.Numeric](
	cfg AudioSessionConfig,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	encoder *audio.WhisperEncoder[T],
	connector *ProjectionConnector[T],
) (*AudioTextSession[T], error)

NewAudioTextSession creates a new audio+text inference session. The encoder produces embeddings of dimension audioDim, which the connector projects into the language model's embedding space (cfg.EmbedDim).

func (*AudioTextSession[T]) Run

Run executes the full audio+text inference pipeline:

  1. Normalize raw PCM audio
  2. Extract mel-spectrogram
  3. Run Whisper encoder to produce audio embeddings
  4. Project audio embeddings into text embedding space
  5. Merge audio embeddings with text embeddings at AudioTokenID positions

type ConnectorConfig

type ConnectorConfig struct {
	VisionDim int
	TextDim   int
	WeightKey string // GGUF key for projection matrix; default "mm.projector.weight"
}

ConnectorConfig holds parameters for the vision-to-text projection.

type EncoderConfig

type EncoderConfig struct {
	HiddenDim int
	NumHeads  int
	NumLayers int
	PatchCfg  PatchConfig
}

EncoderConfig holds the hyperparameters for a vision encoder.

type ImageFormat

type ImageFormat int

ImageFormat identifies the encoding format of an input image.

const (
	JPEG ImageFormat = 1
	PNG  ImageFormat = 2
)

type MelSpectrogram

type MelSpectrogram struct {
	Data      []float32
	NumFrames int
	NumMels   int
}

MelSpectrogram holds a log-mel spectrogram with shape [NumFrames, NumMels]. Data is stored in row-major order: Data[frame*NumMels + mel].
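
The documented row-major layout can be illustrated with a small standalone helper; melAt is not part of the package.

```go
package main

import "fmt"

// melAt reads one value from a MelSpectrogram-style buffer using the
// documented row-major layout: Data[frame*NumMels + mel].
func melAt(data []float32, numMels, frame, mel int) float32 {
	return data[frame*numMels+mel]
}

func main() {
	data := []float32{ // 2 frames x 3 mel channels
		0.1, 0.2, 0.3,
		0.4, 0.5, 0.6,
	}
	fmt.Println(melAt(data, 3, 1, 2)) // last mel channel of the second frame
}
```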

func ExtractMelSpectrogram

func ExtractMelSpectrogram(samples []float32, cfg AudioConfig) (*MelSpectrogram, error)

ExtractMelSpectrogram computes a log-mel spectrogram from raw audio samples. It applies a Hann window to each frame, computes the magnitude spectrum via DFT, applies a mel filterbank, and returns log-scaled mel energies. Output shape: [NumFrames, NumMels].

type MergeConfig

type MergeConfig struct {
	// ImageTokenID is the token ID used as a placeholder for image patches
	// in the text token sequence.
	ImageTokenID int
	// MaxImageTokens is the maximum number of image tokens allowed in a
	// single sequence. Zero means no limit.
	MaxImageTokens int
	// EmbedDim is the embedding dimension shared by text and vision embeddings.
	EmbedDim int
}

MergeConfig controls how text and vision embeddings are merged.

type MergeResult

type MergeResult struct {
	// Embeddings is a flat [SeqLen, EmbedDim] float32 slice containing the
	// merged text and vision embeddings.
	Embeddings []float32
	// SeqLen is the sequence length of the merged output.
	SeqLen int
	// EmbedDim is the embedding dimension of the merged output.
	EmbedDim int
}

MergeResult holds the merged embedding sequence.

func MergeEmbeddings

func MergeEmbeddings(textEmbeds []float32, visionEmbeds []float32, tokenIDs []int, cfg MergeConfig) (MergeResult, error)

MergeEmbeddings replaces image-token positions in the text embedding sequence with consecutive vision embeddings. textEmbeds has shape [seqLen, EmbedDim], visionEmbeds has shape [numVisionTokens, EmbedDim] (already projected to text dimension space), and tokenIDs has length seqLen. Each position where tokenIDs[i] == cfg.ImageTokenID is replaced by the next vision embedding vector.
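
The replacement logic described above can be sketched as a standalone illustration; the package's error handling for mismatched vision-token counts is omitted here.

```go
package main

import "fmt"

// mergeEmbeddings sketches the documented behavior: each position whose
// token ID equals imageTokenID is overwritten with the next vision
// embedding row, in order. Embeddings are flat row-major slices.
func mergeEmbeddings(textEmbeds, visionEmbeds []float32, tokenIDs []int, imageTokenID, embedDim int) []float32 {
	out := make([]float32, len(textEmbeds))
	copy(out, textEmbeds)
	next := 0
	for i, id := range tokenIDs {
		if id == imageTokenID {
			copy(out[i*embedDim:(i+1)*embedDim], visionEmbeds[next*embedDim:(next+1)*embedDim])
			next++
		}
	}
	return out
}

func main() {
	text := []float32{1, 1, 0, 0, 2, 2} // 3 tokens, embedDim=2
	vision := []float32{9, 9}           // 1 vision token
	// Token 7 is the image placeholder; position 1 gets the vision row.
	fmt.Println(mergeEmbeddings(text, vision, []int{5, 7, 5}, 7, 2))
}
```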

type MultiModalConfig

type MultiModalConfig struct {
	EncoderType      string    // vision.encoder.type ("siglip", "clip")
	HiddenSize       int       // vision.hidden_size
	PatchSize        int       // vision.patch_size
	ImageSize        int       // vision.image_size
	NumHeads         int       // vision.attention.head_count
	NumLayers        int       // vision.block_count
	ProjectorWeights []float32 // mm.projector.weight (flattened tensor)
}

MultiModalConfig holds vision encoder and projector parameters loaded from GGUF metadata.

func LoadMultiModalConfig

func LoadMultiModalConfig(r io.ReadSeeker) (*MultiModalConfig, error)

LoadMultiModalConfig reads GGUF from r and extracts multimodal config.

func LoadMultiModalConfigFromFile

func LoadMultiModalConfigFromFile(path string) (*MultiModalConfig, error)

LoadMultiModalConfigFromFile opens a GGUF file at path and loads multimodal config from it.

func MultiModalConfigFromMetadata

func MultiModalConfigFromMetadata(metadata map[string]any) (*MultiModalConfig, error)

MultiModalConfigFromMetadata extracts a MultiModalConfig from a pre-parsed GGUF metadata map.

type PatchConfig

type PatchConfig struct {
	PatchSize int
	ImageSize int
	NormMean  [3]float32
	NormStd   [3]float32
}

PatchConfig specifies how an image should be resized, normalized, and divided into patches for a vision encoder.

type ProjectionConnector

type ProjectionConnector[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

ProjectionConnector projects vision encoder output into the text model's embedding space via a learned linear projection.

func NewProjectionConnector

func NewProjectionConnector[T tensor.Numeric](cfg ConnectorConfig, e compute.Engine[T]) *ProjectionConnector[T]

NewProjectionConnector creates a ProjectionConnector. The projection matrix is zero-initialized; call LoadWeights to populate it from model weights.

func (*ProjectionConnector[T]) LoadWeights

func (p *ProjectionConnector[T]) LoadWeights(weights []float32) error

LoadWeights sets the projection matrix from a flat []float32 of shape [VisionDim, TextDim].

func (*ProjectionConnector[T]) Project

func (p *ProjectionConnector[T]) Project(visionEmbeds []T, numTokens int) ([]T, error)

Project applies linear projection: [numTokens, VisionDim] x [VisionDim, TextDim] -> [numTokens, TextDim].
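
The projection is a plain matrix multiply; here is a standalone sketch, with project an illustrative re-implementation rather than the package's code, assuming the row-major [VisionDim, TextDim] weight layout that LoadWeights documents.

```go
package main

import "fmt"

// project computes [numTokens, visionDim] x [visionDim, textDim]
// -> [numTokens, textDim], with weights stored row-major as
// [visionDim, textDim] (the layout LoadWeights documents).
func project(visionEmbeds, weights []float32, numTokens, visionDim, textDim int) []float32 {
	out := make([]float32, numTokens*textDim)
	for t := 0; t < numTokens; t++ {
		for j := 0; j < textDim; j++ {
			var sum float32
			for k := 0; k < visionDim; k++ {
				sum += visionEmbeds[t*visionDim+k] * weights[k*textDim+j]
			}
			out[t*textDim+j] = sum
		}
	}
	return out
}

func main() {
	// One token, visionDim=2, textDim=2, identity weights: output = input.
	fmt.Println(project([]float32{3, 4}, []float32{1, 0, 0, 1}, 1, 2, 2))
}
```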

func (*ProjectionConnector[T]) TextDim

func (p *ProjectionConnector[T]) TextDim() int

TextDim returns the output dimension (text model embedding size).

func (*ProjectionConnector[T]) VisionDim

func (p *ProjectionConnector[T]) VisionDim() int

VisionDim returns the input dimension (vision encoder hidden size).

type SigLIPEncoder

type SigLIPEncoder[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

SigLIPEncoder implements VisionEncoder using a SigLIP-style linear projection from patch embeddings into the hidden dimension.

func NewSigLIPEncoder

func NewSigLIPEncoder[T tensor.Numeric](cfg EncoderConfig, e compute.Engine[T]) *SigLIPEncoder[T]

NewSigLIPEncoder creates a SigLIPEncoder with randomly initialized weights.

func (*SigLIPEncoder[T]) Encode

func (s *SigLIPEncoder[T]) Encode(patches []float32, cfg PatchConfig) ([]T, error)

Encode projects patch embeddings through a linear layer. patches is a flat []float32 of shape [num_patches, patch_dim]. Returns []T of length num_patches * HiddenDim.

func (*SigLIPEncoder[T]) HiddenSize

func (s *SigLIPEncoder[T]) HiddenSize() int

HiddenSize returns the hidden dimension of the encoder output.

func (*SigLIPEncoder[T]) NumLayers

func (s *SigLIPEncoder[T]) NumLayers() int

NumLayers returns the number of encoder layers.

type VisionEncoder

type VisionEncoder[T tensor.Numeric] interface {
	Encode(patches []float32, cfg PatchConfig) ([]T, error)
	HiddenSize() int
	NumLayers() int
}

VisionEncoder encodes image patches into hidden representations for vision-language model inference.
