multimodal

package
v1.31.0

This package is not in the latest version of its module.
Published: Mar 28, 2026 License: Apache-2.0 Imports: 17 Imported by: 0

Documentation

Overview

Package multimodal provides vision, audio, and multimodal inference support: image preprocessing and embedding merge for vision-language model inference, audio preprocessing for audio-language model inference, and GGUF metadata loading for vision and multimodal models.

Stability: alpha

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HannWindow

func HannWindow(size int) []float32

HannWindow returns Hann window coefficients of the given size.

	w[n] = 0.5 * (1 - cos(2*pi*n / (size-1)))

func MelFilterbank

func MelFilterbank(numMels, fftSize, sampleRate int, fMin, fMax float32) [][]float32

MelFilterbank computes a bank of triangular mel-scale filters. Returns a matrix of shape [numMels, fftSize/2+1] where each row is a triangular filter spanning from its lower to upper mel-band edge.

func NormalizeAudio

func NormalizeAudio(samples []float32) []float32

NormalizeAudio normalizes audio samples to the range [-1, 1] by dividing by the maximum absolute value. Returns a new slice without modifying the input.
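
The documented behavior can be sketched as follows. This is a standalone illustration; returning all-zero input unchanged is an assumption, since the doc does not define that case.

```go
package main

import "fmt"

// normalizeAudio mirrors the documented behavior: scale so the peak
// absolute value is 1, returning a new slice without modifying the input.
func normalizeAudio(samples []float32) []float32 {
	var peak float32
	for _, s := range samples {
		if s < 0 {
			s = -s
		}
		if s > peak {
			peak = s
		}
	}
	out := make([]float32, len(samples))
	if peak == 0 {
		copy(out, samples) // all-zero input returned unchanged (assumption)
		return out
	}
	for i, s := range samples {
		out[i] = s / peak
	}
	return out
}

func main() {
	fmt.Println(normalizeAudio([]float32{0.5, -2, 1})) // peak 2 -> [0.25 -1 0.5]
}
```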

func NumAudioTokens

func NumAudioTokens(tokenIDs []int, audioTokenID int) int

NumAudioTokens counts how many entries in tokenIDs equal audioTokenID.

func NumImageTokens

func NumImageTokens(tokenIDs []int, imageTokenID int) int

NumImageTokens counts how many entries in tokenIDs equal imageTokenID.
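
Both counters (NumAudioTokens and NumImageTokens) reduce to a simple scan, sketched here as a standalone illustration; countToken is not part of the package.

```go
package main

import "fmt"

// countToken counts entries in tokenIDs equal to the placeholder token ID,
// which is what NumAudioTokens and NumImageTokens document.
func countToken(tokenIDs []int, tokenID int) int {
	n := 0
	for _, id := range tokenIDs {
		if id == tokenID {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countToken([]int{5, 7, 7, 2}, 7)) // 2
}
```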

func NumPatches

func NumPatches(cfg PatchConfig) int

NumPatches returns the number of patches the image is divided into.

func PatchDim

func PatchDim(cfg PatchConfig) int

PatchDim returns the dimensionality of each patch embedding (PatchSize*PatchSize*3).
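
The arithmetic behind these two functions can be illustrated standalone. PatchDim is documented as PatchSize*PatchSize*3; the (ImageSize/PatchSize)^2 patch count is an assumption based on the usual ViT-style square grid.

```go
package main

import "fmt"

// numPatches assumes a ViT-style grid: (ImageSize/PatchSize)^2 patches.
func numPatches(imageSize, patchSize int) int {
	side := imageSize / patchSize
	return side * side
}

// patchDim follows the documented formula PatchSize*PatchSize*3.
func patchDim(patchSize int) int {
	return patchSize * patchSize * 3
}

func main() {
	// A 224x224 image with 14x14 patches.
	fmt.Println(numPatches(224, 14), patchDim(14)) // 256 588
}
```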

func PreprocessImage

func PreprocessImage(data []byte, format ImageFormat, cfg PatchConfig) ([]float32, error)

PreprocessImage decodes an image from raw bytes, resizes it to cfg.ImageSize x cfg.ImageSize using bilinear interpolation, normalizes pixel values per channel, and returns the result as flattened patch embeddings of shape [num_patches, patch_dim].
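
The per-channel normalization step presumably uses NormMean and NormStd from PatchConfig in the conventional (value - mean) / std form; this sketch illustrates that convention. normalizePixel is an illustrative helper, and the exact formula inside PreprocessImage is an assumption.

```go
package main

import "fmt"

// normalizePixel applies per-channel normalization (value - mean) / std,
// the conventional use of NormMean/NormStd; whether PreprocessImage does
// exactly this is an assumption.
func normalizePixel(v float32, c int, mean, std [3]float32) float32 {
	return (v - mean[c]) / std[c]
}

func main() {
	mean := [3]float32{0.5, 0.5, 0.5}
	std := [3]float32{0.5, 0.5, 0.5}
	// With mean=std=0.5, a pixel of 1.0 maps to +1 and 0.0 maps to -1.
	fmt.Println(normalizePixel(1, 0, mean, std), normalizePixel(0, 1, mean, std))
}
```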

Types

type AudioConfig

type AudioConfig struct {
	SampleRate int     // Audio sample rate in Hz (default 16000).
	NumMels    int     // Number of mel filterbank channels (default 80).
	FFTSize    int     // FFT window size in samples (default 400).
	HopLength  int     // Hop length between frames in samples (default 160).
	FMin       float32 // Minimum frequency for mel filterbank (default 0).
	FMax       float32 // Maximum frequency for mel filterbank (default 8000).
}

AudioConfig specifies parameters for mel-spectrogram extraction.

func DefaultAudioConfig

func DefaultAudioConfig() AudioConfig

DefaultAudioConfig returns an AudioConfig with standard Whisper-style defaults.

type AudioSessionConfig

type AudioSessionConfig struct {
	// AudioCfg controls mel-spectrogram extraction parameters.
	AudioCfg AudioConfig

	// AudioTokenID is the token ID used as a placeholder for audio frames
	// in the text token sequence.
	AudioTokenID int

	// MaxAudioTokens is the maximum number of audio tokens allowed in a
	// single sequence. Zero means no limit.
	MaxAudioTokens int

	// EmbedDim is the embedding dimension of the language model.
	EmbedDim int
}

AudioSessionConfig holds configuration for an audio+text inference session.

type AudioTextInput

type AudioTextInput struct {
	// PCMAudio is raw PCM audio samples (mono, float32, at AudioCfg.SampleRate).
	PCMAudio []float32

	// TokenIDs is the tokenized text prompt, with AudioTokenID placeholders
	// where audio embeddings should be inserted.
	TokenIDs []int

	// TextEmbeddings is the text token embedding matrix of shape
	// [len(TokenIDs), EmbedDim]. Positions with AudioTokenID will be
	// replaced by projected audio embeddings.
	TextEmbeddings []float32
}

AudioTextInput holds the inputs for a single audio+text inference call.

type AudioTextOutput

type AudioTextOutput struct {
	// MergedEmbeddings is the merged [SeqLen, EmbedDim] embedding sequence
	// with audio embeddings replacing AudioTokenID positions.
	MergedEmbeddings []float32

	// SeqLen is the sequence length of the merged output.
	SeqLen int

	// EmbedDim is the embedding dimension of the merged output.
	EmbedDim int

	// AudioFrames is the number of audio frames (downsampled) produced by
	// the encoder.
	AudioFrames int
}

AudioTextOutput holds the result of an audio+text inference session.

type AudioTextSession

type AudioTextSession[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

AudioTextSession orchestrates audio+text inference: it accepts raw PCM audio and a text prompt, runs mel-spectrogram extraction, encodes audio through a Whisper-style encoder, projects audio embeddings into text space, merges them with text embeddings, and produces a unified embedding sequence ready for language model decoding.

func NewAudioTextSession

func NewAudioTextSession[T tensor.Numeric](
	cfg AudioSessionConfig,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	encoder *audio.WhisperEncoder[T],
	connector *ProjectionConnector[T],
) (*AudioTextSession[T], error)

NewAudioTextSession creates a new audio+text inference session. The encoder produces embeddings of dimension audioDim, which the connector projects into the language model's embedding space (cfg.EmbedDim).

func (*AudioTextSession[T]) Run

Run executes the full audio+text inference pipeline:

  1. Normalize raw PCM audio
  2. Extract mel-spectrogram
  3. Run Whisper encoder to produce audio embeddings
  4. Project audio embeddings into text embedding space
  5. Merge audio embeddings with text embeddings at AudioTokenID positions

type ConnectorConfig

type ConnectorConfig struct {
	VisionDim int
	TextDim   int
	WeightKey string // GGUF key for projection matrix; default "mm.projector.weight"
}

ConnectorConfig holds parameters for the vision-to-text projection.

type EncoderConfig

type EncoderConfig struct {
	HiddenDim int
	NumHeads  int
	NumLayers int
	PatchCfg  PatchConfig
}

EncoderConfig holds the hyperparameters for a vision encoder.

type ImageFormat

type ImageFormat int

ImageFormat identifies the encoding format of an input image.

const (
	JPEG ImageFormat = 1
	PNG  ImageFormat = 2
)

type MelSpectrogram

type MelSpectrogram struct {
	Data      []float32
	NumFrames int
	NumMels   int
}

MelSpectrogram holds a log-mel spectrogram with shape [NumFrames, NumMels]. Data is stored in row-major order: Data[frame*NumMels + mel].
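
The documented row-major layout can be illustrated with a small standalone helper; melAt is not part of the package.

```go
package main

import "fmt"

// melAt reads one value from a MelSpectrogram-style buffer using the
// documented row-major layout: Data[frame*NumMels + mel].
func melAt(data []float32, numMels, frame, mel int) float32 {
	return data[frame*numMels+mel]
}

func main() {
	data := []float32{ // 2 frames x 3 mel channels
		0.1, 0.2, 0.3,
		0.4, 0.5, 0.6,
	}
	fmt.Println(melAt(data, 3, 1, 2)) // last mel channel of the second frame
}
```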

func ExtractMelSpectrogram

func ExtractMelSpectrogram(samples []float32, cfg AudioConfig) (*MelSpectrogram, error)

ExtractMelSpectrogram computes a log-mel spectrogram from raw audio samples. It applies a Hann window to each frame, computes the magnitude spectrum via DFT, applies a mel filterbank, and returns log-scaled mel energies. Output shape: [NumFrames, NumMels].

type MergeConfig

type MergeConfig struct {
	// ImageTokenID is the token ID used as a placeholder for image patches
	// in the text token sequence.
	ImageTokenID int
	// MaxImageTokens is the maximum number of image tokens allowed in a
	// single sequence. Zero means no limit.
	MaxImageTokens int
	// EmbedDim is the embedding dimension shared by text and vision embeddings.
	EmbedDim int
}

MergeConfig controls how text and vision embeddings are merged.

type MergeResult

type MergeResult struct {
	// Embeddings is a flat [SeqLen, EmbedDim] float32 slice containing the
	// merged text and vision embeddings.
	Embeddings []float32
	// SeqLen is the sequence length of the merged output.
	SeqLen int
	// EmbedDim is the embedding dimension of the merged output.
	EmbedDim int
}

MergeResult holds the merged embedding sequence.

func MergeEmbeddings

func MergeEmbeddings(textEmbeds []float32, visionEmbeds []float32, tokenIDs []int, cfg MergeConfig) (MergeResult, error)

MergeEmbeddings replaces image-token positions in the text embedding sequence with consecutive vision embeddings. textEmbeds has shape [seqLen, EmbedDim], visionEmbeds has shape [numVisionTokens, EmbedDim] (already projected to text dimension space), and tokenIDs has length seqLen. Each position where tokenIDs[i] == cfg.ImageTokenID is replaced by the next vision embedding vector.
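
The replacement logic described above can be sketched as a standalone illustration; the package's error handling for mismatched vision-token counts is omitted here.

```go
package main

import "fmt"

// mergeEmbeddings sketches the documented behavior: each position whose
// token ID equals imageTokenID is overwritten with the next vision
// embedding row, in order. Embeddings are flat row-major slices.
func mergeEmbeddings(textEmbeds, visionEmbeds []float32, tokenIDs []int, imageTokenID, embedDim int) []float32 {
	out := make([]float32, len(textEmbeds))
	copy(out, textEmbeds)
	next := 0
	for i, id := range tokenIDs {
		if id == imageTokenID {
			copy(out[i*embedDim:(i+1)*embedDim], visionEmbeds[next*embedDim:(next+1)*embedDim])
			next++
		}
	}
	return out
}

func main() {
	text := []float32{1, 1, 0, 0, 2, 2} // 3 tokens, embedDim=2
	vision := []float32{9, 9}           // 1 vision token
	// Token 7 is the image placeholder; position 1 gets the vision row.
	fmt.Println(mergeEmbeddings(text, vision, []int{5, 7, 5}, 7, 2))
}
```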

type MultiModalConfig

type MultiModalConfig struct {
	EncoderType      string    // vision.encoder.type ("siglip", "clip")
	HiddenSize       int       // vision.hidden_size
	PatchSize        int       // vision.patch_size
	ImageSize        int       // vision.image_size
	NumHeads         int       // vision.attention.head_count
	NumLayers        int       // vision.block_count
	ProjectorWeights []float32 // mm.projector.weight (flattened tensor)
}

MultiModalConfig holds vision encoder and projector parameters loaded from GGUF metadata.

func LoadMultiModalConfig

func LoadMultiModalConfig(r io.ReadSeeker) (*MultiModalConfig, error)

LoadMultiModalConfig reads GGUF from r and extracts multimodal config.

func LoadMultiModalConfigFromFile

func LoadMultiModalConfigFromFile(path string) (*MultiModalConfig, error)

LoadMultiModalConfigFromFile opens a GGUF file at path and loads multimodal config from it.

func MultiModalConfigFromMetadata

func MultiModalConfigFromMetadata(metadata map[string]any) (*MultiModalConfig, error)

MultiModalConfigFromMetadata extracts a MultiModalConfig from a pre-parsed GGUF metadata map.

type PatchConfig

type PatchConfig struct {
	PatchSize int
	ImageSize int
	NormMean  [3]float32
	NormStd   [3]float32
}

PatchConfig specifies how an image should be resized, normalized, and divided into patches for a vision encoder.

type ProjectionConnector

type ProjectionConnector[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

ProjectionConnector projects vision encoder output into the text model's embedding space via a learned linear projection.

func NewProjectionConnector

func NewProjectionConnector[T tensor.Numeric](cfg ConnectorConfig, e compute.Engine[T]) *ProjectionConnector[T]

NewProjectionConnector creates a ProjectionConnector. The projection matrix is zero-initialized; call LoadWeights to populate it from model weights.

func (*ProjectionConnector[T]) LoadWeights

func (p *ProjectionConnector[T]) LoadWeights(weights []float32) error

LoadWeights sets the projection matrix from a flat []float32 of shape [VisionDim, TextDim].

func (*ProjectionConnector[T]) Project

func (p *ProjectionConnector[T]) Project(visionEmbeds []T, numTokens int) ([]T, error)

Project applies linear projection: [numTokens, VisionDim] x [VisionDim, TextDim] -> [numTokens, TextDim].
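
The projection is a plain matrix multiply; here is a standalone sketch, with project an illustrative re-implementation rather than the package's code, assuming the row-major [VisionDim, TextDim] weight layout that LoadWeights documents.

```go
package main

import "fmt"

// project computes [numTokens, visionDim] x [visionDim, textDim]
// -> [numTokens, textDim], with weights stored row-major as
// [visionDim, textDim] (the layout LoadWeights documents).
func project(visionEmbeds, weights []float32, numTokens, visionDim, textDim int) []float32 {
	out := make([]float32, numTokens*textDim)
	for t := 0; t < numTokens; t++ {
		for j := 0; j < textDim; j++ {
			var sum float32
			for k := 0; k < visionDim; k++ {
				sum += visionEmbeds[t*visionDim+k] * weights[k*textDim+j]
			}
			out[t*textDim+j] = sum
		}
	}
	return out
}

func main() {
	// One token, visionDim=2, textDim=2, identity weights: output = input.
	fmt.Println(project([]float32{3, 4}, []float32{1, 0, 0, 1}, 1, 2, 2))
}
```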

func (*ProjectionConnector[T]) TextDim

func (p *ProjectionConnector[T]) TextDim() int

TextDim returns the output dimension (text model embedding size).

func (*ProjectionConnector[T]) VisionDim

func (p *ProjectionConnector[T]) VisionDim() int

VisionDim returns the input dimension (vision encoder hidden size).

type SigLIPEncoder

type SigLIPEncoder[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

SigLIPEncoder implements VisionEncoder using a SigLIP-style linear projection from patch embeddings into the hidden dimension.

func NewSigLIPEncoder

func NewSigLIPEncoder[T tensor.Numeric](cfg EncoderConfig, e compute.Engine[T]) *SigLIPEncoder[T]

NewSigLIPEncoder creates a SigLIPEncoder with randomly initialized weights.

func (*SigLIPEncoder[T]) Encode

func (s *SigLIPEncoder[T]) Encode(patches []float32, cfg PatchConfig) ([]T, error)

Encode projects patch embeddings through a linear layer. patches is a flat []float32 of shape [num_patches, patch_dim]. Returns []T of length num_patches * HiddenDim.

func (*SigLIPEncoder[T]) HiddenSize

func (s *SigLIPEncoder[T]) HiddenSize() int

HiddenSize returns the hidden dimension of the encoder output.

func (*SigLIPEncoder[T]) NumLayers

func (s *SigLIPEncoder[T]) NumLayers() int

NumLayers returns the number of encoder layers.

type VisionEncoder

type VisionEncoder[T tensor.Numeric] interface {
	Encode(patches []float32, cfg PatchConfig) ([]T, error)
	HiddenSize() int
	NumLayers() int
}

VisionEncoder encodes image patches into hidden representations for vision-language model inference.
