Documentation ¶
Overview ¶
Package multimodal provides vision, audio, and multi-modal inference support: image and audio preprocessing, embedding merge for vision-language and audio-language model inference, and GGUF metadata loading for vision and multimodal models.
Stability: alpha
Index ¶
- func HannWindow(size int) []float32
- func MelFilterbank(numMels, fftSize, sampleRate int, fMin, fMax float32) [][]float32
- func NormalizeAudio(samples []float32) []float32
- func NumAudioTokens(tokenIDs []int, audioTokenID int) int
- func NumImageTokens(tokenIDs []int, imageTokenID int) int
- func NumPatches(cfg PatchConfig) int
- func PatchDim(cfg PatchConfig) int
- func PreprocessImage(data []byte, format ImageFormat, cfg PatchConfig) ([]float32, error)
- type AudioConfig
- type AudioSessionConfig
- type AudioTextInput
- type AudioTextOutput
- type AudioTextSession
- type ConnectorConfig
- type EncoderConfig
- type ImageFormat
- type MelSpectrogram
- type MergeConfig
- type MergeResult
- type MultiModalConfig
- type PatchConfig
- type ProjectionConnector
- type SigLIPEncoder
- type VisionEncoder
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func HannWindow ¶
func HannWindow(size int) []float32
HannWindow returns Hann window coefficients of the given size: w[n] = 0.5 * (1 - cos(2*pi*n / (size-1))).
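As a standalone sketch (not the package's implementation), the documented formula can be written as follows; the size == 1 case is an assumption, since the formula divides by size-1:

```go
package main

import (
	"fmt"
	"math"
)

// hannWindow computes w[n] = 0.5 * (1 - cos(2*pi*n/(size-1))).
func hannWindow(size int) []float32 {
	w := make([]float32, size)
	if size == 1 {
		w[0] = 1 // degenerate case: a single-sample window is conventionally 1
		return w
	}
	for n := 0; n < size; n++ {
		w[n] = float32(0.5 * (1 - math.Cos(2*math.Pi*float64(n)/float64(size-1))))
	}
	return w
}

func main() {
	fmt.Println(hannWindow(5)) // [0 0.5 1 0.5 0]
}
```

The window tapers to zero at both ends and peaks at 1 in the middle, which suppresses spectral leakage when each audio frame is windowed before the DFT.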
func MelFilterbank ¶
func MelFilterbank(numMels, fftSize, sampleRate int, fMin, fMax float32) [][]float32
MelFilterbank computes a bank of triangular mel-scale filters. Returns a matrix of shape [numMels, fftSize/2+1] where each row is a triangular filter spanning from its lower to upper mel-band edge.
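A minimal sketch of such a filterbank, assuming the common HTK-style mel scale (2595 * log10(1 + f/700)); the package's exact mel formula and any filter normalization are not documented here, so treat the specifics below as assumptions:

```go
package main

import (
	"fmt"
	"math"
)

// hzToMel and melToHz use the HTK mel scale (an assumption).
func hzToMel(f float64) float64 { return 2595 * math.Log10(1+f/700) }
func melToHz(m float64) float64 { return 700 * (math.Pow(10, m/2595) - 1) }

// melFilterbank builds numMels triangular filters over fftSize/2+1 FFT bins.
func melFilterbank(numMels, fftSize, sampleRate int, fMin, fMax float64) [][]float32 {
	numBins := fftSize/2 + 1
	// numMels+2 equally spaced points on the mel scale give the
	// lower/center/upper edges of each triangle.
	edges := make([]float64, numMels+2)
	lo, hi := hzToMel(fMin), hzToMel(fMax)
	for i := range edges {
		hz := melToHz(lo + (hi-lo)*float64(i)/float64(numMels+1))
		edges[i] = hz * float64(fftSize) / float64(sampleRate) // convert Hz to FFT bin
	}
	filters := make([][]float32, numMels)
	for m := 0; m < numMels; m++ {
		filters[m] = make([]float32, numBins)
		l, c, u := edges[m], edges[m+1], edges[m+2]
		for b := 0; b < numBins; b++ {
			fb := float64(b)
			switch {
			case fb > l && fb < c:
				filters[m][b] = float32((fb - l) / (c - l)) // rising slope
			case fb >= c && fb < u:
				filters[m][b] = float32((u - fb) / (u - c)) // falling slope
			}
		}
	}
	return filters
}

func main() {
	fb := melFilterbank(80, 400, 16000, 0, 8000)
	fmt.Println(len(fb), len(fb[0])) // 80 201
}
```

With the documented defaults (80 mels, FFT size 400, 16 kHz, 0-8000 Hz), the result has the documented [numMels, fftSize/2+1] shape.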
func NormalizeAudio ¶
func NormalizeAudio(samples []float32) []float32
NormalizeAudio normalizes audio samples to the range [-1, 1] by dividing by the maximum absolute value. Returns a new slice without modifying the input.
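A standalone sketch of this peak normalization; how the package treats all-zero (silent) input is not documented, so returning zeros in that case is an assumption:

```go
package main

import (
	"fmt"
	"math"
)

// normalizeAudio scales samples so the peak absolute value is 1,
// returning a new slice and leaving the input untouched.
func normalizeAudio(samples []float32) []float32 {
	peak := float32(0)
	for _, s := range samples {
		if a := float32(math.Abs(float64(s))); a > peak {
			peak = a
		}
	}
	out := make([]float32, len(samples))
	if peak == 0 {
		return out // assumption: silence stays silence rather than dividing by zero
	}
	for i, s := range samples {
		out[i] = s / peak
	}
	return out
}

func main() {
	fmt.Println(normalizeAudio([]float32{0.5, -0.25, 0.1})) // [1 -0.5 0.2]
}
```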
func NumAudioTokens ¶
func NumAudioTokens(tokenIDs []int, audioTokenID int) int
NumAudioTokens counts how many entries in tokenIDs equal audioTokenID.
func NumImageTokens ¶
func NumImageTokens(tokenIDs []int, imageTokenID int) int
NumImageTokens counts how many entries in tokenIDs equal imageTokenID.
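Both counting helpers describe the same simple scan, sketched here with a hypothetical placeholder token ID (32000 is illustrative, not from the package):

```go
package main

import "fmt"

// countToken counts the occurrences of id in tokenIDs — the behavior
// both NumAudioTokens and NumImageTokens describe.
func countToken(tokenIDs []int, id int) int {
	n := 0
	for _, t := range tokenIDs {
		if t == id {
			n++
		}
	}
	return n
}

func main() {
	ids := []int{5, 32000, 7, 32000, 32000, 9} // 32000: hypothetical placeholder ID
	fmt.Println(countToken(ids, 32000))        // 3
}
```

This count tells the caller how many embedding rows the encoder must supply before MergeEmbeddings can succeed.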
func NumPatches ¶
func NumPatches(cfg PatchConfig) int
NumPatches returns the number of patches the image is divided into.
func PatchDim ¶
func PatchDim(cfg PatchConfig) int
PatchDim returns the dimensionality of each patch embedding (PatchSize*PatchSize*3).
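A sketch of both computations, assuming PatchConfig exposes ImageSize and PatchSize fields (the fields the surrounding docs mention; the real struct may contain more) and that patches form a square, non-overlapping grid:

```go
package main

import "fmt"

// patchConfig mirrors the two PatchConfig fields the docs mention.
type patchConfig struct {
	ImageSize int // square input image side, in pixels
	PatchSize int // square patch side, in pixels
}

// numPatches assumes a square grid of non-overlapping patches.
func numPatches(cfg patchConfig) int {
	side := cfg.ImageSize / cfg.PatchSize
	return side * side
}

// patchDim is PatchSize*PatchSize*3, as documented for PatchDim.
func patchDim(cfg patchConfig) int {
	return cfg.PatchSize * cfg.PatchSize * 3
}

func main() {
	// 224x224 image with 16x16 patches: a common vision-encoder setup.
	cfg := patchConfig{ImageSize: 224, PatchSize: 16}
	fmt.Println(numPatches(cfg), patchDim(cfg)) // 196 768
}
```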
func PreprocessImage ¶
func PreprocessImage(data []byte, format ImageFormat, cfg PatchConfig) ([]float32, error)
PreprocessImage decodes an image from raw bytes, resizes it to cfg.ImageSize x cfg.ImageSize using bilinear interpolation, normalizes pixel values per channel, and returns the result as flattened patch embeddings of shape [num_patches, patch_dim].
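The bilinear resize step can be sketched as follows for a single-channel image (the real function operates on three channels); the pixel-center sampling convention and edge clamping used here are assumptions, not necessarily the package's choices:

```go
package main

import (
	"fmt"
	"math"
)

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// bilinear resizes a grayscale srcW x srcH image to dstW x dstH by
// sampling pixel centers and blending the four nearest source pixels.
func bilinear(src []float32, srcW, srcH, dstW, dstH int) []float32 {
	dst := make([]float32, dstW*dstH)
	for y := 0; y < dstH; y++ {
		for x := 0; x < dstW; x++ {
			// Map the destination pixel center back into source coordinates.
			fx := (float64(x)+0.5)*float64(srcW)/float64(dstW) - 0.5
			fy := (float64(y)+0.5)*float64(srcH)/float64(dstH) - 0.5
			x0, y0 := int(math.Floor(fx)), int(math.Floor(fy))
			dx, dy := fx-float64(x0), fy-float64(y0)
			x1 := clamp(x0+1, 0, srcW-1)
			y1 := clamp(y0+1, 0, srcH-1)
			x0 = clamp(x0, 0, srcW-1)
			y0 = clamp(y0, 0, srcH-1)
			top := (1-dx)*float64(src[y0*srcW+x0]) + dx*float64(src[y0*srcW+x1])
			bot := (1-dx)*float64(src[y1*srcW+x0]) + dx*float64(src[y1*srcW+x1])
			dst[y*dstW+x] = float32((1-dy)*top + dy*bot)
		}
	}
	return dst
}

func main() {
	// Upscaling a constant 2x2 image: bilinear interpolation of a
	// constant must reproduce the constant.
	dst := bilinear([]float32{3, 3, 3, 3}, 2, 2, 4, 4)
	fmt.Println(dst[0], dst[15])
}
```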
Types ¶
type AudioConfig ¶
type AudioConfig struct {
SampleRate int // Audio sample rate in Hz (default 16000).
NumMels int // Number of mel filterbank channels (default 80).
FFTSize int // FFT window size in samples (default 400).
HopLength int // Hop length between frames in samples (default 160).
FMin float32 // Minimum frequency for mel filterbank (default 0).
FMax float32 // Maximum frequency for mel filterbank (default 8000).
}
AudioConfig specifies parameters for mel-spectrogram extraction.
func DefaultAudioConfig ¶
func DefaultAudioConfig() AudioConfig
DefaultAudioConfig returns an AudioConfig with standard Whisper-style defaults.
type AudioSessionConfig ¶
type AudioSessionConfig struct {
// AudioCfg controls mel-spectrogram extraction parameters.
AudioCfg AudioConfig
// AudioTokenID is the token ID used as a placeholder for audio frames
// in the text token sequence.
AudioTokenID int
// MaxAudioTokens is the maximum number of audio tokens allowed in a
// single sequence. Zero means no limit.
MaxAudioTokens int
// EmbedDim is the embedding dimension of the language model.
EmbedDim int
}
AudioSessionConfig holds configuration for an audio+text inference session.
type AudioTextInput ¶
type AudioTextInput struct {
// PCMAudio is raw PCM audio samples (mono, float32, at AudioCfg.SampleRate).
PCMAudio []float32
// TokenIDs is the tokenized text prompt, with AudioTokenID placeholders
// where audio embeddings should be inserted.
TokenIDs []int
// TextEmbeddings is the text token embedding matrix of shape
// [len(TokenIDs), EmbedDim]. Positions with AudioTokenID will be
// replaced by projected audio embeddings.
TextEmbeddings []float32
}
AudioTextInput holds the inputs for a single audio+text inference call.
type AudioTextOutput ¶
type AudioTextOutput struct {
// MergedEmbeddings is the merged [SeqLen, EmbedDim] embedding sequence
// with audio embeddings replacing AudioTokenID positions.
MergedEmbeddings []float32
// SeqLen is the sequence length of the merged output.
SeqLen int
// EmbedDim is the embedding dimension of the merged output.
EmbedDim int
// AudioFrames is the number of audio frames (downsampled) produced by
// the encoder.
AudioFrames int
}
AudioTextOutput holds the result of an audio+text inference session.
type AudioTextSession ¶
AudioTextSession orchestrates audio+text inference: it accepts raw PCM audio and a text prompt, runs mel-spectrogram extraction, encodes audio through a Whisper-style encoder, projects audio embeddings into text space, merges them with text embeddings, and produces a unified embedding sequence ready for language model decoding.
func NewAudioTextSession ¶
func NewAudioTextSession[T tensor.Numeric](
	cfg AudioSessionConfig,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	encoder *audio.WhisperEncoder[T],
	connector *ProjectionConnector[T],
) (*AudioTextSession[T], error)
NewAudioTextSession creates a new audio+text inference session. The encoder produces embeddings of dimension audioDim, which the connector projects into the language model's embedding space (cfg.EmbedDim).
func (*AudioTextSession[T]) Run ¶
func (s *AudioTextSession[T]) Run(ctx context.Context, input AudioTextInput) (*AudioTextOutput, error)
Run executes the full audio+text inference pipeline:
- Normalize raw PCM audio
- Extract mel-spectrogram
- Run Whisper encoder to produce audio embeddings
- Project audio embeddings into text embedding space
- Merge audio embeddings with text embeddings at AudioTokenID positions
type ConnectorConfig ¶
type ConnectorConfig struct {
VisionDim int
TextDim int
WeightKey string // GGUF key for projection matrix; default "mm.projector.weight"
}
ConnectorConfig holds parameters for the vision-to-text projection.
type EncoderConfig ¶
type EncoderConfig struct {
HiddenDim int
NumHeads int
NumLayers int
PatchCfg PatchConfig
}
EncoderConfig holds the hyperparameters for a vision encoder.
type ImageFormat ¶
type ImageFormat int
ImageFormat identifies the encoding format of an input image.
const (
	JPEG ImageFormat = 1
	PNG  ImageFormat = 2
)
type MelSpectrogram ¶
MelSpectrogram holds a log-mel spectrogram with shape [NumFrames, NumMels]. Data is stored in row-major order: Data[frame*NumMels + mel].
func ExtractMelSpectrogram ¶
func ExtractMelSpectrogram(samples []float32, cfg AudioConfig) (*MelSpectrogram, error)
ExtractMelSpectrogram computes a log-mel spectrogram from raw audio samples. It applies a Hann window to each frame, computes the magnitude spectrum via DFT, applies a mel filterbank, and returns log-scaled mel energies. Output shape: [NumFrames, NumMels].
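The frame count follows from FFTSize and HopLength. The sketch below assumes non-padded framing (one frame per hop while a full window fits); whether the package pads or truncates at the edges is not documented:

```go
package main

import "fmt"

// numFrames estimates the frame count for non-padded framing:
// the first frame covers samples [0, fftSize), and each subsequent
// frame starts hopLength samples later.
func numFrames(numSamples, fftSize, hopLength int) int {
	if numSamples < fftSize {
		return 0
	}
	return 1 + (numSamples-fftSize)/hopLength
}

func main() {
	// One second of 16 kHz audio with the documented defaults
	// (FFTSize 400, HopLength 160):
	fmt.Println(numFrames(16000, 400, 160)) // 98
}
```

The resulting mel energies are then addressed row-major, as documented: Data[frame*NumMels + mel].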
type MergeConfig ¶
type MergeConfig struct {
// ImageTokenID is the token ID used as a placeholder for image patches
// in the text token sequence.
ImageTokenID int
// MaxImageTokens is the maximum number of image tokens allowed in a
// single sequence. Zero means no limit.
MaxImageTokens int
// EmbedDim is the embedding dimension shared by text and vision embeddings.
EmbedDim int
}
MergeConfig controls how text and vision embeddings are merged.
type MergeResult ¶
type MergeResult struct {
// Embeddings is a flat [SeqLen, EmbedDim] float32 slice containing the
// merged text and vision embeddings.
Embeddings []float32
// SeqLen is the sequence length of the merged output.
SeqLen int
// EmbedDim is the embedding dimension of the merged output.
EmbedDim int
}
MergeResult holds the merged embedding sequence.
func MergeEmbeddings ¶
func MergeEmbeddings(textEmbeds []float32, visionEmbeds []float32, tokenIDs []int, cfg MergeConfig) (MergeResult, error)
MergeEmbeddings replaces image-token positions in the text embedding sequence with consecutive vision embeddings. textEmbeds has shape [seqLen, EmbedDim], visionEmbeds has shape [numVisionTokens, EmbedDim] (already projected to text dimension space), and tokenIDs has length seqLen. Each position where tokenIDs[i] == cfg.ImageTokenID is replaced by the next vision embedding vector.
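A standalone sketch of this merge (not the package's implementation); the error behavior when vision rows run out is an assumption:

```go
package main

import "fmt"

// mergeEmbeddings replaces each image-token position with the next
// unused vision embedding row, copying everything else through.
func mergeEmbeddings(textEmbeds, visionEmbeds []float32, tokenIDs []int, imageTokenID, embedDim int) ([]float32, error) {
	out := make([]float32, len(textEmbeds))
	copy(out, textEmbeds)
	next := 0 // index of the next unused vision row
	for i, id := range tokenIDs {
		if id != imageTokenID {
			continue
		}
		if (next+1)*embedDim > len(visionEmbeds) {
			return nil, fmt.Errorf("no vision embedding left for position %d", i)
		}
		copy(out[i*embedDim:(i+1)*embedDim], visionEmbeds[next*embedDim:(next+1)*embedDim])
		next++
	}
	return out, nil
}

func main() {
	const dim = 2
	text := []float32{1, 1, 2, 2, 3, 3} // 3 tokens, dim 2
	vision := []float32{9, 9}           // 1 vision row
	ids := []int{10, 42, 11}            // 42: hypothetical image token ID
	merged, _ := mergeEmbeddings(text, vision, ids, 42, dim)
	fmt.Println(merged) // [1 1 9 9 3 3]
}
```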
type MultiModalConfig ¶
type MultiModalConfig struct {
EncoderType string // vision.encoder.type ("siglip", "clip")
HiddenSize int // vision.hidden_size
PatchSize int // vision.patch_size
ImageSize int // vision.image_size
NumHeads int // vision.attention.head_count
NumLayers int // vision.block_count
ProjectorWeights []float32 // mm.projector.weight (flattened tensor)
}
MultiModalConfig holds vision encoder and projector parameters loaded from GGUF metadata.
func LoadMultiModalConfig ¶
func LoadMultiModalConfig(r io.ReadSeeker) (*MultiModalConfig, error)
LoadMultiModalConfig reads GGUF from r and extracts multimodal config.
func LoadMultiModalConfigFromFile ¶
func LoadMultiModalConfigFromFile(path string) (*MultiModalConfig, error)
LoadMultiModalConfigFromFile opens a GGUF file at path and loads multimodal config from it.
func MultiModalConfigFromMetadata ¶
func MultiModalConfigFromMetadata(metadata map[string]any) (*MultiModalConfig, error)
MultiModalConfigFromMetadata extracts a MultiModalConfig from a pre-parsed GGUF metadata map.
type PatchConfig ¶
PatchConfig specifies how an image should be resized, normalized, and divided into patches for a vision encoder.
type ProjectionConnector ¶
ProjectionConnector projects vision encoder output into the text model's embedding space via a learned linear projection.
func NewProjectionConnector ¶
func NewProjectionConnector[T tensor.Numeric](cfg ConnectorConfig, e compute.Engine[T]) *ProjectionConnector[T]
NewProjectionConnector creates a ProjectionConnector. The projection matrix is zero-initialized; call LoadWeights to populate it from model weights.
func (*ProjectionConnector[T]) LoadWeights ¶
func (p *ProjectionConnector[T]) LoadWeights(weights []float32) error
LoadWeights sets the projection matrix from a flat []float32 of shape [VisionDim, TextDim].
func (*ProjectionConnector[T]) Project ¶
func (p *ProjectionConnector[T]) Project(visionEmbeds []T, numTokens int) ([]T, error)
Project applies linear projection: [numTokens, VisionDim] x [VisionDim, TextDim] -> [numTokens, TextDim].
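The projection is a plain matrix multiply. A float32 sketch of the documented shapes, assuming row-major storage for both operands (consistent with LoadWeights' flat [VisionDim, TextDim] layout):

```go
package main

import "fmt"

// project computes [numTokens, visionDim] x [visionDim, textDim]
// -> [numTokens, textDim], with all matrices stored row-major.
func project(visionEmbeds, weight []float32, numTokens, visionDim, textDim int) []float32 {
	out := make([]float32, numTokens*textDim)
	for t := 0; t < numTokens; t++ {
		for j := 0; j < textDim; j++ {
			var sum float32
			for k := 0; k < visionDim; k++ {
				sum += visionEmbeds[t*visionDim+k] * weight[k*textDim+j]
			}
			out[t*textDim+j] = sum
		}
	}
	return out
}

func main() {
	// One token, visionDim 2, textDim 2, identity weight matrix:
	// the projection passes the embedding through unchanged.
	v := []float32{3, 4}
	w := []float32{1, 0, 0, 1}
	fmt.Println(project(v, w, 1, 2, 2)) // [3 4]
}
```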
func (*ProjectionConnector[T]) TextDim ¶
func (p *ProjectionConnector[T]) TextDim() int
TextDim returns the output dimension (text model embedding size).
func (*ProjectionConnector[T]) VisionDim ¶
func (p *ProjectionConnector[T]) VisionDim() int
VisionDim returns the input dimension (vision encoder hidden size).
type SigLIPEncoder ¶
SigLIPEncoder implements VisionEncoder using a SigLIP-style linear projection from patch embeddings into the hidden dimension.
func NewSigLIPEncoder ¶
func NewSigLIPEncoder[T tensor.Numeric](cfg EncoderConfig, e compute.Engine[T]) *SigLIPEncoder[T]
NewSigLIPEncoder creates a SigLIPEncoder with randomly initialized weights.
func (*SigLIPEncoder[T]) Encode ¶
func (s *SigLIPEncoder[T]) Encode(patches []float32, cfg PatchConfig) ([]T, error)
Encode projects patch embeddings through a linear layer. patches is a flat []float32 of shape [num_patches, patch_dim]. Returns []T of length num_patches * HiddenDim.
func (*SigLIPEncoder[T]) HiddenSize ¶
func (s *SigLIPEncoder[T]) HiddenSize() int
HiddenSize returns the hidden dimension of the encoder output.
func (*SigLIPEncoder[T]) NumLayers ¶
func (s *SigLIPEncoder[T]) NumLayers() int
NumLayers returns the number of encoder layers.
type VisionEncoder ¶
type VisionEncoder[T tensor.Numeric] interface {
	Encode(patches []float32, cfg PatchConfig) ([]T, error)
	HiddenSize() int
	NumLayers() int
}
VisionEncoder encodes image patches into hidden representations for vision-language model inference.