audio

package
v1.38.3 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 31, 2026 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Overview

Package audio provides audio-related neural network layers.

Stability: alpha

Constants

This section is empty.

Variables

This section is empty.

Functions

func ChunkAudio added in v1.36.0

func ChunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32

ChunkAudio splits audio samples into fixed-length chunks of maxSeconds each. The last chunk is padded with silence (zeros) to the full chunk length.
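The chunking behavior described above can be sketched without the package itself. The function below, chunkAudioSketch, is a hypothetical stand-in written only from the documented behavior; the chunk length of sampleRate * maxSeconds samples and the handling of empty input are assumptions, not confirmed details of ChunkAudio.

```go
package main

import "fmt"

// chunkAudioSketch is a hypothetical stand-in for ChunkAudio, written only
// from its documented behavior: fixed-length chunks, last one zero-padded.
// Assumption: each chunk holds sampleRate*maxSeconds samples.
func chunkAudioSketch(samples []float32, sampleRate int, maxSeconds float64) [][]float32 {
	chunkLen := int(float64(sampleRate) * maxSeconds)
	if chunkLen <= 0 || len(samples) == 0 {
		// Assumption: degenerate input yields no chunks.
		return nil
	}
	var chunks [][]float32
	for start := 0; start < len(samples); start += chunkLen {
		end := start + chunkLen
		if end > len(samples) {
			end = len(samples)
		}
		// make zero-fills the slice, so a short final chunk ends in silence.
		chunk := make([]float32, chunkLen)
		copy(chunk, samples[start:end])
		chunks = append(chunks, chunk)
	}
	return chunks
}

func main() {
	samples := []float32{0.1, 0.2, 0.3, 0.4, 0.5}
	// 4 Hz and 0.5 s give 2 samples per chunk; the third chunk is padded.
	for i, c := range chunkAudioSketch(samples, 4, 0.5) {
		fmt.Println(i, c)
	}
}
```

Every chunk has the same length here, which is presumably what lets a caller feed each one to a fixed-size model input.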

func ParseWAV added in v1.36.0

func ParseWAV(data []byte) (samples []float32, sampleRate int, err error)

ParseWAV parses a mono 16-bit PCM WAV file from data and returns its samples as float32 values in [-1, 1], together with the sample rate. Only mono 16-bit PCM is supported.
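The mapping from 16-bit PCM to float32 in [-1, 1] can be illustrated on raw sample bytes. The helper pcm16ToFloat32 below is hypothetical and skips WAV header parsing entirely; dividing by 32768 is one common convention and an assumption here, since the exact scaling ParseWAV uses is not documented.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pcm16ToFloat32 is a hypothetical helper, not part of the package. It decodes
// little-endian signed 16-bit PCM and scales each sample into [-1, 1).
// Assumption: the scale factor is 32768; the package may scale differently.
func pcm16ToFloat32(pcm []byte) []float32 {
	out := make([]float32, len(pcm)/2) // a trailing odd byte is ignored
	for i := range out {
		v := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(v) / 32768
	}
	return out
}

func main() {
	// Three samples: 0, 16384 (half scale), -32768 (full negative scale).
	pcm := []byte{0x00, 0x00, 0x00, 0x40, 0x00, 0x80}
	fmt.Println(pcm16ToFloat32(pcm)) // prints [0 0.5 -1]
}
```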

Types

type MelConfig added in v1.36.0

type MelConfig struct {
	SampleRate int // audio sample rate in Hz (default 16000)
	FFTSize    int // FFT window size (default 400)
	HopLength  int // hop between windows (default 160)
	NumMels    int // number of mel filter banks (default 128)
}

MelConfig configures mel spectrogram extraction.

func DefaultMelConfig added in v1.36.0

func DefaultMelConfig() MelConfig

DefaultMelConfig returns defaults matching Whisper/Voxtral.
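To make the default values concrete, the sketch below converts them to durations. The melConfig type is a local mirror of the documented MelConfig fields so that the example compiles on its own; it is not the package type, and the helper names are hypothetical.

```go
package main

import "fmt"

// melConfig mirrors the documented MelConfig fields. It is a local stand-in,
// not the package type.
type melConfig struct {
	SampleRate, FFTSize, HopLength, NumMels int
}

// windowMillis returns the FFT window duration in milliseconds.
func windowMillis(c melConfig) float64 {
	return 1000 * float64(c.FFTSize) / float64(c.SampleRate)
}

// hopMillis returns the spacing between consecutive windows in milliseconds.
func hopMillis(c melConfig) float64 {
	return 1000 * float64(c.HopLength) / float64(c.SampleRate)
}

func main() {
	// Values taken from the documented defaults.
	cfg := melConfig{SampleRate: 16000, FFTSize: 400, HopLength: 160, NumMels: 128}
	fmt.Printf("window %.1f ms, hop %.1f ms\n", windowMillis(cfg), hopMillis(cfg))
}
```

With the documented defaults this works out to a 25 ms window advanced every 10 ms, or about 100 frames per second of audio.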

type MelExtractor added in v1.36.0

type MelExtractor struct {
	// contains filtered or unexported fields
}

MelExtractor extracts log mel spectrograms from raw audio samples.

func NewMelExtractor added in v1.36.0

func NewMelExtractor(cfg MelConfig) *MelExtractor

NewMelExtractor creates a mel spectrogram extractor.

func (*MelExtractor) Config added in v1.36.0

func (m *MelExtractor) Config() MelConfig

Config returns the extractor's configuration.

func (*MelExtractor) Extract added in v1.36.0

func (m *MelExtractor) Extract(samples []float32) (*tensor.TensorNumeric[float32], error)

Extract computes a log mel spectrogram from raw float32 audio samples. The samples should lie in the [-1, 1] range and be sampled at the configured sample rate. Returns a tensor of shape [numMels, numFrames].
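A rough way to anticipate the output shape is to divide the sample count by the hop length. The helper estimateShape below is hypothetical, and the formula is an assumption: the exact frame count depends on the padding and framing convention Extract uses, which is not documented here, so the real value may differ slightly.

```go
package main

import "fmt"

// estimateShape is a hypothetical helper that guesses the [numMels, numFrames]
// shape of a spectrogram.
// Assumption: one frame per hop with no edge adjustment, i.e. numSamples/hop.
func estimateShape(numSamples, hopLength, numMels int) [2]int {
	if hopLength <= 0 {
		return [2]int{numMels, 0}
	}
	return [2]int{numMels, numSamples / hopLength}
}

func main() {
	// 30 seconds of 16 kHz audio with the documented default hop and mel count.
	numSamples := 30 * 16000
	fmt.Println(estimateShape(numSamples, 160, 128))
}
```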

type WhisperEncoder

type WhisperEncoder[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

WhisperEncoder implements a Whisper-style audio encoder with a 2-layer Conv1D frontend (stride 2 for temporal downsampling) followed by N transformer encoder blocks (self-attention + FFN + layer norm).

Input shape: [batch, num_mels, T_frames]
Output shape: [T_downsampled, hidden_dim]
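How T_downsampled relates to T_frames follows from the standard 1-D convolution output-length formula. The sketch below applies it under assumptions borrowed from the original Whisper design (kernel size 3, padding 1, stride 1 in the first layer and stride 2 in the second); this documentation only states that stride 2 is used for downsampling, so treat the specific values and the helper name conv1dOutLen as illustrative.

```go
package main

import "fmt"

// conv1dOutLen returns the output length of a 1-D convolution using the
// standard formula floor((in + 2*padding - kernel) / stride) + 1.
func conv1dOutLen(in, kernel, stride, padding int) int {
	if stride <= 0 || in+2*padding < kernel {
		return 0
	}
	return (in+2*padding-kernel)/stride + 1
}

func main() {
	tFrames := 3000
	// Assumption: layer 1 keeps the length (stride 1), layer 2 halves it (stride 2).
	afterConv1 := conv1dOutLen(tFrames, 3, 1, 1)
	afterConv2 := conv1dOutLen(afterConv1, 3, 2, 1)
	fmt.Println(afterConv1, afterConv2)
}
```

Under these assumptions the time axis is halved once, so 3000 input frames become 1500 encoder positions.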

func NewWhisperEncoder

func NewWhisperEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg WhisperEncoderConfig,
) (*WhisperEncoder[T], error)

NewWhisperEncoder creates a new WhisperEncoder.

func (*WhisperEncoder[T]) Attributes

func (e *WhisperEncoder[T]) Attributes() map[string]interface{}

func (*WhisperEncoder[T]) Backward

func (*WhisperEncoder[T]) Forward

func (e *WhisperEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)

Forward runs the Whisper encoder.

Input: [batch, num_mels, T_frames]
Output: [T_downsampled, hidden_dim]

func (*WhisperEncoder[T]) HasAttentionBias added in v1.35.0

func (e *WhisperEncoder[T]) HasAttentionBias() bool

HasAttentionBias returns true if the encoder was configured with attention biases.

func (*WhisperEncoder[T]) OpType

func (e *WhisperEncoder[T]) OpType() string

func (*WhisperEncoder[T]) OutputShape

func (e *WhisperEncoder[T]) OutputShape() []int

func (*WhisperEncoder[T]) Parameters

func (e *WhisperEncoder[T]) Parameters() []*graph.Parameter[T]

Parameters returns all trainable parameters from the encoder. The order is: conv1 params, conv2 params, [posEnc], then per block: ln1, qProj, kProj, vProj, oProj, [qBias, kBias, vBias], ln2, ffn1, ffn2, then lnPost params.
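The documented ordering can be spelled out as a list of group labels. The function parameterGroups below is a hypothetical illustration: it reads the bracketed items as optional groups and ties the bias group to the AttentionBias setting, both of which are assumptions. Each group may hold more than one parameter (for example a weight and a bias), so positions in this list are not indices into the slice returned by Parameters.

```go
package main

import "fmt"

// parameterGroups lists parameter group labels in the documented order.
// Assumptions: "[posEnc]" and "[qBias, kBias, vBias]" are optional groups,
// and the bias group is present when attention biases are enabled.
func parameterGroups(numLayers int, hasPosEnc, attentionBias bool) []string {
	groups := []string{"conv1", "conv2"}
	if hasPosEnc {
		groups = append(groups, "posEnc")
	}
	for i := 0; i < numLayers; i++ {
		p := fmt.Sprintf("block%d.", i)
		groups = append(groups, p+"ln1", p+"qProj", p+"kProj", p+"vProj", p+"oProj")
		if attentionBias {
			groups = append(groups, p+"qBias", p+"kBias", p+"vBias")
		}
		groups = append(groups, p+"ln2", p+"ffn1", p+"ffn2")
	}
	return append(groups, "lnPost")
}

func main() {
	for _, g := range parameterGroups(2, true, false) {
		fmt.Println(g)
	}
}
```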

type WhisperEncoderConfig

type WhisperEncoderConfig struct {
	NumMels          int  // Number of mel channels (input channels for conv frontend).
	HiddenDim        int  // Hidden dimension throughout the encoder.
	NumHeads         int  // Number of attention heads per transformer block.
	NumLayers        int  // Number of transformer encoder blocks.
	KernelSize       int  // Kernel size for the conv1d frontend layers.
	IntermediateSize int  // FFN intermediate size (0 = 4*HiddenDim for backward compatibility).
	AttentionBias    bool // If true, Q/K/V projections include bias terms.
}

WhisperEncoderConfig holds configuration for a WhisperEncoder.
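Two sizes can be derived from these fields. The sketch below uses a local mirror of the documented struct (not the package type) with hypothetical helpers: ffnSize applies the documented rule that an IntermediateSize of 0 means 4*HiddenDim, while headDim assumes the usual multi-head attention convention that HiddenDim divides evenly by NumHeads, which this documentation does not state. The example values are illustrative only.

```go
package main

import "fmt"

// encoderConfig mirrors the documented WhisperEncoderConfig fields. It is a
// local stand-in, not the package type.
type encoderConfig struct {
	NumMels, HiddenDim, NumHeads, NumLayers, KernelSize, IntermediateSize int
	AttentionBias                                                         bool
}

// ffnSize resolves the FFN width using the documented default of 4*HiddenDim.
func ffnSize(c encoderConfig) int {
	if c.IntermediateSize == 0 {
		return 4 * c.HiddenDim
	}
	return c.IntermediateSize
}

// headDim returns the per-head width.
// Assumption: HiddenDim must be divisible by NumHeads.
func headDim(c encoderConfig) (int, error) {
	if c.NumHeads <= 0 || c.HiddenDim%c.NumHeads != 0 {
		return 0, fmt.Errorf("HiddenDim %d not divisible by NumHeads %d", c.HiddenDim, c.NumHeads)
	}
	return c.HiddenDim / c.NumHeads, nil
}

func main() {
	// Illustrative values, not defaults from the package.
	cfg := encoderConfig{NumMels: 128, HiddenDim: 1280, NumHeads: 20, NumLayers: 32, KernelSize: 3}
	d, err := headDim(cfg)
	fmt.Println(ffnSize(cfg), d, err)
}
```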
