Documentation ¶
Overview ¶
Package audio provides audio-related neural network layers.
Stability: alpha
Index ¶
- func ChunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32
- func ParseWAV(data []byte) (samples []float32, sampleRate int, err error)
- type MelConfig
- type MelExtractor
- type WhisperEncoder
- func (e *WhisperEncoder[T]) Attributes() map[string]interface{}
- func (e *WhisperEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], ...) ([]*tensor.TensorNumeric[T], error)
- func (e *WhisperEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (e *WhisperEncoder[T]) HasAttentionBias() bool
- func (e *WhisperEncoder[T]) OpType() string
- func (e *WhisperEncoder[T]) OutputShape() []int
- func (e *WhisperEncoder[T]) Parameters() []*graph.Parameter[T]
- type WhisperEncoderConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ChunkAudio ¶ added in v1.36.0
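func ChunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32
ChunkAudio splits audio samples into fixed-length chunks of maxSeconds each. The last chunk is padded with silence (zeros) to the full chunk length.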
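The chunk-and-pad behavior described above can be sketched as follows. This is a minimal illustration, not the package's implementation: chunkAudio is a hypothetical stand-in, and both the chunk length (sampleRate multiplied by maxSeconds, truncated to an integer) and the nil result for empty input are assumptions.

```go
package main

import "fmt"

// chunkAudio is a hypothetical stand-in for ChunkAudio. It assumes each chunk
// holds int(sampleRate * maxSeconds) samples; the real rounding rule may differ.
func chunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32 {
	chunkLen := int(float64(sampleRate) * maxSeconds)
	if chunkLen <= 0 || len(samples) == 0 {
		return nil // assumption: nothing to chunk
	}
	var chunks [][]float32
	for start := 0; start < len(samples); start += chunkLen {
		end := start + chunkLen
		if end > len(samples) {
			end = len(samples)
		}
		// make zero-fills the slice, so a short final chunk is padded with silence.
		chunk := make([]float32, chunkLen)
		copy(chunk, samples[start:end])
		chunks = append(chunks, chunk)
	}
	return chunks
}

func main() {
	// Ten samples of value 1 at a toy rate of 4 Hz, split into 1-second chunks.
	samples := make([]float32, 10)
	for i := range samples {
		samples[i] = 1
	}
	chunks := chunkAudio(samples, 4, 1.0)
	fmt.Println(len(chunks), chunks[2]) // prints: 3 [1 1 0 0]
}
```

Every chunk has the same length, which lets downstream code treat the result as a batch of equally sized inputs.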
Types ¶
type MelConfig ¶ added in v1.36.0
type MelConfig struct {
	SampleRate int // audio sample rate in Hz (default 16000)
	FFTSize    int // FFT window size (default 400)
	HopLength  int // hop between windows (default 160)
	NumMels    int // number of mel filter banks (default 128)
}
MelConfig configures mel spectrogram extraction.
func DefaultMelConfig ¶ added in v1.36.0
func DefaultMelConfig() MelConfig
DefaultMelConfig returns defaults matching Whisper/Voxtral.
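To make the defaults concrete, the sketch below converts them into time units. The melConfig type is a local mirror of the documented fields, not the package type, and the frame-count estimate (number of samples divided by HopLength) is an assumption about how frames are counted; the extractor's padding rules may produce a slightly different number.

```go
package main

import "fmt"

// melConfig mirrors the documented MelConfig fields for illustration only.
type melConfig struct {
	SampleRate int
	FFTSize    int
	HopLength  int
	NumMels    int
}

// defaultMelConfig reproduces the documented default values.
func defaultMelConfig() melConfig {
	return melConfig{SampleRate: 16000, FFTSize: 400, HopLength: 160, NumMels: 128}
}

// windowMillis returns the duration of one FFT window in milliseconds.
func windowMillis(c melConfig) float64 {
	return 1000 * float64(c.FFTSize) / float64(c.SampleRate)
}

// hopMillis returns the time step between consecutive frames in milliseconds.
func hopMillis(c melConfig) float64 {
	return 1000 * float64(c.HopLength) / float64(c.SampleRate)
}

// approxFrames estimates the frame count as numSamples / HopLength.
// Assumption: the real extractor may pad or trim differently.
func approxFrames(c melConfig, numSamples int) int {
	return numSamples / c.HopLength
}

func main() {
	c := defaultMelConfig()
	fmt.Println("window ms:", windowMillis(c))
	fmt.Println("hop ms:", hopMillis(c))
	fmt.Println("frames in 30 s:", approxFrames(c, 30*c.SampleRate))
}
```

With the defaults, a window spans 25 ms and frames advance every 10 ms, so 30 seconds of audio yields roughly 3000 frames under this estimate.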
type MelExtractor ¶ added in v1.36.0
type MelExtractor struct {
// contains filtered or unexported fields
}
MelExtractor extracts log mel spectrograms from raw audio samples.
func NewMelExtractor ¶ added in v1.36.0
func NewMelExtractor(cfg MelConfig) *MelExtractor
NewMelExtractor creates a mel spectrogram extractor.
func (*MelExtractor) Config ¶ added in v1.36.0
func (m *MelExtractor) Config() MelConfig
Config returns the extractor's configuration.
func (*MelExtractor) Extract ¶ added in v1.36.0
func (m *MelExtractor) Extract(samples []float32) (*tensor.TensorNumeric[float32], error)
Extract computes a log mel spectrogram from raw float32 audio samples. Samples should lie in the range [-1, 1] and be sampled at the configured sample rate. It returns a tensor of shape [numMels, numFrames].
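If the audio arrives as 16-bit PCM rather than float32, it has to be scaled into [-1, 1] before extraction. The helper below is a hypothetical sketch of one common convention (dividing by 32768); whether ParseWAV already applies this scaling is not stated here, so treat it as an assumption.

```go
package main

import "fmt"

// pcm16ToFloat32 scales signed 16-bit PCM samples into [-1, 1).
// Hypothetical helper: dividing by 32768 maps -32768 to exactly -1.
func pcm16ToFloat32(pcm []int16) []float32 {
	out := make([]float32, len(pcm))
	for i, v := range pcm {
		out[i] = float32(v) / 32768
	}
	return out
}

func main() {
	samples := pcm16ToFloat32([]int16{-32768, 0, 16384})
	fmt.Println(samples) // prints: [-1 0 0.5]
}
```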
type WhisperEncoder ¶
WhisperEncoder implements a Whisper-style audio encoder with a 2-layer Conv1D frontend (stride 2 for temporal downsampling) followed by N transformer encoder blocks (self-attention + FFN + layer norm).
Input shape: [batch, num_mels, T_frames]
Output shape: [T_downsampled, hidden_dim]
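The relationship between T_frames and T_downsampled follows from the standard 1-D convolution output-length formula. The sketch below is illustrative only: the kernel size of 3, padding of 1, and the split of stride 1 for the first layer and stride 2 for the second are assumptions borrowed from the original Whisper design, not values confirmed for this encoder.

```go
package main

import "fmt"

// conv1DOutLen returns the output length of a 1-D convolution:
// floor((inLen + 2*padding - kernel) / stride) + 1.
func conv1DOutLen(inLen, kernel, stride, padding int) int {
	padded := inLen + 2*padding
	if stride <= 0 || padded < kernel {
		return 0 // no valid output position
	}
	return (padded-kernel)/stride + 1
}

func main() {
	tFrames := 3000 // e.g. about 30 s of audio at a 10 ms hop

	// Assumed frontend: kernel 3, padding 1, strides 1 then 2.
	afterConv1 := conv1DOutLen(tFrames, 3, 1, 1)
	afterConv2 := conv1DOutLen(afterConv1, 3, 2, 1)

	fmt.Println(afterConv1, afterConv2)
}
```

Under these assumptions the first layer preserves the length and the second halves it, so 3000 input frames become 1500 encoder positions.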
func NewWhisperEncoder ¶
func NewWhisperEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg WhisperEncoderConfig,
) (*WhisperEncoder[T], error)
NewWhisperEncoder creates a new WhisperEncoder.
func (*WhisperEncoder[T]) Attributes ¶
func (e *WhisperEncoder[T]) Attributes() map[string]interface{}
func (*WhisperEncoder[T]) Backward ¶
func (e *WhisperEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
func (*WhisperEncoder[T]) Forward ¶
func (e *WhisperEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward runs the Whisper encoder.
Input: [batch, num_mels, T_frames]
Output: [T_downsampled, hidden_dim]
func (*WhisperEncoder[T]) HasAttentionBias ¶ added in v1.35.0
func (e *WhisperEncoder[T]) HasAttentionBias() bool
HasAttentionBias returns true if the encoder was configured with attention biases.
func (*WhisperEncoder[T]) OpType ¶
func (e *WhisperEncoder[T]) OpType() string
func (*WhisperEncoder[T]) OutputShape ¶
func (e *WhisperEncoder[T]) OutputShape() []int
func (*WhisperEncoder[T]) Parameters ¶
func (e *WhisperEncoder[T]) Parameters() []*graph.Parameter[T]
Parameters returns all trainable parameters from the encoder. The order is: conv1 params, conv2 params, [posEnc], then per block: ln1, qProj, kProj, vProj, oProj, [qBias, kBias, vBias], ln2, ffn1, ffn2, then lnPost params.
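The ordering above can be sketched as a function that lists the parameter groups for a given configuration. This is a hypothetical illustration: it assumes the bracketed entries are optional, that posEnc appears only when a learned positional encoding is present, and that the bias entries appear only when attention biases are enabled. Each name stands for a group that may contain more than one tensor.

```go
package main

import "fmt"

// parameterOrder lists parameter groups in the documented order.
// Hypothetical helper; hasPosEnc and attentionBias control the optional entries.
func parameterOrder(numLayers int, hasPosEnc, attentionBias bool) []string {
	names := []string{"conv1", "conv2"}
	if hasPosEnc {
		names = append(names, "posEnc")
	}
	for i := 0; i < numLayers; i++ {
		block := []string{"ln1", "qProj", "kProj", "vProj", "oProj"}
		if attentionBias {
			block = append(block, "qBias", "kBias", "vBias")
		}
		block = append(block, "ln2", "ffn1", "ffn2")
		for _, n := range block {
			names = append(names, fmt.Sprintf("block%d.%s", i, n))
		}
	}
	return append(names, "lnPost")
}

func main() {
	for _, n := range parameterOrder(1, true, true) {
		fmt.Println(n)
	}
}
```

A listing like this is useful when mapping checkpoint weights onto the slice returned by Parameters, since the position in the slice is the only identifier.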
type WhisperEncoderConfig ¶
type WhisperEncoderConfig struct {
	NumMels          int  // Number of mel channels (input channels for conv frontend).
	HiddenDim        int  // Hidden dimension throughout the encoder.
	NumHeads         int  // Number of attention heads per transformer block.
	NumLayers        int  // Number of transformer encoder blocks.
	KernelSize       int  // Kernel size for the conv1d frontend layers.
	IntermediateSize int  // FFN intermediate size (0 = 4*HiddenDim for backward compatibility).
	AttentionBias    bool // If true, Q/K/V projections include bias terms.
}
WhisperEncoderConfig holds configuration for a WhisperEncoder.
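The IntermediateSize fallback documented in the struct comments can be sketched as below. The encoderConfig type is a local mirror with only the relevant fields, and effectiveIntermediateSize is a hypothetical helper, not part of the package API.

```go
package main

import "fmt"

// encoderConfig mirrors the relevant WhisperEncoderConfig fields for illustration.
type encoderConfig struct {
	HiddenDim        int
	IntermediateSize int
}

// effectiveIntermediateSize applies the documented rule:
// an IntermediateSize of 0 means 4*HiddenDim.
func effectiveIntermediateSize(c encoderConfig) int {
	if c.IntermediateSize == 0 {
		return 4 * c.HiddenDim
	}
	return c.IntermediateSize
}

func main() {
	legacy := encoderConfig{HiddenDim: 1280}
	custom := encoderConfig{HiddenDim: 1280, IntermediateSize: 3000}
	fmt.Println(effectiveIntermediateSize(legacy), effectiveIntermediateSize(custom))
}
```

Treating zero as "use the default" keeps configurations written before the field existed working unchanged.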