audio

package

v1.38.3 (latest) | Published: Mar 31, 2026 | License: Apache-2.0 | Imports: 10 | Imported by: 0

Repository

github.com/zerfoo/zerfoo

Documentation ¶

Overview ¶

Package audio provides audio-related neural network layers.

Stability: alpha

Index ¶

func ChunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32
func ParseWAV(data []byte) (samples []float32, sampleRate int, err error)
type MelConfig
- func DefaultMelConfig() MelConfig
type MelExtractor
- func NewMelExtractor(cfg MelConfig) *MelExtractor
- func (m *MelExtractor) Config() MelConfig
- func (m *MelExtractor) Extract(samples []float32) (*tensor.TensorNumeric[float32], error)
type WhisperEncoder
- func NewWhisperEncoder[T tensor.Numeric](name string, engine compute.Engine[T], ops numeric.Arithmetic[T], ...) (*WhisperEncoder[T], error)
type WhisperEncoderConfig

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func ChunkAudio ¶ added in v1.36.0

func ChunkAudio(samples []float32, sampleRate int, maxSeconds float64) [][]float32

ChunkAudio splits audio samples into fixed-length chunks of maxSeconds each. If the last chunk is shorter than maxSeconds, it is padded with silence (zeros) to the full chunk length.
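The chunk-and-pad behaviour described above can be illustrated with a small standalone sketch. This is not the package's implementation: chunkSketch is a hypothetical helper, and the chunk length of sampleRate*maxSeconds samples and the handling of empty input are assumptions.

```go
package main

import "fmt"

// chunkSketch is a hypothetical stand-in for ChunkAudio, not the library's code.
// It cuts samples into chunks of sampleRate*maxSeconds samples (an assumption)
// and zero-pads the final chunk to the same length.
func chunkSketch(samples []float32, sampleRate int, maxSeconds float64) [][]float32 {
	chunkLen := int(float64(sampleRate) * maxSeconds)
	if chunkLen <= 0 || len(samples) == 0 {
		return nil
	}
	var chunks [][]float32
	for start := 0; start < len(samples); start += chunkLen {
		end := start + chunkLen
		if end > len(samples) {
			end = len(samples)
		}
		// make returns a zeroed slice, so any unfilled tail is silence.
		chunk := make([]float32, chunkLen)
		copy(chunk, samples[start:end])
		chunks = append(chunks, chunk)
	}
	return chunks
}

func main() {
	samples := []float32{0.1, 0.2, 0.3, 0.4, 0.5}
	chunks := chunkSketch(samples, 2, 1.0) // 2 Hz, 1-second chunks
	fmt.Println(len(chunks), chunks[len(chunks)-1])
}
```

Padding the last chunk keeps every chunk the same length, which is convenient when a downstream model expects a fixed number of frames.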

func ParseWAV ¶ added in v1.36.0

func ParseWAV(data []byte) (samples []float32, sampleRate int, err error)

ParseWAV parses a mono 16-bit PCM WAV file from data and returns its samples as float32 values in [-1, 1], together with the sample rate. Other channel counts and sample formats are not supported.
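The mapping from 16-bit PCM to float32 in [-1, 1] can be sketched as follows. pcm16ToFloat32 is a hypothetical helper that handles only the raw sample data (no RIFF header parsing), and the divisor 32768 is an assumption; the package may scale differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pcm16ToFloat32 is a hypothetical helper, not part of the package.
// It decodes little-endian signed 16-bit samples and scales them by 1/32768
// (an assumed scale factor), which maps int16 values into [-1, 1).
func pcm16ToFloat32(raw []byte) []float32 {
	out := make([]float32, len(raw)/2) // a trailing odd byte is ignored
	for i := range out {
		v := int16(binary.LittleEndian.Uint16(raw[2*i:]))
		out[i] = float32(v) / 32768
	}
	return out
}

func main() {
	// Three samples: 0, int16 max (32767), int16 min (-32768).
	raw := []byte{0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80}
	fmt.Println(pcm16ToFloat32(raw))
}
```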

Types ¶

type MelConfig ¶ added in v1.36.0

type MelConfig struct {
	SampleRate int // audio sample rate in Hz (default 16000)
	FFTSize    int // FFT window size (default 400)
	HopLength  int // hop between windows (default 160)
	NumMels    int // number of mel filter banks (default 128)
}

MelConfig configures mel spectrogram extraction.

func DefaultMelConfig ¶ added in v1.36.0

func DefaultMelConfig() MelConfig

DefaultMelConfig returns defaults matching Whisper/Voxtral.
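To make the default values concrete, the sketch below converts them into time units. melTiming is a local, hypothetical mirror of three MelConfig fields so that the example runs without importing the package; only the arithmetic is being illustrated.

```go
package main

import "fmt"

// melTiming is a hypothetical local mirror of three MelConfig fields.
type melTiming struct {
	SampleRate int // Hz
	FFTSize    int // samples per analysis window
	HopLength  int // samples between consecutive windows
}

// windowMillis returns the duration of one FFT window in milliseconds.
func (c melTiming) windowMillis() float64 {
	return 1000 * float64(c.FFTSize) / float64(c.SampleRate)
}

// hopMillis returns the time step between consecutive frames in milliseconds.
func (c melTiming) hopMillis() float64 {
	return 1000 * float64(c.HopLength) / float64(c.SampleRate)
}

// framesPerSecond returns how many spectrogram frames one second of audio produces.
func (c melTiming) framesPerSecond() float64 {
	return float64(c.SampleRate) / float64(c.HopLength)
}

func main() {
	// Values taken from the documented MelConfig defaults.
	c := melTiming{SampleRate: 16000, FFTSize: 400, HopLength: 160}
	fmt.Println(c.windowMillis(), c.hopMillis(), c.framesPerSecond())
}
```

With the documented defaults this works out to a 25 ms window, a 10 ms hop, and 100 frames per second of audio.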

type MelExtractor ¶ added in v1.36.0

type MelExtractor struct {
	// contains filtered or unexported fields
}

MelExtractor extracts log mel spectrograms from raw audio samples.

func NewMelExtractor ¶ added in v1.36.0

func NewMelExtractor(cfg MelConfig) *MelExtractor

NewMelExtractor creates a mel spectrogram extractor.

func (*MelExtractor) Config ¶ added in v1.36.0

func (m *MelExtractor) Config() MelConfig

Config returns the extractor's configuration.

func (*MelExtractor) Extract ¶ added in v1.36.0

func (m *MelExtractor) Extract(samples []float32) (*tensor.TensorNumeric[float32], error)

Extract computes a log mel spectrogram from raw float32 audio samples. The samples should lie in the range [-1, 1] and be sampled at the configured sample rate. It returns a tensor of shape [numMels, numFrames].

type WhisperEncoder ¶

type WhisperEncoder[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

WhisperEncoder implements a Whisper-style audio encoder with a 2-layer Conv1D frontend (stride 2 for temporal downsampling) followed by N transformer encoder blocks (self-attention + FFN + layer norm).

Input shape: [batch, num_mels, T_frames]
Output shape: [T_downsampled, hidden_dim]
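The relation between T_frames and T_downsampled follows the standard 1-D convolution length formula. The sketch below is an illustration only: the per-layer stride and padding are not stated in this documentation, so it assumes the layout of the original Whisper encoder (kernel 3, padding 1, stride 1 for the first conv and stride 2 for the second). convOutLen is a hypothetical helper.

```go
package main

import "fmt"

// convOutLen returns the output length of a 1-D convolution:
// floor((in + 2*padding - kernel) / stride) + 1.
// It is a hypothetical helper, not part of the package.
func convOutLen(in, kernel, stride, padding int) int {
	n := in + 2*padding - kernel
	if n < 0 || stride <= 0 {
		return 0
	}
	return n/stride + 1
}

func main() {
	// Assumed layout (from the original Whisper encoder, not confirmed here):
	// kernel 3 and padding 1 on both layers, stride 1 then stride 2.
	frames := 3000 // 30 s of audio at 100 frames per second
	afterConv1 := convOutLen(frames, 3, 1, 1)
	afterConv2 := convOutLen(afterConv1, 3, 2, 1)
	fmt.Println(afterConv1, afterConv2)
}
```

Under these assumptions the time axis is halved, so 3000 input frames give 1500 encoder positions.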

func NewWhisperEncoder ¶

func NewWhisperEncoder[T tensor.Numeric](
	name string,
	engine compute.Engine[T],
	ops numeric.Arithmetic[T],
	cfg WhisperEncoderConfig,
) (*WhisperEncoder[T], error)

NewWhisperEncoder creates a new WhisperEncoder.

func (*WhisperEncoder[T]) Attributes ¶

func (e *WhisperEncoder[T]) Attributes() map[string]interface{}

func (*WhisperEncoder[T]) Backward ¶

func (e *WhisperEncoder[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)

func (*WhisperEncoder[T]) Forward ¶

func (e *WhisperEncoder[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)

Forward runs the Whisper encoder.

Input: [batch, num_mels, T_frames]
Output: [T_downsampled, hidden_dim]

func (*WhisperEncoder[T]) HasAttentionBias ¶ added in v1.35.0

func (e *WhisperEncoder[T]) HasAttentionBias() bool

HasAttentionBias reports whether the encoder was configured with attention biases.

func (*WhisperEncoder[T]) OpType ¶

func (e *WhisperEncoder[T]) OpType() string

func (*WhisperEncoder[T]) OutputShape ¶

func (e *WhisperEncoder[T]) OutputShape() []int

func (*WhisperEncoder[T]) Parameters ¶

func (e *WhisperEncoder[T]) Parameters() []*graph.Parameter[T]

Parameters returns all trainable parameters from the encoder. The order is: conv1 params, conv2 params, [posEnc], then per block: ln1, qProj, kProj, vProj, oProj, [qBias, kBias, vBias], ln2, ffn1, ffn2, then lnPost params.
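The documented ordering can be written out as a list of group names, which may help when matching parameters against a checkpoint. This sketch assumes the bracketed entries are optional, and that qBias, kBias and vBias appear only when AttentionBias is set; paramOrder and the "blockN." prefixes are hypothetical. Each group may hold more than one tensor, so positions in the real slice can differ.

```go
package main

import "fmt"

// paramOrder is a hypothetical helper that lists parameter group names in the
// documented order. It does not call the package and does not reflect how many
// tensors each group contains.
func paramOrder(numLayers int, hasPosEnc, attentionBias bool) []string {
	names := []string{"conv1", "conv2"}
	if hasPosEnc {
		names = append(names, "posEnc")
	}
	for i := 0; i < numLayers; i++ {
		p := fmt.Sprintf("block%d.", i)
		names = append(names, p+"ln1", p+"qProj", p+"kProj", p+"vProj", p+"oProj")
		if attentionBias {
			names = append(names, p+"qBias", p+"kBias", p+"vBias")
		}
		names = append(names, p+"ln2", p+"ffn1", p+"ffn2")
	}
	return append(names, "lnPost")
}

func main() {
	for _, n := range paramOrder(1, true, true) {
		fmt.Println(n)
	}
}
```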

type WhisperEncoderConfig ¶

type WhisperEncoderConfig struct {
	NumMels          int  // Number of mel channels (input channels for conv frontend).
	HiddenDim        int  // Hidden dimension throughout the encoder.
	NumHeads         int  // Number of attention heads per transformer block.
	NumLayers        int  // Number of transformer encoder blocks.
	KernelSize       int  // Kernel size for the conv1d frontend layers.
	IntermediateSize int  // FFN intermediate size (0 = 4*HiddenDim for backward compatibility).
	AttentionBias    bool // If true, Q/K/V projections include bias terms.
}

WhisperEncoderConfig holds configuration for a WhisperEncoder.
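The IntermediateSize fallback described in the field comment can be sketched as below. encoderDims is a local, hypothetical mirror of three config fields. The headDim check is an assumption based on standard multi-head attention, where HiddenDim is split evenly across heads; the package's own validation is not documented here.

```go
package main

import "fmt"

// encoderDims is a hypothetical local mirror of three WhisperEncoderConfig fields.
type encoderDims struct {
	HiddenDim        int
	NumHeads         int
	IntermediateSize int
}

// ffnSize applies the documented fallback: 0 means 4*HiddenDim.
func (d encoderDims) ffnSize() int {
	if d.IntermediateSize == 0 {
		return 4 * d.HiddenDim
	}
	return d.IntermediateSize
}

// headDim returns the per-head width, assuming HiddenDim must divide evenly
// by NumHeads (standard for multi-head attention, not confirmed for this package).
func (d encoderDims) headDim() (int, error) {
	if d.NumHeads <= 0 || d.HiddenDim%d.NumHeads != 0 {
		return 0, fmt.Errorf("HiddenDim %d is not divisible by NumHeads %d", d.HiddenDim, d.NumHeads)
	}
	return d.HiddenDim / d.NumHeads, nil
}

func main() {
	d := encoderDims{HiddenDim: 1280, NumHeads: 20} // IntermediateSize left at 0
	hd, err := d.headDim()
	fmt.Println(d.ffnSize(), hd, err)
}
```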
