Documentation ¶
Overview ¶
Package govad provides voice activity detection (VAD) for 16 kHz mono audio.
It is a pure Go inference implementation of the Silero VAD neural network, with no C dependencies, no ONNX runtime, and no cgo required.
Quick start ¶
The package embeds default model weights, so getting started is a single call:
v, _ := govad.New()
prob := v.Process(samples512) // returns speech probability in [0, 1]
For custom weights exported from a different Silero VAD ONNX model, use NewFromFile or NewFromReader.
Input format ¶
Each call to VAD.Process expects exactly 512 float32 samples of 16 kHz mono audio (32 ms per frame). Samples should be normalised to the range [−1, 1]. The detector maintains internal LSTM state across calls; use VAD.Reset to start a new audio stream.
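Audio often arrives as signed 16-bit PCM rather than normalised float32. The following is a minimal sketch of one way to prepare such input, assuming the source is already 16 kHz mono. The helper names (pcm16ToFloat32, splitFrames) and the zero-padding of the last partial frame are illustrative choices, not part of govad; the local constant samplesPerFrame mirrors SamplesPerFrame so the sketch compiles on its own.

```go
package main

import "fmt"

// samplesPerFrame mirrors govad.SamplesPerFrame (512) so that this sketch
// does not need to import the package.
const samplesPerFrame = 512

// pcm16ToFloat32 scales signed 16-bit PCM samples into the range [-1, 1).
// Hypothetical helper, not part of govad.
func pcm16ToFloat32(pcm []int16) []float32 {
	out := make([]float32, len(pcm))
	for i, s := range pcm {
		out[i] = float32(s) / 32768.0
	}
	return out
}

// splitFrames cuts samples into frames of exactly samplesPerFrame values.
// A trailing partial frame is zero-padded (an assumption of this sketch;
// dropping it is an equally valid choice).
func splitFrames(samples []float32) [][]float32 {
	var frames [][]float32
	for start := 0; start < len(samples); start += samplesPerFrame {
		end := start + samplesPerFrame
		if end > len(samples) {
			end = len(samples)
		}
		frame := make([]float32, samplesPerFrame) // zero-initialised
		copy(frame, samples[start:end])
		frames = append(frames, frame)
	}
	return frames
}

func main() {
	pcm := make([]int16, 1200) // 75 ms of audio at 16 kHz
	pcm[0], pcm[1] = -32768, 16384

	frames := splitFrames(pcm16ToFloat32(pcm))
	fmt.Println(len(frames), len(frames[2]), frames[0][0], frames[0][1])
	// prints: 3 512 -1 0.5
}
```

Each resulting frame has the length that VAD.Process requires, so the frames can be passed to it one at a time in order.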
Model architecture ¶
Conv-STFT (n_fft=256) → magnitude → 4× Conv1d+ReLU → LSTMCell(128) → Conv1d(1) → Sigmoid
Weights are derived from silero_vad_half.onnx (Silero VAD v5, 16 kHz, MIT-licensed). See https://github.com/snakers4/silero-vad for the original model.
Constants ¶
const (
// SamplesPerFrame is the number of float32 audio samples per inference frame.
SamplesPerFrame = 512
)
Types ¶
type VAD ¶
type VAD struct {
// contains filtered or unexported fields
}
VAD performs voice activity detection on 16 kHz mono audio.
A VAD instance is not safe for concurrent use. Create one per goroutine, or protect calls with a mutex.
func NewFromFile ¶
NewFromFile creates a VAD detector by loading model weights from a file.
func NewFromReader ¶
NewFromReader creates a VAD detector by reading model weights from r. The binary format is a sequence of little-endian float32 values in the order produced by the export_for_go.py script.
func (*VAD) Process ¶
Process runs inference on exactly SamplesPerFrame (512) float32 samples of 16 kHz mono audio and returns the speech probability in [0.0, 1.0].
The detector maintains LSTM state across calls, so frames should be fed in chronological order. A probability above 0.5 typically indicates speech; tune the threshold to your use case.
Process panics if len(samples) != SamplesPerFrame.
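One way to apply a threshold is to turn the per-frame probabilities into speech segments with timestamps. The sketch below assumes the probabilities have already been collected from successive Process calls; the segment type and speechSegments function are hypothetical names, and a single fixed threshold is used for simplicity.

```go
package main

import "fmt"

// frameMillis is the duration of one frame: 512 samples at 16 kHz.
const frameMillis = 32

// segment is a half-open time span [startMs, endMs) classified as speech.
type segment struct{ startMs, endMs int }

// speechSegments groups consecutive frames whose probability exceeds
// threshold into segments. Hypothetical helper, not part of govad.
func speechSegments(probs []float32, threshold float32) []segment {
	var segs []segment
	start := -1 // index of the first frame of the open segment, or -1
	for i, p := range probs {
		speech := p > threshold
		switch {
		case speech && start < 0:
			start = i // speech begins
		case !speech && start >= 0:
			segs = append(segs, segment{start * frameMillis, i * frameMillis})
			start = -1 // speech ends
		}
	}
	if start >= 0 { // close a segment that runs to the end of the input
		segs = append(segs, segment{start * frameMillis, len(probs) * frameMillis})
	}
	return segs
}

func main() {
	probs := []float32{0.1, 0.7, 0.9, 0.3, 0.2, 0.8}
	fmt.Println(speechSegments(probs, 0.5)) // prints: [{32 96} {160 192}]
}
```

A single threshold can flicker when probabilities hover near the cut-off; smoothing or separate enter and exit thresholds are common refinements, left out here to keep the sketch short.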
func (*VAD) Reset ¶
func (v *VAD) Reset()
Reset clears the LSTM state so the next VAD.Process call starts a fresh audio stream. Call this between unrelated audio segments.
Directories ¶
| Path | Synopsis |
|---|---|
| examples/live-vad (command) | Command live-vad captures audio from the default microphone at 16 kHz and prints real-time voice activity detection results using the Silero VAD model. |