govad

package module

v0.0.0-...-74750ea Latest Latest Go to latest Published: Mar 30, 2026 License: MIT Imports: 7 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/zserge/govad

Links

Open Source Insights

README ¶

govad

Pure Go voice activity detection using the Silero VAD neural network.

No CGo. No ONNX runtime. No external dependencies.

The model weights are embedded in the binary.

Features

Pure Go inference (~300 lines), zero C dependencies
Processes 512-sample frames (32 ms at 16 kHz)
Stateful LSTM — feed frames sequentially, get speech probabilities
Embedded model weights — no extra files to ship
Validated against the ONNX reference (max diff < 0.001)

Installation

go get github.com/zserge/govad@latest

Usage

package main

import (
	"fmt"
	"github.com/zserge/govad"
)

func main() {
	// Create a VAD detector (uses embedded weights)
	v, err := govad.New()
	if err != nil {
		panic(err)
	}

	// Feed 512 float32 samples at 16 kHz per call
	samples := make([]float32, govad.SamplesPerFrame)
	// ... fill samples from your audio source ...

	prob := v.Process(samples)
	if prob > 0.5 {
		fmt.Println("Speech detected!")
	}

	// Call Reset() between unrelated audio streams
	v.Reset()
}

Live microphone example

The examples/live-vad directory contains a complete real-time VAD demo using malgo (miniaudio bindings):

cd examples/live-vad
go run . -threshold 0.5

It captures audio from your default microphone and prints speech/silence transitions in real time.

API

Function	Description
`govad.New()`	Create a detector with embedded default weights
`govad.NewFromFile(path)`	Load weights from a file
`govad.NewFromReader(r)`	Load weights from an `io.Reader`
`v.Process(samples)`	Run inference on 512 samples, returns probability `[0, 1]`
`v.Reset()`	Clear LSTM state for a new audio stream

Performance

On Apple M1:

BenchmarkProcess-8    1911    632370 ns/op    10112 B/op    7 allocs/op

~632 µs per 32 ms frame — roughly 50× faster than real time.

Model

The weights are exported from silero_vad_half.onnx (Silero VAD v5, 16 kHz only). The architecture is:

Audio (512 samples, 16 kHz)
  → Reflect pad (64 right)
  → Conv-STFT (n_fft=256, hop=128)
  → Magnitude spectrum
  → Conv1d(129→128, k=3) + ReLU
  → Conv1d(128→64,  k=3, stride=2) + ReLU
  → Conv1d(64→64,   k=3, stride=2) + ReLU
  → Conv1d(64→128,  k=3) + ReLU
  → LSTMCell(128)
  → ReLU → Linear(128→1) → Sigmoid
  → Speech probability

License

The Go code is MIT licensed. The model weights are from Silero VAD, also MIT licensed.

Documentation ¶

Overview ¶

Package govad provides voice activity detection (VAD) for 16 kHz mono audio.

It is a pure Go inference implementation of the Silero VAD neural network — no C dependencies, no ONNX runtime, no CGO required.

Quick start ¶

The package embeds default model weights, so getting started is a single call:

v, _ := govad.New()
prob := v.Process(samples512) // returns speech probability [0, 1]

For custom weights exported from a different Silero VAD ONNX model, use NewFromFile or NewFromReader.

Input format ¶

Each call to VAD.Process expects exactly 512 float32 samples of 16 kHz mono audio (32 ms per frame). Samples should be normalised to the range [−1, 1]. The detector maintains internal LSTM state across calls; use VAD.Reset to start a new audio stream.

Model architecture ¶

Conv-STFT (n_fft=256) → magnitude → 4× Conv1d+ReLU → LSTMCell(128) → Conv1d(1) → Sigmoid

Weights are derived from silero_vad_half.onnx (Silero VAD v5, 16 kHz, MIT-licensed). See https://github.com/snakers4/silero-vad for the original model.

Index ¶

Constants
type VAD
- func (v *VAD) Process(samples []float32) float32
- func (v *VAD) Reset()

Constants ¶

View Source

const (
	// SamplesPerFrame is the number of float32 audio samples per inference frame.
	SamplesPerFrame = 512
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type VAD ¶

type VAD struct {
	// contains filtered or unexported fields
}

VAD performs voice activity detection on 16 kHz mono audio.

A VAD instance is not safe for concurrent use. Create one per goroutine, or protect calls with a mutex.

func New ¶

func New() (*VAD, error)

New creates a VAD detector using the embedded default model weights.

func NewFromFile ¶

func NewFromFile(path string) (*VAD, error)

NewFromFile creates a VAD detector by loading model weights from a file.

func NewFromReader ¶

func NewFromReader(r io.Reader) (*VAD, error)

NewFromReader creates a VAD detector by reading model weights from r. The binary format is a sequence of little-endian float32 values in the order produced by the export_for_go.py script.

func (*VAD) Process ¶

func (v *VAD) Process(samples []float32) float32

Process runs inference on exactly SamplesPerFrame (512) float32 samples of 16 kHz mono audio and returns the speech probability in [0.0, 1.0].

The detector maintains LSTM state across calls, so frames should be fed in chronological order. A probability above 0.5 typically indicates speech; tune the threshold to your use case.

Process panics if len(samples) != SamplesPerFrame.

func (*VAD) Reset ¶

func (v *VAD) Reset()

Reset clears the LSTM state so the next VAD.Process call starts a fresh audio stream. Call this between unrelated audio segments.

Source Files ¶

View all Source files

vad.go

Directories ¶

Path	Synopsis
examples
live-vad command Command live-vad captures audio from the default microphone at 16 kHz and prints real-time voice activity detection results using the Silero VAD model.	Command live-vad captures audio from the default microphone at 16 kHz and prints real-time voice activity detection results using the Silero VAD model.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL