govad

v0.0.0-...-74750ea
Published: Mar 30, 2026 License: MIT Imports: 7 Imported by: 0

README

govad

Pure Go voice activity detection using the Silero VAD neural network.

No CGo. No ONNX runtime. No external dependencies.

The model weights are embedded in the binary.

Features

  • Pure Go inference (~300 lines), zero C dependencies
  • Processes 512-sample frames (32 ms at 16 kHz)
  • Stateful LSTM — feed frames sequentially, get speech probabilities
  • Embedded model weights — no extra files to ship
  • Validated against the ONNX reference (max diff < 0.001)

Installation

go get github.com/zserge/govad@latest

Usage

package main

import (
	"fmt"
	"github.com/zserge/govad"
)

func main() {
	// Create a VAD detector (uses embedded weights)
	v, err := govad.New()
	if err != nil {
		panic(err)
	}

	// Feed 512 float32 samples at 16 kHz per call
	samples := make([]float32, govad.SamplesPerFrame)
	// ... fill samples from your audio source ...

	prob := v.Process(samples)
	if prob > 0.5 {
		fmt.Println("Speech detected!")
	}

	// Call Reset() between unrelated audio streams
	v.Reset()
}

Live microphone example

The examples/live-vad directory contains a complete real-time VAD demo using malgo (miniaudio bindings):

cd examples/live-vad
go run . -threshold 0.5

It captures audio from your default microphone and prints speech/silence transitions in real time.

API

Function                  Description
govad.New()               Create a detector with embedded default weights
govad.NewFromFile(path)   Load weights from a file
govad.NewFromReader(r)    Load weights from an io.Reader
v.Process(samples)        Run inference on 512 samples, returns probability [0, 1]
v.Reset()                 Clear LSTM state for a new audio stream

Performance

On Apple M1:

BenchmarkProcess-8    1911    632370 ns/op    10112 B/op    7 allocs/op

~632 µs per 32 ms frame — roughly 50× faster than real time.
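
The 50× figure follows from dividing the audio covered by one frame by the measured time per call. A small sketch of that arithmetic; frameDuration and realTimeFactor are illustrative helpers, not part of govad:

```go
package main

import "time"

// frameDuration returns how much audio one frame covers at the given
// sample rate, e.g. 512 samples at 16000 Hz.
func frameDuration(samplesPerFrame, sampleRate int) time.Duration {
	return time.Duration(samplesPerFrame) * time.Second / time.Duration(sampleRate)
}

// realTimeFactor reports how many times faster than real time inference
// runs, given the audio covered by a frame and the time spent processing it.
func realTimeFactor(audio, compute time.Duration) float64 {
	return float64(audio) / float64(compute)
}
```

Plugging in the benchmark above, frameDuration(512, 16000) is 32 ms and realTimeFactor(32 ms, 632370 ns) comes out slightly above 50. The number will differ on other hardware.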

Model

The weights are exported from silero_vad_half.onnx (Silero VAD v5, 16 kHz only). The architecture is:

Audio (512 samples, 16 kHz)
  → Reflect pad (64 right)
  → Conv-STFT (n_fft=256, hop=128)
  → Magnitude spectrum
  → Conv1d(129→128, k=3) + ReLU
  → Conv1d(128→64,  k=3, stride=2) + ReLU
  → Conv1d(64→64,   k=3, stride=2) + ReLU
  → Conv1d(64→128,  k=3) + ReLU
  → LSTMCell(128)
  → ReLU → Linear(128→1) → Sigmoid
  → Speech probability
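
As a rough cross-check of the diagram, the sketch below walks the sequence lengths through the layers. It takes the diagram at face value (512 + 64 = 576 samples after padding) and assumes every Conv1d uses a padding of 1, which the README does not state. The helper names are illustrative and not part of govad.

```go
package main

// freqBins returns the number of one-sided spectrum bins for an FFT size.
func freqBins(nFFT int) int {
	return nFFT/2 + 1
}

// stftFrames returns the number of STFT frames for n input samples,
// assuming no padding beyond what the caller has already applied.
func stftFrames(n, nFFT, hop int) int {
	if n < nFFT {
		return 0
	}
	return (n-nFFT)/hop + 1
}

// convOutLen returns the output length of a 1-D convolution over n steps
// with the given kernel size, stride, and symmetric padding.
func convOutLen(n, kernel, stride, pad int) int {
	return (n+2*pad-kernel)/stride + 1
}
```

Under these assumptions, freqBins(256) gives 129, matching the input width of the first Conv1d, and the chain stftFrames(576, 256, 128) followed by the four convolutions reduces the time axis to a single step per frame before the LSTM cell.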

License

The Go code is MIT licensed. The model weights are from Silero VAD, also MIT licensed.

Documentation

Overview

Package govad provides voice activity detection (VAD) for 16 kHz mono audio.

It is a pure Go inference implementation of the Silero VAD neural network — no C dependencies, no ONNX runtime, no CGO required.

Quick start

The package embeds default model weights, so getting started is a single call:

v, _ := govad.New()
prob := v.Process(samples512) // returns speech probability [0, 1]

For custom weights exported from a different Silero VAD ONNX model, use NewFromFile or NewFromReader.

Input format

Each call to VAD.Process expects exactly 512 float32 samples of 16 kHz mono audio (32 ms per frame). Samples should be normalised to the range [−1, 1]. The detector maintains internal LSTM state across calls; use VAD.Reset to start a new audio stream.
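
Many audio sources deliver signed 16-bit PCM rather than float32. A minimal sketch of the conversion to the normalised range, assuming little-endian mono input; pcm16ToFloat32 is a hypothetical helper, not part of govad:

```go
package main

import "encoding/binary"

// pcm16ToFloat32 converts little-endian signed 16-bit PCM bytes into
// float32 samples in [-1, 1). A trailing odd byte is ignored.
func pcm16ToFloat32(pcm []byte) []float32 {
	out := make([]float32, len(pcm)/2)
	for i := range out {
		s := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(s) / 32768 // scale by 2^15 so -32768 maps to -1
	}
	return out
}
```

Dividing by 32768 keeps the most negative sample at exactly -1 and the most positive just below 1. Audio at other sample rates or with several channels would still need resampling and downmixing first.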

Model architecture

Conv-STFT (n_fft=256) → magnitude → 4× Conv1d+ReLU → LSTMCell(128) → ReLU → Linear(128→1) → Sigmoid

Weights are derived from silero_vad_half.onnx (Silero VAD v5, 16 kHz, MIT-licensed). See https://github.com/snakers4/silero-vad for the original model.

Constants

const (
	// SamplesPerFrame is the number of float32 audio samples per inference frame.
	SamplesPerFrame = 512
)

Types

type VAD

type VAD struct {
	// contains filtered or unexported fields
}

VAD performs voice activity detection on 16 kHz mono audio.

A VAD instance is not safe for concurrent use. Create one per goroutine, or protect calls with a mutex.
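
One way to follow the mutex advice is a thin wrapper. The sketch below is hypothetical and not part of govad: the detector interface simply lists the two methods documented for *govad.VAD, and stubDetector stands in for the real detector so the example compiles without the dependency.

```go
package main

import "sync"

// detector lists the two methods documented for *govad.VAD, so the wrapper
// can be written and tested without importing the package.
type detector interface {
	Process(samples []float32) float32
	Reset()
}

// lockedVAD is a hypothetical wrapper that serialises calls to a detector
// shared between goroutines.
type lockedVAD struct {
	mu sync.Mutex
	d  detector
}

// Process forwards one frame while holding the lock.
func (l *lockedVAD) Process(samples []float32) float32 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.d.Process(samples)
}

// Reset clears the wrapped detector's state while holding the lock.
func (l *lockedVAD) Reset() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.d.Reset()
}

// stubDetector stands in for *govad.VAD in this sketch; it only counts calls.
type stubDetector struct{ calls int }

func (s *stubDetector) Process(samples []float32) float32 { s.calls++; return 0 }
func (s *stubDetector) Reset()                            { s.calls = 0 }
```

With the real package, the value returned by govad.New() would take the place of the stub. A mutex prevents data races but not interleaving: frames from different streams would still pass through the same LSTM state, so independent streams are better served by one detector each.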

func New

func New() (*VAD, error)

New creates a VAD detector using the embedded default model weights.

func NewFromFile

func NewFromFile(path string) (*VAD, error)

NewFromFile creates a VAD detector by loading model weights from a file.

func NewFromReader

func NewFromReader(r io.Reader) (*VAD, error)

NewFromReader creates a VAD detector by reading model weights from r. The binary format is a sequence of little-endian float32 values in the order produced by the export_for_go.py script.
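
To make the byte layout concrete, the sketch below decodes a stream of little-endian float32 values with the standard library. It illustrates the encoding only and is not govad's loader: readFloat32s and decodeFloat32s are hypothetical names, and the order and count of the weights remain defined by export_for_go.py.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"io"
	"math"
)

// readFloat32s decodes little-endian float32 values from r until EOF.
// A stream that ends in the middle of a value yields io.ErrUnexpectedEOF.
func readFloat32s(r io.Reader) ([]float32, error) {
	var out []float32
	var buf [4]byte
	for {
		_, err := io.ReadFull(r, buf[:])
		if err == io.EOF {
			return out, nil // clean end of stream
		}
		if err != nil {
			return nil, err
		}
		bits := binary.LittleEndian.Uint32(buf[:])
		out = append(out, math.Float32frombits(bits))
	}
}

// decodeFloat32s is a convenience wrapper for data already in memory.
func decodeFloat32s(b []byte) ([]float32, error) {
	return readFloat32s(bytes.NewReader(b))
}
```

For example, the bytes 00 00 80 3f decode to 1.0. When reading from a file, wrapping the reader in a bufio.Reader avoids one small read per value.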

func (*VAD) Process

func (v *VAD) Process(samples []float32) float32

Process runs inference on exactly SamplesPerFrame (512) float32 samples of 16 kHz mono audio and returns the speech probability in [0.0, 1.0].

The detector maintains LSTM state across calls, so frames should be fed in chronological order. A probability above 0.5 typically indicates speech; tune the threshold to your use case.
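
A single fixed threshold tends to flicker when the probability hovers around it. One common remedy, sketched here as an assumption rather than anything govad provides, is hysteresis: a higher threshold to enter speech, a lower one to leave it, and a small count of consecutive low frames before switching back. The segmenter type and its field names are hypothetical.

```go
package main

// segmenter turns per-frame speech probabilities into speech/silence
// transitions using two thresholds and a hangover count.
type segmenter struct {
	on, off  float32 // enter speech at p >= on; count toward silence at p < off
	hangover int     // consecutive low frames required before leaving speech
	speaking bool
	low      int // low frames seen in a row while speaking
}

// update consumes one probability and reports the current state and
// whether that state changed on this frame.
func (s *segmenter) update(p float32) (speaking, changed bool) {
	switch {
	case !s.speaking && p >= s.on:
		s.speaking, s.low = true, 0
		return true, true
	case s.speaking && p < s.off:
		s.low++
		if s.low >= s.hangover {
			s.speaking, s.low = false, 0
			return false, true
		}
	case s.speaking:
		s.low = 0 // probability recovered, restart the hangover count
	}
	return s.speaking, false
}
```

Feeding each value returned by Process into update yields one transition per speech onset and one per speech end. Suitable values for on, off, and hangover depend on the audio and need tuning, just like the plain threshold.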

Process panics if len(samples) != SamplesPerFrame.
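
Because of the length check, callers with arbitrary-length buffers need to cut them into exact frames first. A minimal sketch, assuming the final partial frame should be zero-padded (dropping it is an equally valid choice); splitFrames is a hypothetical helper, not part of govad:

```go
package main

// splitFrames cuts samples into frames of exactly frameLen samples.
// Full frames alias the input slice; a final partial frame is copied
// into a new zero-padded slice. It returns nil if frameLen <= 0.
func splitFrames(samples []float32, frameLen int) [][]float32 {
	if frameLen <= 0 {
		return nil
	}
	var frames [][]float32
	for start := 0; start < len(samples); start += frameLen {
		end := start + frameLen
		if end <= len(samples) {
			frames = append(frames, samples[start:end])
			continue
		}
		tail := make([]float32, frameLen) // zero-padded copy of the remainder
		copy(tail, samples[start:])
		frames = append(frames, tail)
	}
	return frames
}
```

Called with govad.SamplesPerFrame as frameLen, every returned frame has the length Process expects. For live audio, buffering the remainder until the next chunk arrives avoids inserting silence into the middle of a stream.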

func (*VAD) Reset

func (v *VAD) Reset()

Reset clears the LSTM state so the next VAD.Process call starts a fresh audio stream. Call this between unrelated audio segments.

Directories

Path               Synopsis
examples/live-vad  Command live-vad captures audio from the default microphone at 16 kHz and prints real-time voice activity detection results using the Silero VAD model.