go-whisper-ct2

Module version: v1.1.0 (latest) | Published: Jan 10, 2026 | License: MIT

Repository

github.com/xPrimeTime/go-whisper-ct2

Go bindings to CTranslate2 for high-quality Whisper speech-to-text inference — without Python.

This library provides the same transcription quality as faster-whisper. In the benchmark reported below it runs about 1.23x slower than the Python implementation, provided OMP_NUM_THREADS is configured. It uses the same CTranslate2 inference engine and model format, accessed directly from Go/C++ instead of Python.

Features

High-quality Whisper transcription via CTranslate2
Performance 1.23x slower than faster-whisper (with OMP_NUM_THREADS configured)
No Python dependency: implemented in Go and C++ (linked via cgo)
Support for all Whisper model sizes (tiny, base, small, medium, large-v3)
Multiple audio formats (WAV, MP3, FLAC, OGG, AIFF, AU)
Automatic language detection (99 languages supported)
Translation to English from any supported language
Multiple output formats (text, JSON, SRT, VTT)
Quantization support (int8, float16, float32)
Thread-safe concurrent transcription
Advanced optimizations: silent chunk filtering, context conditioning, quality checks, temperature fallback

Quick Start (100% Python-Free)

# 1. Build the project
git clone https://github.com/xPrimeTime/go-whisper-ct2.git
cd go-whisper-ct2
make

# 2. Download a model (using git, no Python needed)
git clone https://huggingface.co/Systran/faster-whisper-small whisper-small-ct2

# 3. Set optimal threading for best performance
export OMP_NUM_THREADS=12  # Adjust for your CPU (see Performance section)

# 4. Transcribe audio
./bin/whisper-ct2 -model ./whisper-small-ct2 audio.wav

No Python required for download, build, or runtime! Python is only needed if you want to convert custom models with specific quantization.

Requirements

Note: Python is NOT required for building or running this library. Python is only needed if you want to convert custom Whisper models (optional - pre-converted models are available).

System Dependencies

Arch Linux:

sudo pacman -S cmake base-devel pkgconf libsndfile libsamplerate openblas

Ubuntu/Debian:

sudo apt install cmake build-essential pkg-config \
    libsndfile1-dev libsamplerate0-dev libopenblas-dev

Fedora:

sudo dnf install cmake gcc-c++ pkg-config libsndfile-devel libsamplerate-devel openblas-devel

macOS:

brew install cmake pkg-config libsndfile libsamplerate openblas

CTranslate2

CTranslate2 must be installed on your system. Build from source:

git clone --recursive https://github.com/OpenNMT/CTranslate2.git
cd CTranslate2
mkdir build && cd build

# For CPU-only (recommended for most users):
cmake .. -DWITH_MKL=OFF -DWITH_OPENBLAS=ON -DWITH_CUDA=OFF -DCMAKE_BUILD_TYPE=Release

# For CUDA GPU support:
# cmake .. -DWITH_CUDA=ON -DWITH_CUDNN=ON -DCMAKE_BUILD_TYPE=Release

make -j$(nproc)
sudo make install
sudo ldconfig

Verify installation:

pkg-config --libs ctranslate2
# Should output: -lctranslate2

Installation

Building from Source

git clone https://github.com/xPrimeTime/go-whisper-ct2.git
cd go-whisper-ct2
make

This builds:

C++ shared library (csrc/build/libwhisper_ct2.so)
Go package (pkg/whisper)
CLI binary (bin/whisper-ct2)

Installing System-Wide

# Install C++ library (requires sudo)
sudo make install-cpp

# Install CLI to your Go bin directory
make install

As a Go Library

Important: This package uses cgo and requires the C++ library. You cannot simply go get it - you must build from source first.

# Clone and build the C++ library
git clone https://github.com/xPrimeTime/go-whisper-ct2.git
cd go-whisper-ct2
make build-cpp

# Install C++ library system-wide (recommended)
sudo make install-cpp

# Now you can import in your Go code

Then in your Go project:

go get github.com/xPrimeTime/go-whisper-ct2/pkg/whisper

What users need installed:

CTranslate2 (build from source - see Requirements)
System libraries: libsndfile, libsamplerate, openblas
This package's C++ library: libwhisper_ct2.so (built via make build-cpp)

The package will link against these libraries at compile time and runtime.

Model Setup

Whisper models must be in CTranslate2 format. Pre-converted models are available - no Python required for download or runtime!

Download Pre-Converted Models (Python-Free)

Pre-converted models are available on Hugging Face. Choose your preferred download method:

Method 1: Git LFS (Recommended, No Python)

# Install git-lfs if not already installed
# Arch: sudo pacman -S git-lfs
# Ubuntu: sudo apt install git-lfs
# macOS: brew install git-lfs

git lfs install

# Clone a model (downloads all files)
git clone https://huggingface.co/Systran/faster-whisper-small whisper-small-ct2

# Or clone only the LFS pointer files first, then fetch the large files separately:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Systran/faster-whisper-small whisper-small-ct2
cd whisper-small-ct2
git lfs pull

Method 2: Direct Download with wget/curl (No Python)

# Create directory
mkdir -p whisper-small-ct2 && cd whisper-small-ct2

# Download required files
wget https://huggingface.co/Systran/faster-whisper-small/resolve/main/config.json
wget https://huggingface.co/Systran/faster-whisper-small/resolve/main/model.bin
wget https://huggingface.co/Systran/faster-whisper-small/resolve/main/tokenizer.json
wget https://huggingface.co/Systran/faster-whisper-small/resolve/main/vocabulary.txt

Method 3: Browser Download (No Python)

Visit https://huggingface.co/Systran/faster-whisper-small/tree/main
Download these files: config.json, model.bin, tokenizer.json, vocabulary.txt
Place all files in a directory (e.g., whisper-small-ct2/)

Method 4: Using huggingface-hub CLI (Optional, Requires Python)

pip install huggingface-hub
huggingface-cli download Systran/faster-whisper-small --local-dir whisper-small-ct2

Available Models

Model	Size	Speed	Accuracy	HuggingFace URL
tiny	~75 MB	Fastest	Lower	https://huggingface.co/Systran/faster-whisper-tiny
base	~145 MB	Fast	Good	https://huggingface.co/Systran/faster-whisper-base
small	~486 MB	Medium	Better	https://huggingface.co/Systran/faster-whisper-small
medium	~1.5 GB	Slow	High	https://huggingface.co/Systran/faster-whisper-medium
large-v3	~3.1 GB	Slowest	Best	https://huggingface.co/Systran/faster-whisper-large-v3

Note: Pre-converted models use float16 precision. On CPUs without float16 support (most CPUs), CTranslate2 will automatically convert to float32 at runtime. You'll see a warning message, but this is normal and doesn't affect transcription quality.

Convert Custom Models (Optional, Requires Python)

Only needed if you want custom quantization (int8, float32) or specific model variants:

# Install conversion tools (one-time)
pip install ctranslate2 transformers[torch]

# Convert with int8 quantization (fastest on CPU, smallest size)
ct2-transformers-converter --model openai/whisper-small \
    --output_dir whisper-small-ct2-int8 \
    --quantization int8

# Convert with float32 (no runtime conversion warning)
ct2-transformers-converter --model openai/whisper-small \
    --output_dir whisper-small-ct2-fp32 \
    --quantization float32

CLI Usage

Basic Usage

# Transcribe an audio file
whisper-ct2 -model ./whisper-small-ct2 audio.wav

# Specify language (faster than auto-detection)
whisper-ct2 -model ./whisper-small-ct2 -language en audio.wav

# Translate foreign language to English
whisper-ct2 -model ./whisper-small-ct2 -task translate german_audio.wav

Output Formats

# Plain text (default)
whisper-ct2 -model ./model audio.wav

# JSON with metadata
whisper-ct2 -model ./model -output json audio.wav

# SRT subtitles
whisper-ct2 -model ./model -output srt audio.wav > subtitles.srt

# WebVTT subtitles
whisper-ct2 -model ./model -output vtt audio.wav > subtitles.vtt

CLI Options

Required:
  -model string       Path to CTranslate2 model directory

Audio Options:
  -language string    Language code (e.g., "en", "es", "zh") or "auto" (default "auto")
  -task string        "transcribe" or "translate" to English (default "transcribe")

Output Options:
  -output string      Output format: text, json, srt, vtt (default "text")

Performance Options:
  -beam-size int      Beam search width, higher = more accurate but slower (default 5)
  -compute-type string  Compute precision: int8, float16, float32, default
  -threads int        CPU threads per operation, 0 = auto (default 0)

Other:
  -verbose            Show progress and timing information
  -version            Print version and exit

Examples

# Fast transcription with int8
whisper-ct2 -model ./whisper-small-ct2 -compute-type int8 audio.wav

# High accuracy with larger beam
whisper-ct2 -model ./whisper-large-v3-ct2 -beam-size 10 audio.wav

# Process multiple files
for f in *.wav; do
    whisper-ct2 -model ./model -output srt "$f" > "${f%.wav}.srt"
done

Go Library Usage

Basic Transcription

package main

import (
    "fmt"
    "log"

    "github.com/xPrimeTime/go-whisper-ct2/pkg/whisper"
)

func main() {
    // Load model with default config
    model, err := whisper.LoadModel("./whisper-small-ct2", whisper.DefaultModelConfig())
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    // Transcribe audio file
    result, err := model.TranscribeFile("audio.wav")
    if err != nil {
        log.Fatal(err)
    }

    // Print transcription
    fmt.Println(result.Text)
}

With Options

result, err := model.TranscribeFile("audio.wav",
    whisper.WithLanguage("en"),           // Skip auto-detection
    whisper.WithTask("transcribe"),       // or "translate"
    whisper.WithBeamSize(5),              // Beam search width
    whisper.WithScores(true),             // Include confidence scores
)
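
The `With...` helpers above follow Go's functional-options pattern. The sketch below shows how such options are typically applied; `transcribeConfig`, `option`, `withLanguage`, `withBeamSize`, and `newConfig` are hypothetical names for illustration, not the package's real types, and the defaults are borrowed from the CLI options on the assumption that the library uses the same ones.

```go
package main

import "fmt"

// transcribeConfig is a hypothetical settings struct used only to
// illustrate the pattern; the real package may store options differently.
type transcribeConfig struct {
	Language string
	Task     string
	BeamSize int
}

// option mutates a config; each With-style helper returns one.
type option func(*transcribeConfig)

func withLanguage(lang string) option {
	return func(c *transcribeConfig) { c.Language = lang }
}

func withBeamSize(n int) option {
	return func(c *transcribeConfig) { c.BeamSize = n }
}

// newConfig starts from defaults (assumed to mirror the CLI defaults)
// and applies the options in the order they were passed.
func newConfig(opts ...option) transcribeConfig {
	cfg := transcribeConfig{Language: "auto", Task: "transcribe", BeamSize: 5}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", newConfig(withLanguage("en"), withBeamSize(1)))
	// prints {Language:en Task:transcribe BeamSize:1}
}
```

Options that are not passed keep their defaults, which is why a call with no options at all is still valid.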

Language Detection

// Detect language without full transcription
probs, err := model.DetectLanguage("audio.wav")
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Detected: %s (%.1f%% confidence)\n",
    probs[0].Language,
    probs[0].Probability * 100)

Working with Segments

result, err := model.TranscribeFile("audio.wav")
if err != nil {
    log.Fatal(err)
}

// Access individual segments with timestamps
for _, seg := range result.Segments {
    fmt.Printf("[%v -> %v] %s\n", seg.Start, seg.End, seg.Text)
}

// Generate subtitle formats
fmt.Print(result.SRT())  // SRT format
fmt.Print(result.VTT())  // WebVTT format

Model Configuration

config := whisper.ModelConfig{
    Device:       "cpu",       // "cpu" or "cuda"
    ComputeType:  "int8",      // "int8", "float16", "float32", "default"
    InterThreads: 1,           // Parallel batch processing
    IntraThreads: 4,           // Threads per operation (0 = auto)
}

model, err := whisper.LoadModel("./model", config)

Raw PCM Audio

// Transcribe raw PCM samples (16kHz, mono, float32)
samples := []float32{ /* your audio data */ }
result, err := model.TranscribePCM(samples, whisper.WithLanguage("en"))
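
Audio captured elsewhere often arrives as interleaved 16-bit integer PCM rather than mono float32. The sketch below shows one common way to convert it; it assumes the library expects samples scaled to roughly [-1, 1], which is the usual Whisper convention but is not stated in this README. The helper names are made up for this example, and resampling to 16 kHz is not covered.

```go
package main

import "fmt"

// pcm16ToFloat32 scales signed 16-bit samples into the [-1, 1) range.
func pcm16ToFloat32(in []int16) []float32 {
	out := make([]float32, len(in))
	for i, s := range in {
		out[i] = float32(s) / 32768
	}
	return out
}

// stereoToMono averages interleaved left/right pairs into one channel.
// A trailing unpaired sample, if any, is dropped.
func stereoToMono(interleaved []float32) []float32 {
	out := make([]float32, len(interleaved)/2)
	for i := range out {
		out[i] = (interleaved[2*i] + interleaved[2*i+1]) / 2
	}
	return out
}

func main() {
	stereo := pcm16ToFloat32([]int16{0, 16384, -32768, 32767})
	mono := stereoToMono(stereo)
	fmt.Println(len(stereo), len(mono)) // prints 4 2
}
```

The resulting slice can then be passed to `TranscribePCM` once it is at 16 kHz.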

Benchmarking

A dedicated benchmark tool is included for performance testing and comparing configurations.

Build the Benchmark Tool

make build-benchmark
# Binary: bin/whisper-benchmark

Basic Usage

# Benchmark with 3 iterations (default)
./bin/whisper-benchmark --model ./whisper-base-ct2 audio.wav

# Multiple iterations for better statistics
./bin/whisper-benchmark --model ./whisper-base-ct2 --iterations 10 audio.wav

# Benchmark multiple files
./bin/whisper-benchmark --model ./whisper-base-ct2 audio1.wav audio2.wav audio3.mp3

Export Results

# Save results to JSON and CSV
./bin/whisper-benchmark \
  --model ./whisper-base-ct2 \
  --iterations 5 \
  --output-json results.json \
  --output-csv results.csv \
  audio.wav

Reported Metrics

The benchmark tool provides comprehensive statistics:

Transcription time: min, max, mean, median, standard deviation
Real-Time Factor (RTF): computed as audio_duration / transcription_time, so higher = faster (e.g., 3.5x means 3.5 seconds of audio are processed per second of wall-clock time)
Language detection results
Segment and text statistics

Example Output

Benchmarking: audio.wav
================================================================================

Audio: audio.wav
  Duration:           4.82s
  Iterations:         3

Transcription Time:
  Min:                1.234s
  Max:                1.298s
  Mean:               1.267s
  Median:             1.271s
  Std Dev:            0.027s

Real-Time Factor:
  Min RTF:            3.71x (fastest)
  Max RTF:            3.91x (slowest)
  Mean RTF:           3.81x
  Median RTF:         3.79x

Transcription Info:
  Language:           en
  Segments:           2
  Text length:        142 chars
================================================================================

Comparison Benchmarks

# Compare different quantization levels
./bin/whisper-benchmark --model ./whisper-base-ct2-int8 --compute-type int8 --output-json int8.json audio.wav
./bin/whisper-benchmark --model ./whisper-base-ct2 --output-json default.json audio.wav

# Compare beam sizes
for beam in 1 5 10; do
  ./bin/whisper-benchmark --model ./whisper-base-ct2 --beam-size $beam --output-json beam-$beam.json audio.wav
done

See cmd/benchmark/README.md for complete documentation.

Comparing with faster-whisper

Automated scripts are provided to benchmark go-whisper-ct2 against faster-whisper:

# Install faster-whisper
pip install faster-whisper

# Run automated comparison (builds both, runs benchmarks, compares results)
./scripts/run-comparison.sh --model ./whisper-base-ct2 --iterations 10 audio.wav

This will run both implementations with identical settings and display a detailed comparison:

================================================================================
PERFORMANCE COMPARISON: go-whisper-ct2 vs faster-whisper
================================================================================

Metric               Go              Python          Difference
--------------------------------------------------------------------------------
Mean Time            5.51s           4.47s           +23.3%
Mean RTF             3.34x           4.11x           -18.8%

⚠️  Go is 1.23x slower than Python (requires OMP_NUM_THREADS configuration)

See scripts/README.md and BENCHMARKING.md for detailed comparison guides.

Compute Types & Performance

Understanding Compute Types

Type	Size	Speed	Accuracy	Best For
`int8`	Smallest	Fastest	Slightly lower	CPU inference, real-time
`float16`	Medium	Fast	Full	GPU inference
`float32`	Largest	Baseline	Full	CPU without float16 support

CPU Users

Most CPUs don't have native float16 support. When using float16 models, you'll see:

[warning] The compute type inferred from the saved model is float16, but the target device
or backend do not support efficient float16 computation. The model weights have been
automatically converted to use the float32 compute type instead.

This is normal and harmless. The transcription works correctly. To avoid the warning:

Use int8 quantized models (recommended - faster too!)
Convert models with --quantization float32
Set compute type explicitly: -compute-type float32

Recommended Setup by Use Case

Use Case	Model	Compute Type
Real-time/streaming	whisper-tiny or base	int8
General transcription	whisper-small	int8
High accuracy	whisper-medium or large-v3	int8 or float32
GPU inference	Any	float16

Performance Benchmarks

Real-world performance comparison with faster-whisper (Python):

Test Setup:

Model: whisper-small (float16 → auto-converted to float32)
Hardware: AMD Ryzen 7 5800X3D (8 cores, 16 threads)
Audio: harvard.wav, 18.4 seconds

Results:

Implementation	Time	Real-Time Factor	vs Python
faster-whisper (Python, default)	4.47s	4.11x	Baseline
go-whisper-ct2 (with OMP_NUM_THREADS=12)	5.51s	3.34x	1.23x slower
go-whisper-ct2 (without OMP config)	10.5s	1.75x	2.35x slower

⚠️ IMPORTANT: Setting OMP_NUM_THREADS is critical for performance!

# Without OMP_NUM_THREADS: ~10s (2.3x slower than Python)
./bin/whisper-ct2 -model ./whisper-small-ct2 audio.wav

# With optimal OMP_NUM_THREADS: ~5.5s (1.23x slower than Python)
export OMP_NUM_THREADS=12
./bin/whisper-ct2 -model ./whisper-small-ct2 audio.wav

Optimal OMP_NUM_THREADS by CPU:

16-thread CPU (8 cores): OMP_NUM_THREADS=12
8-thread CPU (4 cores): OMP_NUM_THREADS=6
4-thread CPU (2 cores): OMP_NUM_THREADS=3
Rule of thumb: Use 75% of your total thread count
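
The 75% rule of thumb is easy to compute for the current machine. The sketch below prints a suggested `export` line; `recommendedOMPThreads` is a hypothetical helper, and the result is only a starting point to benchmark from.

```go
package main

import (
	"fmt"
	"runtime"
)

// recommendedOMPThreads applies the 75% rule of thumb to a logical
// CPU count, never returning less than one thread.
func recommendedOMPThreads(logicalCPUs int) int {
	n := logicalCPUs * 3 / 4
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// runtime.NumCPU reports logical CPUs (hardware threads).
	n := recommendedOMPThreads(runtime.NumCPU())
	fmt.Printf("export OMP_NUM_THREADS=%d\n", n)
}
```

The value is printed rather than set with `os.Setenv` because an OpenMP runtime typically reads the variable when it initializes, so exporting it in the shell before launching the process is the safer route.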

Key Takeaways:

✅ Performance 1.23x slower than faster-whisper (with proper threading)
✅ Both use identical CTranslate2 inference engine
✅ Both implement same optimizations (silent chunk filtering, context conditioning, etc.)
✅ Go version has zero Python runtime overhead
✅ Single binary deployment vs Python environment
⚠️ Must set OMP_NUM_THREADS for optimal performance

See PERFORMANCE.md for detailed analysis and optimization guide.

Optimization Features (Enabled by Default):

Silent chunk filtering - Automatically skips silent audio (2-3x faster on typical audio)
Context conditioning - Uses previous text for better accuracy
Compression ratio checks - Detects and retries hallucinated/repetitive text
Log probability thresholds - Identifies low-confidence segments
Temperature fallback - Automatically retries poor quality segments

Fine-Tuning Performance:

// Faster (more aggressive filtering, may miss some speech)
result, err := model.TranscribeFile("audio.wav",
    whisper.WithNoSpeechThreshold(0.8),        // Skip more silence
    whisper.WithBeamSize(1),                   // Greedy decoding
    whisper.WithCompressionRatioThreshold(2.0), // Stricter quality
)

// More accurate (slower, processes everything)
result, err := model.TranscribeFile("audio.wav",
    whisper.WithNoSpeechThreshold(0.0),        // Process all chunks
    whisper.WithBeamSize(10),                  // Wider beam search
    whisper.WithConditionOnPreviousText(true), // Full context
)

Troubleshooting

Library not found

error while loading shared libraries: libwhisper_ct2.so: cannot open shared object file

Solutions:

# Option 1: Set library path temporarily
export LD_LIBRARY_PATH=/path/to/go-whisper-ct2/csrc/build:$LD_LIBRARY_PATH

# Option 2: Install system-wide
sudo make install-cpp
sudo ldconfig

# Option 3: Add to your shell profile (~/.bashrc or ~/.zshrc)
echo 'export LD_LIBRARY_PATH=/path/to/go-whisper-ct2/csrc/build:$LD_LIBRARY_PATH' >> ~/.bashrc

CTranslate2 not found during build

Could not find a package configuration file provided by "ctranslate2"

Solutions:

# Check if CTranslate2 is installed
pkg-config --libs ctranslate2

# If not found, set PKG_CONFIG_PATH
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

# Or reinstall CTranslate2
cd CTranslate2/build && sudo make install && sudo ldconfig

Model loading errors

whisper: failed to load model: ...

Check:

Model directory contains model.bin and config.json
Path is correct (use absolute path if unsure)
Model was converted for CTranslate2 (not raw PyTorch)

Audio loading errors

whisper: failed to load audio: ...

Supported formats: WAV, MP3, FLAC, OGG, AIFF, AU

Check:

File exists and is readable
Audio is not corrupted
libsndfile is installed: pkg-config --libs sndfile

Slow transcription

First: Check if OMP_NUM_THREADS is set! This is the #1 cause of slow performance.

# Set optimal threading (critical for performance!)
export OMP_NUM_THREADS=12  # Adjust for your CPU

# Verify it's set
echo $OMP_NUM_THREADS

# Now transcribe
./bin/whisper-ct2 -model ./whisper-small-ct2 audio.wav

Additional optimization tips:

Set OMP_NUM_THREADS to ~75% of your CPU thread count (most important!)
Use int8_float32 compute type (if supported): -compute-type int8_float32
Use smaller model (tiny or base for real-time)
Specify language instead of auto-detect: -language en
Reduce beam size: -beam-size 1

Performance troubleshooting:

# Check current performance
time OMP_NUM_THREADS=12 ./bin/whisper-ct2 -model ./model audio.wav

# Compare with different thread counts
for threads in 4 8 12 16; do
  echo "Testing OMP_NUM_THREADS=$threads"
  time OMP_NUM_THREADS=$threads ./bin/whisper-ct2 -model ./model audio.wav
done

See PERFORMANCE.md for detailed optimization guide.

Project Structure

go-whisper-ct2/
├── csrc/                       # C++ implementation
│   ├── include/
│   │   └── whisper_ct2.h      # Public C API
│   ├── src/
│   │   ├── whisper_ct2.cpp    # Main implementation
│   │   ├── audio_processor.*  # Audio loading & preprocessing
│   │   ├── mel_filters.*      # Mel spectrogram filterbank
│   │   └── stft.*             # Short-time Fourier transform
│   ├── third_party/
│   │   └── pocketfft/         # FFT library (header-only)
│   └── CMakeLists.txt
├── pkg/whisper/                # Go package
│   ├── whisper.go             # Main API & cgo bindings
│   ├── model.go               # Model loading
│   ├── transcribe.go          # Transcription functions
│   ├── options.go             # Functional options
│   ├── result.go              # Result types & formatting
│   └── errors.go              # Error handling
├── cmd/whisper-ct2/           # CLI application
│   └── main.go
├── Makefile                   # Build orchestration
├── go.mod
├── README.md
├── DESIGN.md                  # Technical design document
└── LICENSE

How It Works

Audio Loading: libsndfile loads audio files, libsamplerate resamples to 16kHz
Preprocessing: STFT computed with PocketFFT, converted to log mel spectrogram
Inference: CTranslate2 runs the Whisper encoder-decoder model
Decoding: Beam search generates text tokens with timestamps
Output: Tokens cleaned (BPE artifacts removed) and formatted

The mel spectrogram computation matches OpenAI's original implementation and faster-whisper exactly, ensuring identical transcription quality.
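
As a rough guide to the sizes involved in the preprocessing step, the sketch below uses the front-end constants from OpenAI's reference Whisper implementation (16 kHz audio, 400-sample FFT window, 160-sample hop, 30-second chunks). These values are assumed to carry over to this library's C++ code because it matches the reference spectrogram, but they are not taken from its source; `padOrTrim` is a name chosen for this example.

```go
package main

import "fmt"

// Constants from OpenAI's reference Whisper front end (assumed here).
const (
	sampleRate     = 16000                     // Hz
	fftSize        = 400                       // 25 ms analysis window
	hopLength      = 160                       // 10 ms step between frames
	chunkSeconds   = 30                        // fixed-length input window
	chunkSamples   = sampleRate * chunkSeconds // samples per chunk
	framesPerChunk = chunkSamples / hopLength  // spectrogram frames per chunk
)

// padOrTrim returns exactly chunkSamples samples: shorter input is
// zero-padded at the end, longer input is cut off.
func padOrTrim(samples []float32) []float32 {
	out := make([]float32, chunkSamples)
	copy(out, samples)
	return out
}

func main() {
	clip := make([]float32, 184*sampleRate/10) // an 18.4 s clip
	chunk := padOrTrim(clip)
	fmt.Println(len(clip), len(chunk), framesPerChunk)
	// prints 294400 480000 3000
}
```

Each chunk therefore becomes a spectrogram with 3000 time frames, with 80 mel bins for most Whisper models and 128 for large-v3.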

Comparison with faster-whisper

Feature	faster-whisper	go-whisper-ct2
Language	Python	Go + C++
Runtime dependency	Python + packages	No Python; native shared libraries (CTranslate2, libwhisper_ct2.so)
Model format	CTranslate2	CTranslate2 (same)
Transcription quality	Reference	Identical
Performance	Baseline	1.23x slower
Performance (if OMP not set)	Baseline	2.3x slower ⚠️
Silent chunk filtering	✓	✓
Context conditioning	✓	✓
Compression ratio checks	✓	✓
Log probability thresholds	✓	✓
Temperature fallback	✓	✓
INT8 quantization	✓	Limited (backend dependent)
Word-level timestamps	✓	Planned
Silero VAD preprocessing	✓	Not implemented
Streaming transcription	✓	File-based only

Summary: Core optimization features are fully implemented, and with OMP_NUM_THREADS set the Go implementation runs about 1.23x slower than Python in the benchmark above. It offers easier deployment (no Python environment) while maintaining the same transcription quality. Remember to set OMP_NUM_THREADS for optimal performance!

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

MIT License - see LICENSE file.

Acknowledgments

CTranslate2 - Fast inference engine
faster-whisper - Reference implementation
OpenAI Whisper - Original model
PocketFFT - FFT implementation
libsndfile - Audio file I/O
libsamplerate - Sample rate conversion

Directories

Path	Synopsis
cmd/benchmark	command
pkg/whisper	Package whisper provides Go bindings to CTranslate2's Whisper speech recognition.
