omnivoice

module
v0.3.0 Latest
Published: Jan 24, 2026 License: MIT

README

OmniVoice

Voice abstraction layer for AgentPlexus supporting TTS, STT, and Voice Agents across multiple providers and transport protocols.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              OmniVoice                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │     TTS     │    │     STT     │    │          Voice Agent            │  │
│  │             │    │             │    │                                 │  │
│  │ Text → Audio│    │ Audio → Text│    │  Real-time bidirectional voice  │  │
│  └──────┬──────┘    └──────┬──────┘    └───────────────┬─────────────────┘  │
│         │                  │                           │                    │
│         ▼                  ▼                           ▼                    │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Provider Layer                              │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │ ElevenLabs  │  Deepgram   │ Google Cloud│    AWS      │   Azure     │    │
│  │ Cartesia    │  Whisper    │ AssemblyAI  │   Polly     │   Speech    │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Transport Layer                             │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   WebRTC    │     SIP     │    PSTN     │  WebSocket  │    HTTP     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Call System Integration                        │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   Twilio    │ RingCentral │    Zoom     │   LiveKit   │   Daily     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Package Structure

omnivoice/
├── tts/                    # Text-to-Speech
│   ├── tts.go              # Interface definitions
│   ├── elevenlabs/         # ElevenLabs provider
│   ├── polly/              # AWS Polly provider
│   ├── google/             # Google Cloud TTS
│   ├── azure/              # Azure Speech
│   └── cartesia/           # Cartesia provider
│
├── stt/                    # Speech-to-Text
│   ├── stt.go              # Interface definitions
│   ├── whisper/            # OpenAI Whisper
│   ├── deepgram/           # Deepgram provider
│   ├── google/             # Google Speech-to-Text
│   ├── azure/              # Azure Speech
│   └── assemblyai/         # AssemblyAI provider
│
├── agent/                  # Voice Agent orchestration
│   ├── agent.go            # Interface definitions
│   ├── session.go          # Conversation session management
│   ├── elevenlabs/         # ElevenLabs Agents
│   ├── vapi/               # Vapi.ai
│   ├── retell/             # Retell AI
│   └── custom/             # Custom agent (STT + LLM + TTS)
│
├── transport/              # Audio transport protocols
│   ├── transport.go        # Interface definitions
│   ├── webrtc/             # WebRTC transport
│   ├── websocket/          # WebSocket streaming
│   ├── sip/                # SIP protocol
│   └── http/               # HTTP-based (batch)
│
├── callsystem/             # Call system integrations
│   ├── callsystem.go       # Interface definitions
│   ├── twilio/             # Twilio ConversationRelay
│   ├── ringcentral/        # RingCentral Voice API
│   ├── zoom/               # Zoom SDK integration
│   ├── livekit/            # LiveKit rooms
│   └── daily/              # Daily.co
│
└── examples/
    ├── simple-tts/         # Basic TTS example
    ├── voice-agent/        # Voice agent with Twilio
    └── multi-provider/     # Provider fallback example
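
The `multi-provider/` entry above is described as a provider fallback example. A minimal sketch of that idea follows. The `Synthesizer` interface, `stubProvider`, and `SynthesizeWithFallback` are hypothetical names chosen for illustration; the real interface in `tts/tts.go` may look different.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Synthesizer is a hypothetical stand-in for whatever interface tts/tts.go defines.
type Synthesizer interface {
	Name() string
	Synthesize(ctx context.Context, text string) ([]byte, error)
}

// stubProvider simulates a provider for illustration; it makes no network calls.
type stubProvider struct {
	name string
	err  error // returned from Synthesize when non-nil
}

func (p stubProvider) Name() string { return p.name }

func (p stubProvider) Synthesize(_ context.Context, text string) ([]byte, error) {
	if p.err != nil {
		return nil, p.err
	}
	return []byte(p.name + ":" + text), nil
}

// SynthesizeWithFallback tries each provider in order and returns the first
// successful result together with the name of the provider that produced it.
// If every provider fails, the individual errors are joined into one.
func SynthesizeWithFallback(ctx context.Context, text string, providers ...Synthesizer) ([]byte, string, error) {
	if len(providers) == 0 {
		return nil, "", errors.New("no providers configured")
	}
	var errs []error
	for _, p := range providers {
		if err := ctx.Err(); err != nil {
			return nil, "", err // caller cancelled or deadline passed
		}
		audio, err := p.Synthesize(ctx, text)
		if err == nil {
			return audio, p.Name(), nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", p.Name(), err))
	}
	return nil, "", errors.Join(errs...)
}

// runDemo exercises the fallback path with a failing primary and a working backup.
func runDemo() (string, int, error) {
	primary := stubProvider{name: "primary", err: errors.New("rate limited")}
	backup := stubProvider{name: "backup"}
	audio, used, err := SynthesizeWithFallback(context.Background(), "hello", primary, backup)
	return used, len(audio), err
}

func main() {
	used, size, err := runDemo()
	if err != nil {
		fmt.Println("all providers failed:", err)
		return
	}
	fmt.Printf("provider=%s bytes=%d\n", used, size)
}
```

Ordering the providers by preference keeps the policy in the call site rather than in the providers themselves, which is one reasonable design among several.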

Call System Integration

How Voice Agents Connect to Phone/Video Calls

Voice AI agents need a transport layer to receive and send audio. The choice depends on the use case:

┌───────────────────────────────────────────────────────────────────────┐
│                        Call System Options                            │
├────────────────┬───────────────┬─────────────────┬────────────────────┤
│    Platform    │   Protocol    │   Best For      │   Complexity       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Twilio         │ WebRTC/SIP/   │ Phone calls,    │ Medium - managed   │
│ Conversation-  │ PSTN          │ IVR, call       │ infrastructure     │
│ Relay          │               │ centers         │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ RingCentral    │ WebRTC/SIP    │ Enterprise PBX, │ Medium - native    │
│ Voice API      │               │ business phones │ AI receptionist    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Zoom SDK       │ Proprietary   │ Video meetings  │ High - requires    │
│                │ (via SDK)     │ with voice bots │ native SDK         │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ LiveKit        │ WebRTC        │ Custom apps,    │ Low - open source  │
│                │               │ real-time AI    │ WebRTC rooms       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Daily.co       │ WebRTC        │ Embedded video, │ Low - simple API   │
│                │               │ browser-based   │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ WebSocket      │ WS/WSS        │ Web apps,       │ Low - direct       │
│ (Direct)       │               │ custom UIs      │ streaming          │
└────────────────┴───────────────┴─────────────────┴────────────────────┘

Wiring Diagram: Voice Agent in a Phone Call

┌────────────────────────────────────────────────────────────────────────────────┐
│                     PSTN/WebRTC Call Flow                                      │
│                                                                                │
│   ┌─────────┐         ┌─────────────┐          ┌───────────────────────────┐   │
│   │  User   │◄───────►│   Twilio    │◄────────►│        OmniVoice          │   │
│   │ (Phone) │  PSTN   │ Conversation│ WebSocket│                           │   │
│   │         │         │   Relay     │          │  ┌─────────────────────┐  │   │
│   └─────────┘         └─────────────┘          │  │   Voice Agent       │  │   │
│                                                │  │                     │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio In ─────────────►│  │  │  STT  │──┐       │  │   │
│                                                │  │  └───────┘  │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────────────┐  │  │   │
│                                                │  │  │  LLM / Agent  │  │  │   │
│                                                │  │  │  (Eino, etc.) │  │  │   │
│                                                │  │  └───────────────┘  │  │   │
│                                                │  │             │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio Out ◄────────────│  │  │  TTS  │◄─┘       │  │   │
│                                                │  │  └───────┘          │  │   │
│                                                │  └─────────────────────┘  │   │
│                                                └───────────────────────────┘   │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘
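
The audio path in the diagram above (Audio In, STT, LLM / Agent, TTS, Audio Out) can be sketched as a single turn handler. The three function types and the `Turn` struct below are hypothetical placeholders, not the interfaces exported by the `stt`, `agent`, or `tts` packages, and the stages are stubs so the flow runs offline.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Hypothetical stage signatures; the real packages may define richer interfaces.
type (
	Transcriber func(ctx context.Context, audio []byte) (string, error)
	Responder   func(ctx context.Context, transcript string) (string, error)
	Speaker     func(ctx context.Context, text string) ([]byte, error)
)

// Turn wires the three stages together for one conversational turn.
type Turn struct {
	STT Transcriber
	LLM Responder
	TTS Speaker
}

// Handle runs one turn: caller audio in, agent audio out.
// Each error is wrapped with the stage name so failures are easy to locate.
func (t Turn) Handle(ctx context.Context, audioIn []byte) ([]byte, error) {
	transcript, err := t.STT(ctx, audioIn)
	if err != nil {
		return nil, fmt.Errorf("stt: %w", err)
	}
	reply, err := t.LLM(ctx, transcript)
	if err != nil {
		return nil, fmt.Errorf("llm: %w", err)
	}
	audioOut, err := t.TTS(ctx, reply)
	if err != nil {
		return nil, fmt.Errorf("tts: %w", err)
	}
	return audioOut, nil
}

// demoTurn builds a Turn from stub stages: "audio" is treated as text,
// the "LLM" upper-cases it, and "TTS" turns the reply back into bytes.
func demoTurn() Turn {
	return Turn{
		STT: func(_ context.Context, audio []byte) (string, error) { return string(audio), nil },
		LLM: func(_ context.Context, s string) (string, error) { return strings.ToUpper(s), nil },
		TTS: func(_ context.Context, s string) ([]byte, error) { return []byte(s), nil },
	}
}

// runDemo pushes one fake utterance through the stubbed pipeline.
func runDemo() (string, error) {
	out, err := demoTurn().Handle(context.Background(), []byte("hello"))
	return string(out), err
}

func main() {
	out, err := runDemo()
	if err != nil {
		fmt.Println("turn failed:", err)
		return
	}
	fmt.Println("audio out:", out)
}
```

This version is strictly sequential. A production pipeline would more likely stream between stages, as the latency section later in this README suggests.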

Wiring Diagram: Voice Agent in a Zoom Meeting

┌────────────────────────────────────────────────────────────────────────────┐
│                     Zoom Meeting Flow                                      │
│                                                                            │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                         Zoom Meeting                               │   │
│   │                                                                    │   │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────────┐   │   │
│   │   │ User 1  │  │ User 2  │  │ User 3  │  │     Bot Client      │   │   │
│   │   │ (Human) │  │ (Human) │  │ (Human) │  │   (Zoom SDK)        │   │   │
│   │   └─────────┘  └─────────┘  └─────────┘  └──────────┬──────────┘   │   │
│   │                                                     │              │   │
│   └─────────────────────────────────────────────────────┼──────────────┘   │
│                                                         │                  │
│                                        Raw Audio Stream │                  │
│                                                         ▼                  │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                        OmniVoice Agent                             │   │
│   │                                                                    │   │
│   │   Option A: Use Recall.ai (recommended)                            │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │  Recall.ai  │──► Handles Zoom SDK complexity                   │   │
│   │   │     Bot     │──► Provides audio stream via WebSocket           │   │
│   │   └─────────────┘                                                  │   │
│   │                                                                    │   │
│   │   Option B: Self-hosted Zoom SDK Bot                               │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │ Zoom Linux  │──► Complex: requires native SDK                  │   │
│   │   │   SDK Bot   │──► One instance per meeting                      │   │
│   │   └─────────────┘──► Months of engineering                         │   │
│   │                                                                    │   │
│   └────────────────────────────────────────────────────────────────────┘   │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Use Case Recommendations

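┌───────────────────┬──────────────────────────┬─────────────────┬──────────────────────────────────┐
│ Use Case          │ Call System              │ Transport       │ Notes                            │
├───────────────────┼──────────────────────────┼─────────────────┼──────────────────────────────────┤
│ IVR / Call Center │ Twilio ConversationRelay │ PSTN/SIP        │ Best managed solution            │
│ Business Phone    │ RingCentral              │ WebRTC/SIP      │ Native AI Receptionist available │
│ Custom Web App    │ LiveKit or Daily         │ WebRTC          │ Open source, flexible            │
│ Zoom Meetings     │ Recall.ai + Zoom         │ SDK → WebSocket │ Avoid building Zoom bot yourself │
│ Browser Widget    │ Direct WebSocket         │ WebSocket       │ ElevenLabs widget or custom      │
│ Mobile App        │ LiveKit                  │ WebRTC          │ Cross-platform support           │
└───────────────────┴──────────────────────────┴─────────────────┴──────────────────────────────────┘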

Latency Considerations

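For natural conversation, aim for a total round-trip latency under 500ms. The typical per-stage figures below add up to roughly 400-1000ms, so reaching that target requires every stage to run near the low end of its range: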

User speaks → STT (100-300ms) → LLM (200-500ms) → TTS (100-200ms) → User hears

Target: < 500ms total for "instant" feel
Acceptable: < 1000ms for natural conversation
Poor: > 1500ms feels laggy
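
To see how the per-stage figures relate to these thresholds, the sketch below sums a latency budget and classifies the total. The `Budget` type, `Classify` function, and `ms` helper are illustrative names. The list above does not name the 1000-1500ms band, so the code labels it "degraded" as an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Budget holds per-stage latency estimates for one conversational turn.
type Budget struct {
	STT, LLM, TTS time.Duration
}

// Total is the sequential (non-overlapping) sum of all stages.
func (b Budget) Total() time.Duration { return b.STT + b.LLM + b.TTS }

// ms is a small helper that converts whole milliseconds to a Duration.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// Classify maps a total round-trip latency onto the README's thresholds.
func Classify(total time.Duration) string {
	switch {
	case total < ms(500):
		return "instant"
	case total < ms(1000):
		return "acceptable"
	case total <= ms(1500):
		return "degraded" // assumption: this band is not named in the README
	default:
		return "poor"
	}
}

func main() {
	best := Budget{STT: ms(100), LLM: ms(200), TTS: ms(100)}  // low end of each range
	worst := Budget{STT: ms(300), LLM: ms(500), TTS: ms(200)} // high end of each range
	fmt.Println(best.Total(), Classify(best.Total()))
	fmt.Println(worst.Total(), Classify(worst.Total()))
}
```

With the low-end figures the total is 400ms, which lands in the "instant" band. With the high-end figures it is 1000ms, which falls just outside the "acceptable" band.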

Optimization Strategies
  1. Streaming STT: Start processing before the user finishes speaking
  2. Streaming TTS: Start playing audio before the full response is generated
  3. Edge inference: Use providers with edge nodes (Deepgram, ElevenLabs)
  4. Turn detection: Use voice activity detection (VAD) for quick turn-taking
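
Strategy 4 can be illustrated with a very simple energy-based detector. This is a minimal sketch and not how any provider listed here implements VAD. The `TurnDetector` type, the RMS threshold, the 16-bit PCM frame layout, and the hangover count are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the root-mean-square amplitude of a frame of 16-bit PCM samples.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// TurnDetector reports the end of a turn once speech has been followed by
// Hangover consecutive silent frames.
type TurnDetector struct {
	Threshold float64 // RMS level at or above which a frame counts as speech
	Hangover  int     // number of silent frames required to end the turn

	speaking bool // true once speech has been seen in the current turn
	silent   int  // consecutive silent frames since the last speech frame
}

// Push feeds one frame and returns true when the speaker's turn has ended.
func (d *TurnDetector) Push(frame []int16) bool {
	if rms(frame) >= d.Threshold {
		d.speaking = true
		d.silent = 0
		return false
	}
	if !d.speaking {
		return false // leading silence: no turn in progress yet
	}
	d.silent++
	if d.silent >= d.Hangover {
		d.speaking, d.silent = false, 0 // reset for the next turn
		return true
	}
	return false
}

// constFrame builds a synthetic frame in which every sample has the same amplitude.
func constFrame(amp int16, n int) []int16 {
	f := make([]int16, n)
	for i := range f {
		f[i] = amp
	}
	return f
}

func main() {
	d := &TurnDetector{Threshold: 500, Hangover: 3}
	loud, quiet := constFrame(2000, 160), constFrame(10, 160)
	frames := [][]int16{quiet, loud, loud, quiet, quiet, quiet}
	for i, f := range frames {
		if d.Push(f) {
			fmt.Println("end of turn at frame", i)
		}
	}
}
```

A shorter hangover gives faster turn-taking but risks cutting the speaker off during natural pauses, so the value is a trade-off to tune per application.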

Provider Comparison

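TTS Providers
┌──────────────┬──────────┬───────────┬────────┬───────────┬───────┐
│ Provider     │ Latency  │ Quality   │ Voices │ Streaming │ Price │
├──────────────┼──────────┼───────────┼────────┼───────────┼───────┤
│ ElevenLabs   │ Low      │ Excellent │ 5000+  │ Yes       │ $$$   │
│ Cartesia     │ Very Low │ Good      │ 100+   │ Yes       │ $$    │
│ AWS Polly    │ Low      │ Good      │ 60+    │ Yes       │ $     │
│ Google TTS   │ Low      │ Good      │ 200+   │ Yes       │ $     │
│ Azure Speech │ Low      │ Excellent │ 400+   │ Yes       │ $$    │
└──────────────┴──────────┴───────────┴────────┴───────────┴───────┘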
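
STT Providers
┌──────────────────┬──────────┬───────────┬───────────┬───────────┬───────┐
│ Provider         │ Latency  │ Accuracy  │ Streaming │ Languages │ Price │
├──────────────────┼──────────┼───────────┼───────────┼───────────┼───────┤
│ Deepgram         │ Very Low │ Excellent │ Yes       │ 30+       │ $$    │
│ Whisper (OpenAI) │ Medium   │ Excellent │ No*       │ 50+       │ $     │
│ Google Speech    │ Low      │ Excellent │ Yes       │ 125+      │ $$    │
│ AssemblyAI       │ Low      │ Excellent │ Yes       │ 20+       │ $$    │
│ Azure Speech     │ Low      │ Excellent │ Yes       │ 100+      │ $$    │
└──────────────────┴──────────┴───────────┴───────────┴───────────┴───────┘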

*Whisper requires self-hosting for streaming (e.g., faster-whisper)

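Voice Agent Platforms
┌────────────────────┬───────────────┬──────────┬─────────────────┬──────────┐
│ Provider           │ Customization │ Latency  │ Telephony       │ Price    │
├────────────────────┼───────────────┼──────────┼─────────────────┼──────────┤
│ ElevenLabs Agents  │ Medium        │ Low      │ Via Twilio      │ $$$      │
│ Vapi               │ High          │ Low      │ Built-in        │ $$       │
│ Retell AI          │ High          │ Low      │ Built-in        │ $$       │
│ Custom (OmniVoice) │ Full          │ Variable │ Via integration │ Variable │
└────────────────────┴───────────────┴──────────┴─────────────────┴──────────┘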

Directories

Path                             Synopsis
agent                            Package agent provides voice agent orchestration for real-time conversations.
audio
audio/codec                      Package codec provides audio codec implementations for telephony.
callsystem                       Package callsystem provides integrations with telephony and meeting platforms.
examples
examples/simple-tts (command)    Example: Simple TTS with provider fallback
examples/twilio-agent (command)  Example: Voice agent handling inbound Twilio calls
examples/zoom-agent (command)    Example: Voice agent in Zoom meetings
mcp                              Package mcp provides an MCP (Model Context Protocol) server for voice interactions.
pipeline                         Package pipeline provides components for connecting voice processing stages.
stt                              Package stt provides a unified interface for Speech-to-Text providers.
stt/providertest                 Package providertest provides conformance tests for STT provider implementations.
transport                        Package transport provides audio transport protocols for voice agents.
tts                              Package tts provides a unified interface for Text-to-Speech providers.
tts/providertest                 Package providertest provides conformance tests for TTS provider implementations.
