> [!IMPORTANT]
> Olla is currently in active development. While it is usable, we are still finalising some features and optimisations.
> Your feedback is invaluable! Open an issue and let us know what features you'd like to see in the future.
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, supports a wide variety of endpoints natively, and is extensible enough to support others. Olla provides model discovery and a unified model catalogue within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability, or Olla for maximum performance with advanced features like circuit breakers and connection pooling.

A single CLI application and a config file are all you need to get going with Olla!
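Here's an illustrative sketch of such a config (the key names and layout below are assumptions rather than the authoritative schema; check the documentation for the real settings):

```bash
# Hypothetical config sketch; key names are illustrative assumptions,
# not the authoritative Olla schema. Consult the docs for real settings.
cat <<'EOF' > config.yaml
proxy:
  engine: sherpa                      # assumed values: "sherpa" or "olla"

discovery:
  endpoints:
    - name: workstation-ollama        # hypothetical endpoint names and URLs
      url: http://192.168.1.10:11434
      type: ollama
    - name: gpu-server-vllm
      url: http://192.168.1.20:8000
      type: openai                    # OpenAI-compatible server (e.g. vLLM)
EOF
```

Run the `olla` binary alongside the config (assuming it is picked up from the working directory) and verify it with the health check shown in the Quick Start below.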
## Key Features
Olla runs on multiple platforms and architectures:
| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux    | ✅    | ✅    | Full support including Raspberry Pi 4+ |
| macOS    | ✅    | ✅    | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows  | ✅    | ✅    | Windows 10/11 and Windows on ARM |
| Docker   | ✅    | ✅    | Multi-architecture images (amd64/arm64) |
## Quick Start
### Installation
```bash
# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Install via Go
go install github.com/thushan/olla@latest

# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release

# Run Olla
./bin/olla
```
When you have everything running, you can check it's all working with:
```bash
# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
```
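With endpoints registered, inference traffic is proxied through provider-prefixed routes (the /olla/anthropic route described below follows this pattern). As a hedged example, assuming an Ollama-backed endpoint is exposed under /olla/ollama, a chat request could look like:

```bash
# The /olla/ollama prefix is an assumption inferred from the provider-prefixed
# routing pattern; verify the exact paths in the documentation.
curl http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello from Olla!"}],
    "stream": false
  }'
```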
For detailed installation and deployment options, see the Getting Started Guide.
## Examples
We've also got ready-to-use Docker Compose setups for common scenarios:
### Common Architectures
- Home Lab: Olla → multiple Ollama (or OpenAI-compatible, e.g. vLLM) instances across your machines
- Hybrid Cloud: Olla → local endpoints + LiteLLM → cloud APIs (OpenAI, Anthropic, Bedrock, etc.)
- Enterprise: Olla → GPUStack cluster + vLLM servers + LiteLLM (cloud overflow)
- Development: Olla → local + shared team endpoints + LiteLLM (API access)
See integration patterns for detailed architectures.
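For instance, a minimal Home Lab style sketch could pair Olla with two Ollama containers. This is only a rough outline (the config mount path inside the Olla image is an assumption; the ready-made Compose setups mentioned above are the reliable reference):

```bash
# Hypothetical docker-compose sketch for the Home Lab pattern above.
# The /app/config.yaml mount path is assumed; check the published examples.
cat <<'EOF' > docker-compose.yaml
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    ports:
      - "40114:40114"
    volumes:
      - ./config.yaml:/app/config.yaml   # assumed config location in the container

  ollama-a:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"

  ollama-b:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
EOF
docker compose up -d
```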
### OpenWebUI Integration
Complete setup with OpenWebUI + Olla, load balancing multiple Ollama instances or unifying all OpenAI-compatible models.
You can learn more about OpenWebUI Ollama with Olla or see OpenWebUI OpenAI with Olla.
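As a rough sketch (the /olla/... base URLs are assumptions based on the provider-prefixed routing pattern; the guides above are the definitive setup), pointing OpenWebUI at Olla rather than a single Ollama instance might look like:

```bash
# OLLAMA_BASE_URL / OPENAI_API_BASE_URL are OpenWebUI settings; the /olla/...
# paths are assumed provider prefixes, so confirm them against the Olla docs.
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:40114/olla/ollama \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:40114/olla/openai/v1 \
  ghcr.io/open-webui/open-webui:main
```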
> [!CAUTION]
> Introduced in v0.0.20+, the Anthropic implementation is experimental and should be used with caution.
You can use CLI tools with Olla via the new Anthropic Messages API at /olla/anthropic, which lets you run Claude Code against the local AI models you have on your machine.
We have examples for:
Learn more about Anthropic API Translation.
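As an illustrative sketch only (whether the CLI honours these environment variables and how Olla maps the requested model to a local one are assumptions to verify against the Anthropic API Translation docs), pointing Claude Code at Olla might look like:

```bash
# Assumes the CLI respects ANTHROPIC_BASE_URL and that Olla serves the
# Anthropic Messages API at /olla/anthropic (as noted above).
export ANTHROPIC_BASE_URL=http://localhost:40114/olla/anthropic
export ANTHROPIC_API_KEY=local-placeholder   # local endpoints may not need a real key
claude "Summarise this repository"
```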
## Documentation
Full documentation is available at https://thushan.github.io/olla/
## Contributing
We welcome contributions! Please open an issue first to discuss major changes.
## AI Disclosure
This project has been built with the assistance of AI tools. We've utilised GitHub Copilot, Anthropic Claude, JetBrains Junie and OpenAI ChatGPT for documentation, code reviews, test refinement and troubleshooting.
## Acknowledgements
## License
Licensed under the Apache License 2.0. See LICENSE for details.
## Roadmap
- Circuit breakers: Advanced fault tolerance (Olla engine)
- Connection pooling: Per-endpoint connection management (Olla engine)
- Object pooling: Reduced GC pressure for high throughput (Olla engine)
- Model routing: Route based on model requested
- Authenticated Endpoints: Support calling authenticated (bearer token) endpoints such as OpenAI/Groq/OpenRouter
- Auto endpoint discovery: Add endpoints, let Olla determine the type
- Model benchmarking: Benchmark models across multiple endpoints easily
- Metrics export: Prometheus/OpenTelemetry integration
- Dynamic configuration: API-driven endpoint management
- TLS termination: Built-in SSL support
- Olla Admin Panel: View Olla metrics easily within the browser
- Model caching: Intelligent model preloading
- Advanced Connection Management: Authenticated endpoints (via SSH tunnels, OAuth, Tokens)
- OpenRouter Support: Support OpenRouter calls within Olla (e.g. diverting to free models on OpenRouter)
Let us know what you want to see!