PromptCache
Reduce your LLM costs. Accelerate your application.
A smart semantic cache for high-scale GenAI workloads.


The Problem
In production, a large percentage of LLM requests are repetitive:
- RAG applications: Variations of the same employee questions
- AI Agents: Repeated reasoning steps or tool calls
- Support Bots: Thousands of similar customer queries
Every redundant request means extra token cost and extra latency.
Why pay your LLM provider multiple times for the same answer?
The Solution: PromptCache
PromptCache is a lightweight middleware that sits between your application and your LLM provider.
It uses semantic similarity to detect when a new prompt has the same intent as a previous one, and returns the cached result without calling the provider again.
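To make "same intent" concrete, here is a minimal sketch of one common way to compare prompts: embed each prompt as a vector and score pairs by cosine similarity. The source does not specify which metric or embedding model PromptCache uses, so the metric, the function names (`cosine_similarity`, `best_match`), and the toy 3-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as "not similar"
    return dot / (norm_a * norm_b)


def best_match(query_vec, cache):
    """Return (prompt, score) for the most similar cached prompt, or (None, 0.0)."""
    best_prompt, best_score = None, 0.0
    for prompt, vec in cache.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


# Toy "embeddings" (hypothetical values); real ones come from an embedding model.
cache = {
    "Explain quantum physics": [0.9, 0.1, 0.0],
    "What is the refund policy?": [0.0, 0.2, 0.9],
}
prompt, score = best_match([0.8, 0.2, 0.0], cache)
print(prompt, round(score, 3))
```

A linear scan like this is only reasonable for small caches; larger deployments typically use an approximate nearest-neighbor index instead.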
Key Benefits
| Metric | Without Cache | With PromptCache | Benefit |
| --- | --- | --- | --- |
| Cost per 1,000 Requests | ~$30 | ~$6 | Lower cost |
| Avg Latency | ~1.5s | ~300ms | Faster UX |
| Throughput | API-limited | Unlimited | Better scale |
These figures are illustrative and vary with the model, usage patterns, and configuration. The pattern holds for workloads with repetitive prompts: every cache hit skips the provider call, which lowers both cost and latency.
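As a rough worked example of how the cache hit rate drives these figures: if cache hits are treated as free and instantaneous (a simplifying assumption that ignores embedding and verification overhead), the numbers in the table correspond to a hit rate of about 80%. The function names below are illustrative and not part of PromptCache.

```python
def implied_hit_rate(without_cache, with_cache):
    """Hit rate that explains a drop from `without_cache` to `with_cache`,
    assuming a cache hit costs nothing."""
    return 1 - with_cache / without_cache


def blended(miss_value, hit_value, hit_rate):
    """Average per-request value when `hit_rate` of requests are served from cache."""
    return (1 - hit_rate) * miss_value + hit_rate * hit_value


hit_rate = implied_hit_rate(30.0, 6.0)   # about 0.8
cost = blended(30.0, 0.0, hit_rate)      # about $6 per 1,000 requests
latency = blended(1.5, 0.0, hit_rate)    # about 0.3 s average

print(f"hit rate {hit_rate:.0%}, cost ${cost:.2f}, latency {latency * 1000:.0f} ms")
```

With a non-zero hit latency the average rises accordingly; for example, `blended(1.5, 0.05, 0.8)` gives roughly 0.34 s.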
Smart Semantic Matching (Safer by Design)
Naive semantic caches can be risky: they may return incorrect answers when prompts look similar but differ in intent.
PromptCache uses a two-stage verification strategy to guard against this:
- High similarity → direct cache hit
- Low similarity → skip the cache and call the provider
- Gray zone → intent check using a small, cheap verification model
This helps ensure cached responses are semantically correct, not just "close enough".
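The three cases above can be sketched as a small routing function. This is a minimal illustration, not PromptCache's actual code: the threshold values (0.95 and 0.70), the names `route` and `lookup`, and the `same_intent` callback standing in for the verification model are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    HIT = "hit"        # similarity high enough to trust the cached answer
    MISS = "miss"      # similarity too low; go to the LLM provider
    VERIFY = "verify"  # gray zone; ask the verification model


# Hypothetical thresholds; real values would be tuned per workload.
HIGH_THRESHOLD = 0.95
LOW_THRESHOLD = 0.70


def route(similarity, high=HIGH_THRESHOLD, low=LOW_THRESHOLD):
    """Map a similarity score to one of the three cases."""
    if similarity >= high:
        return Decision.HIT
    if similarity < low:
        return Decision.MISS
    return Decision.VERIFY


def lookup(similarity, cached_answer, same_intent):
    """Return the cached answer, or None when the provider must be called.

    `same_intent` is a zero-argument callable standing in for the small
    verification model; it is only invoked in the gray zone.
    """
    decision = route(similarity)
    if decision is Decision.HIT:
        return cached_answer
    if decision is Decision.VERIFY and same_intent():
        return cached_answer
    return None
```

Only gray-zone requests pay for the extra verification call, so clear hits stay fast and clear misses add no overhead beyond the similarity search.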
Quick Start
PromptCache works as a drop-in replacement for the OpenAI API.
1. Run with Docker (Recommended)
# Clone the repo
git clone https://github.com/messkan/prompt-cache.git
cd prompt-cache
# Run with Docker Compose
export OPENAI_API_KEY=your_key_here
docker-compose up -d
2. Run from Source
Once PromptCache is running, change the base_url in your SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # Point to PromptCache
    api_key="sk-..."
)

# First request: goes to the LLM provider
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum physics"}]
)

# Semantically similar request: served from PromptCache
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "How does quantum physics work?"}]
)
No other code changes are needed. Just point your client to PromptCache.
Architecture Overview
Built for speed, safety, and reliability:
- Pure Go implementation (high concurrency, minimal overhead)
- BadgerDB for fast embedded persistent storage
- In-memory caching for ultra-fast responses
- OpenAI-compatible API for seamless integration
- Docker Compose setup
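One way to picture how several storage options can sit behind the same cache logic is a small storage interface with swappable backends. The sketch below is an assumption about the general shape, written in Python for consistency with the client example; the names `Store`, `InMemoryStore`, and `cached_call` are hypothetical, and PromptCache's real Go interfaces may look quite different.

```python
from typing import Callable, Optional, Protocol


class Store(Protocol):
    """Minimal key-value contract a storage backend would satisfy."""

    def get(self, key: str) -> Optional[bytes]: ...

    def set(self, key: str, value: bytes) -> None: ...


class InMemoryStore:
    """Dictionary-backed store; a persistent backend would expose the same methods."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


def cached_call(store: Store, key: str, compute: Callable[[], bytes]) -> bytes:
    """Return the stored value for `key`, computing and storing it on a miss."""
    value = store.get(key)
    if value is None:
        value = compute()
        store.set(key, value)
    return value
```

This uses exact-key lookup only to show the storage seam; the semantic matching step would decide which key to look up.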
Roadmap
v0.1 (in progress, not stable)
- In-memory & BadgerDB storage
- Smart semantic verification (dual-threshold + intent check)
- OpenAI API compatibility
v0.2
- Redis backend for distributed caching
- Web dashboard (hit rate, latency, cost metrics)
- Built-in support for Claude & Mistral APIs
v1.0
- Clustered mode (Raft or gossip-based replication)
- Custom embedding backends (Ollama, local models)
- Rate-limiting & request shaping
License
MIT License.