Version: v0.15.0
Published: Apr 26, 2024
License: Apache-2.0
README
inference-manager
TODO
- Implement the API endpoints (but still bypass to Ollama); see the sketch after this list
- Replace Ollama with the inference manager's own implementation
- Support multiple open-source models
- Support multiple models fine-tuned by users
- Support autoscaling (with KEDA?)
- Support multi-GPU & multi-node inference (?)
- Explore optimizations
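As a concrete illustration of the first item, here is a minimal, hypothetical sketch of bypassing requests to Ollama using Go's standard reverse proxy. The address, port, and routing are assumptions for illustration, not the project's actual implementation.

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // Assumed Ollama address; adjust to wherever the engine runs Ollama.
    ollamaURL, err := url.Parse("http://localhost:11434")
    if err != nil {
        log.Fatal(err)
    }

    // Forward every request unchanged to Ollama. A real engine would add its
    // own API surface, authentication, and model routing in front of this.
    proxy := httputil.NewSingleHostReverseProxy(ollamaURL)

    mux := http.NewServeMux()
    mux.Handle("/", proxy)

    log.Fatal(http.ListenAndServe(":8080", mux))
}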
Here are some other notes:
Running the Engine Locally
Run the following commands:
make build-docker-engine
docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml
Then hit the HTTP endpoint and verify that Ollama responds.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'
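To exercise the same endpoint from Go instead of curl, a minimal sketch follows. It assumes Ollama is reachable at localhost:11434 as in the curl example and that /api/generate streams newline-delimited JSON objects with "response" and "done" fields, which is Ollama's default streaming behavior.

package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Same request body as the curl example above.
    body, err := json.Marshal(map[string]string{
        "model":  "gemma:2b",
        "prompt": "Why is the sky blue?",
    })
    if err != nil {
        log.Fatal(err)
    }

    resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Ollama streams one JSON object per line; print each generated fragment.
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk struct {
            Response string `json:"response"`
            Done     bool   `json:"done"`
        }
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            log.Fatal(err)
        }
        fmt.Print(chunk.Response)
        if chunk.Done {
            fmt.Println()
            break
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}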