Deploying LLMs in Hybrid Cloud: Why llama.cpp Wins for Us
1. Current Setup: Ollama in Testing
We have been using Ollama in a test environment with excellent results:
- Easy to use — ollama run llama3.1 just works
- Docker support is first-class:

  ```dockerfile
  FROM ollama/ollama
  COPY Modelfile /root/.ollama/
  RUN ollama create my-llama3.1 -f Modelfile
  ```

- Models are pulled, versioned, and cached automatically
- Web UI and OpenAI-compatible API available out of the box (see the curl sketch after this list)
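
The OpenAI-compatible endpoint can be exercised with nothing more than curl. A minimal sketch, assuming Ollama's default port 11434 and that llama3.1 has already been pulled:

```bash
# Query Ollama's OpenAI-compatible endpoint (default port 11434).
# Assumes the model was pulled beforehand with: ollama pull llama3.1
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```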
Verdict: Perfect for prototyping, local dev, and small-scale testing.
2. Production Requirements: Hybrid Cloud & Multi-User Access
When moving to production in a hybrid cloud, new constraints emerge:
We need a lightweight, portable, hardware-agnostic inference engine that can serve multiple users across on-premises and cloud nodes.
3. Evaluation: vLLM vs llama.cpp
| Criteria | vLLM | llama.cpp |
|---|---|---|
| Hardware Requirements | Requires AVX-512 (fails on Xeon v2, older i7) | Runs on SSE2+, AVX2 optional |
| GPU Support | Excellent (PagedAttention, high throughput) | CUDA, Metal, Vulkan — full offloading |
| CPU Performance | Poor without AVX-512 | Best-in-class quantized inference |
| Binary Size | ~200 MB + Python deps | < 10 MB (statically linked) |
| Deployment | Python server, complex deps | Single binary, scp and run |
| Multi-user / API | Built-in OpenAI API | Server binary with full OpenAI compat + web UI |
| Quantization Support | FP16/BF16 only | Q4_K, Q5_K, Q8_0, etc. — 4–8 GB models fit in RAM |
Key Finding:
> We cannot use vLLM on our legacy Xeon v2 fleet due to missing AVX-512.
> llama.cpp runs efficiently on the same hardware with Q4_K_M models.
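
One way to confirm which fleet nodes are affected is to inspect the CPU feature flags. This is a generic Linux check reading /proc/cpuinfo, not specific to either engine:

```bash
# List any AVX-512 variants the CPU advertises; empty output means no AVX-512,
# which rules out vLLM's CPU backend on that node.
grep -o 'avx512[a-z0-9]*' /proc/cpuinfo | sort -u

# Count logical CPUs reporting AVX2; llama.cpp uses it when present but does not require it.
grep -c 'avx2' /proc/cpuinfo
```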
4. Why llama.cpp Is Our Production Choice
Decision
llama.cpp is selected for hybrid cloud LLM deployment because:
- Runs everywhere: Old CPUs, new GPUs, laptops, edge
- Single static binary: No Python, no CUDA runtime hell
- GGUF format: Share models with Ollama, local files, S3
- Built-in server: OpenAI API + full web UI
- Thread & context control: --threads, --ctx-size, --n-gpu-layers
- Kubernetes-ready: Tiny image, fast startup
Example: Production-Ready Server
```bash
./llama.cpp/server \
  --model /models/llama3.1-8b-instruct.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --threads 16 \
  --ctx-size 8192 \
  --n-gpu-layers 0 \
  --log-disable
# --n-gpu-layers 0 keeps inference CPU-only for the older nodes
```
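
Once the server is up, clients can talk to it exactly as they would to Ollama's OpenAI-compatible API. A minimal smoke test; the "model" value is illustrative, since the server answers with whichever GGUF it was started with:

```bash
# Smoke test against the llama.cpp server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1-8b-instruct",
        "messages": [{"role": "user", "content": "ping"}]
      }'
```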
Deploy via Docker:
```dockerfile
FROM alpine:latest
COPY llama.cpp/server /usr/bin/
COPY models/*.gguf /models/
EXPOSE 8080
# Bind to 0.0.0.0 so the exposed port is reachable from outside the container
# (the server defaults to 127.0.0.1).
CMD ["server", "--model", "/models/llama3.1-8b-instruct.Q4_K_M.gguf", "--host", "0.0.0.0", "--port", "8080"]
```
5. Migration Path: From Ollama → llama.cpp
```bash
# 1. Reuse Ollama's GGUF blobs directly (no reconversion needed)
cp ~/.ollama/models/blobs/sha256-* /production/models/

# 2. Deploy the llama.cpp server
kubectl apply -f llama-cpp-deployment.yaml

# 3. Point clients to the new endpoint
export OPENAI_API_BASE=http://llama-cpp-prod:8080/v1
```
Zero model reconversion. Zero downtime.
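
Before cutting clients over, a health probe against the new deployment is a cheap sanity check. The hostname matches the endpoint above; the /health route is the llama.cpp server's built-in liveness endpoint:

```bash
# Fail fast (-f) if the new llama.cpp deployment is not serving yet.
curl -f http://llama-cpp-prod:8080/health && echo "llama.cpp server is up"
```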
Summary
| Use Case | Recommended Tool |
|---|---|
| Local dev / prototyping | Ollama |
| Hybrid cloud, old hardware, scale | llama.cpp |
| High-throughput GPU cluster | vLLM (if AVX-512 available) |
> llama.cpp = the Swiss Army knife of LLM inference.