.. _deployment-considerations:

Deploying LLMs in Hybrid Cloud: Why llama.cpp Wins for Us
==========================================================

.. contents:: Table of Contents
   :local:

1. Current Setup: Ollama in Testing
-----------------------------------

We have been using Ollama in a test environment with excellent results.

Verdict: Perfect for prototyping, local dev, and small-scale testing.
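
For reference, a typical smoke test in that environment looks roughly like the sketch below (the model tag and prompt are illustrative; 11434 is Ollama's default port):

.. code-block:: bash

   # Pull a quantized model and run a one-off prompt
   ollama pull llama3.1:8b
   ollama run llama3.1:8b "Summarize the deployment constraints in one sentence."

   # Or call Ollama's local REST API directly
   curl http://localhost:11434/api/generate -d '{
     "model": "llama3.1:8b",
     "prompt": "Summarize the deployment constraints in one sentence.",
     "stream": false
   }'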


2. Production Requirements: Hybrid Cloud & Multi-User Access
-------------------------------------------------------------

When moving to production in a hybrid cloud, new constraints emerge:

* a mixed fleet: a few GPU nodes alongside legacy CPU-only nodes (Xeon v2, older i7; no AVX-512)
* multiple users and services that need a shared, OpenAI-compatible HTTP API
* a footprint small enough to ship to every node, on-prem and in the cloud

We need a lightweight, portable, hardware-agnostic inference engine.


3. Evaluation: vLLM vs llama.cpp
--------------------------------

.. list-table::
   :header-rows: 1

   * - Criteria
     - vLLM
     - llama.cpp
   * - Hardware Requirements
     - Requires AVX-512 for CPU inference (fails on Xeon v2, older i7)
     - Runs on SSE2+, AVX2 optional
   * - GPU Support
     - Excellent (PagedAttention, high throughput)
     - CUDA, Metal, Vulkan with full offloading
   * - CPU Performance
     - Poor without AVX-512
     - Best-in-class quantized inference
   * - Binary Size
     - ~200 MB + Python deps
     - < 10 MB (statically linked)
   * - Deployment
     - Python server, complex deps
     - Single binary, scp and run
   * - Multi-user / API
     - Built-in OpenAI API
     - Server binary with full OpenAI compat + web UI
   * - Quantization Support
     - FP16/BF16; quantization (GPTQ/AWQ) targets GPUs
     - Q4_K, Q5_K, Q8_0, etc.; 4–8 GB models fit in RAM

Key Finding:

We cannot use vLLM on our legacy Xeon v2 fleet due to missing AVX-512. llama.cpp runs efficiently on the same hardware with Q4_K_M models.
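
A quick way to reproduce this finding on any node is to inspect the CPU feature flags; a minimal check using standard Linux tools, nothing fleet-specific assumed:

.. code-block:: bash

   # Show which SIMD extensions this CPU advertises (Xeon v2 reports AVX, but not AVX2 or AVX-512)
   grep -o -E 'sse2|avx2|avx512[a-z]*|avx' /proc/cpuinfo | sort -u

   # Gate deployments on the result: vLLM's CPU path needs AVX-512, llama.cpp does not
   grep -q avx512f /proc/cpuinfo && echo "AVX-512 present" || echo "no AVX-512: use llama.cpp"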


4. Why llama.cpp Is Our Production Choice
-----------------------------------------

llama.cpp is the only candidate that covers the whole fleet: it runs on the legacy CPU-only nodes, offloads to GPUs where they exist, ships as a single small binary, and exposes an OpenAI-compatible API out of the box.

Example: Production-Ready Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   ./llama.cpp/server \
     --model /models/llama3.1-8b-instruct.Q4_K_M.gguf \
     --port 8080 \
     --host 0.0.0.0 \
     --threads 16 \
     --ctx-size 8192 \
     --n-gpu-layers 0 \
     --log-disable    # --n-gpu-layers 0 keeps inference CPU-only on the older nodes
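
Once the server is up, any OpenAI-compatible client can use it; a quick check from the shell (model name and prompt are illustrative):

.. code-block:: bash

   # Liveness probe exposed by the llama.cpp server
   curl http://localhost:8080/health

   # OpenAI-style chat completion against the same process
   curl http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "llama3.1-8b-instruct",
           "messages": [{"role": "user", "content": "Reply with the word pong."}]
         }'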

Deploy via Docker:

.. code-block:: dockerfile

   FROM alpine:latest
   # The server binary must be statically linked (as noted above) to run on Alpine's musl libc
   COPY llama.cpp/server /usr/bin/
   COPY models/*.gguf /models/
   EXPOSE 8080
   # Bind to 0.0.0.0 so the API is reachable from outside the container
   CMD ["server", "--model", "/models/llama3.1-8b-instruct.Q4_K_M.gguf", "--host", "0.0.0.0", "--port", "8080"]

5. Migration Path: From Ollama → llama.cpp
---------------------------------------------

.. code-block:: bash

   # 1. Reuse Ollama's GGUF
   cp ~/.ollama/models/blobs/sha256-* /production/models/

   # 2. Deploy llama.cpp server
   kubectl apply -f llama-cpp-deployment.yaml

   # 3. Point clients to the new endpoint
   export OPENAI_API_BASE=http://llama-cpp-prod:8080/v1

Zero model reconversion. Zero downtime.
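
To confirm the cutover, query the new endpoint over the same OpenAI-style route the clients use (assumes the OPENAI_API_BASE set in step 3):

.. code-block:: bash

   # The llama.cpp server answers the standard model-listing route
   curl "$OPENAI_API_BASE/models"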


Summary
-------

.. list-table::
   :header-rows: 1

   * - Use Case
     - Recommended Tool
   * - Local dev / prototyping
     - Ollama
   * - Hybrid cloud, old hardware, scale
     - llama.cpp
   * - High-throughput GPU cluster
     - vLLM (if AVX-512 available)

llama.cpp = the Swiss Army knife of LLM inference.