
Deploying LLMs in Hybrid Cloud: Why llama.cpp Wins for Us


1. Current Setup: Ollama in Testing

We have been using Ollama in a test environment with excellent results:

  • Easy to use: ollama run llama3.1 just works

  • Docker support is first-class:

    FROM ollama/ollama
    COPY Modelfile /root/.ollama/
    # ollama create talks to the local daemon, so start it in the background for this build step
    RUN ollama serve & sleep 3 && ollama create my-llama3.1 -f /root/.ollama/Modelfile
    
  • Models are pulled, versioned, and cached automatically

  • Web UI and OpenAI-compatible API available out of the box

Verdict: Perfect for prototyping, local dev, and small-scale testing.
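
For example, the OpenAI-compatible endpoint mentioned above can be exercised directly with curl; this assumes Ollama is running locally on its default port 11434 and that llama3.1 has already been pulled:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'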


2. Production Requirements: Hybrid Cloud & Multi-User Access

When moving to production in a hybrid cloud, new constraints emerge:

  • Heterogeneous hardware: legacy Xeon v2 CPU nodes without AVX-512 alongside newer GPU nodes

  • Multi-user access through a single OpenAI-compatible API endpoint

  • Portable deployment across on-prem and cloud Kubernetes clusters, with a small footprint per node

In short, we need a lightweight, portable, hardware-agnostic inference engine.


3. Evaluation: vLLM vs llama.cpp

| Criteria | vLLM | llama.cpp |
| --- | --- | --- |
| Hardware Requirements | CPU backend requires AVX-512 (fails on Xeon v2, older i7) | Runs on SSE2+, AVX2 optional |
| GPU Support | Excellent (PagedAttention, high throughput) | CUDA, Metal, Vulkan with full offloading |
| CPU Performance | Poor without AVX-512 | Best-in-class quantized inference |
| Binary Size | ~200 MB + Python deps | < 10 MB (statically linked) |
| Deployment | Python server, complex deps | Single binary, scp and run |
| Multi-user / API | Built-in OpenAI API | server binary with full OpenAI compat + web UI |
| Quantization Support | FP16/BF16 only | Q4_K, Q5_K, Q8_0, etc.; 4–8 GB models fit in RAM |

Key Finding:

> We cannot use vLLM on our legacy Xeon v2 fleet due to missing AVX-512.
> llama.cpp runs efficiently on the same hardware with Q4_K_M models.
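
A quick way to verify this on any given node is to check the CPU flags (Linux only; empty output means the CPU has no AVX-512 support):

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u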


4. Why llama.cpp Is Our Production Choice

Decision

llama.cpp is selected for hybrid cloud LLM deployment because:

  • Runs everywhere: Old CPUs, new GPUs, laptops, edge

  • Single static binary: No Python, no CUDA runtime hell

  • GGUF format: Share models with Ollama, local files, S3

  • Built-in server: OpenAI API + full web UI

  • Thread & context control: --threads, --ctx-size, --n-gpu-layers

  • Kubernetes-ready: Tiny image, fast startup
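
The Q4_K_M file used in the example below can be produced from an FP16 GGUF with llama.cpp's quantize tool; the input path here is an assumption, not part of our pipeline:

./llama.cpp/quantize \
  /models/llama3.1-8b-instruct.f16.gguf \
  /models/llama3.1-8b-instruct.Q4_K_M.gguf \
  Q4_K_M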

Example: Production-Ready Server

./llama.cpp/server \
  --model /models/llama3.1-8b-instruct.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 16 \
  --ctx-size 8192 \
  --log-disable \
  --n-gpu-layers 0    # CPU-only on older nodes
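
Once the server is up, the OpenAI-compatible endpoint can be smoke-tested with curl; the request shape is the same one our clients already send to Ollama, which is what makes the later migration a one-line change (host and model name are placeholders):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'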

Deploy via Docker:

# The statically linked server binary runs on musl-based Alpine without extra libraries
FROM alpine:latest
COPY llama.cpp/server /usr/bin/
COPY models/*.gguf /models/
EXPOSE 8080
CMD ["server", "--model", "/models/llama3.1-8b-instruct.Q4_K_M.gguf", "--host", "0.0.0.0", "--port", "8080"]

5. Migration Path: Ollama → llama.cpp

# 1. Reuse Ollama's GGUF blobs (they are plain GGUF files; rename them as needed)
cp ~/.ollama/models/blobs/sha256-* /production/models/

# 2. Deploy llama.cpp server
kubectl apply -f llama-cpp-deployment.yaml

# 3. Point clients to new endpoint
export OPENAI_API_BASE=http://llama-cpp-prod:8080/v1

Zero model reconversion. Zero downtime.
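
The manifest in step 2 is not shown in this doc; a minimal sketch of what llama-cpp-deployment.yaml could contain, assuming the image from the Dockerfile above is pushed to an internal registry (image reference, replica count, and resources are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-prod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-cpp-prod
  template:
    metadata:
      labels:
        app: llama-cpp-prod
    spec:
      containers:
        - name: llama-cpp
          image: registry.example.com/llama-cpp-server:latest   # hypothetical image reference
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "8"
              memory: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpp-prod   # matches the endpoint used in step 3
spec:
  selector:
    app: llama-cpp-prod
  ports:
    - port: 8080
      targetPort: 8080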


Summary

| Use Case | Recommended Tool |
| --- | --- |
| Local dev / prototyping | Ollama |
| Hybrid cloud, old hardware, scale | llama.cpp |
| High-throughput GPU cluster | vLLM (if AVX-512 available) |

> llama.cpp = the Swiss Army knife of LLM inference.