.. _deployment-considerations:

============================================================
Deploying LLMs in Hybrid Cloud: Why llama.cpp Wins for Us
============================================================

.. contents:: Table of Contents
   :depth: 2
   :local:

----------

1. Current Setup: Ollama in Testing
====================================

We have been using **Ollama** in a **test environment** with excellent results:

- **Easy to use** — ``ollama run llama3.1`` just works
- **Docker support** is first-class:

  .. code-block:: dockerfile

     FROM ollama/ollama
     COPY Modelfile /root/.ollama/
     # ollama create talks to the local server, so start it for this build step
     RUN ollama serve & sleep 5 && ollama create my-llama3.1 -f /root/.ollama/Modelfile

- Models are pulled, versioned, and cached automatically
- Web UI and OpenAI-compatible API available out of the box

**Verdict**: Perfect for **prototyping**, **local dev**, and **small-scale testing**.

----------

2. Production Requirements: Hybrid Cloud & Multi-User Access
============================================================

When moving to **production in a hybrid cloud**, new constraints emerge:

.. list-table::
   :header-rows: 1
   :widths: 55 45

   * - Requirement
     - Challenge with Ollama
   * - **Multi-user concurrency**
     - Single-process; no built-in queueing
   * - **Horizontal scaling across nodes**
     - Not designed for clustering
   * - **Resource isolation & quotas per team/user**
     - No native support
   * - **Integration with Kubernetes / CI/CD**
     - Limited operators & observability
   * - **Hardware heterogeneity (old CPUs, no AVX-512)**
     - vLLM fails; Ollama still works

We need a **lightweight, portable, hardware-agnostic** inference engine.

----------

3. Evaluation: vLLM vs llama.cpp
=================================

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Criteria
     - vLLM
     - llama.cpp
   * - **Hardware Requirements**
     - Requires **AVX-512** (fails on Xeon v2, older i7)
     - Runs on **SSE2+**, AVX2 optional
   * - **GPU Support**
     - Excellent (PagedAttention, high throughput)
     - CUDA, Metal, Vulkan — full offloading
   * - **CPU Performance**
     - Poor without AVX-512
     - **Best-in-class** quantized inference
   * - **Binary Size**
     - ~200 MB + Python deps
     - **< 10 MB** (statically linked)
   * - **Deployment**
     - Python server, complex deps
     - Single binary, ``scp`` and run
   * - **Multi-user / API**
     - Built-in OpenAI API
     - ``server`` binary with full OpenAI compat + web UI
   * - **Quantization Support**
     - FP16/BF16; GPU-oriented quantization (GPTQ/AWQ)
     - Q4_K, Q5_K, Q8_0, etc. — **4–8 GB models fit in RAM**

**Key Finding**:

   **We cannot use vLLM** on our legacy Xeon v2 fleet due to missing **AVX-512**.
   **llama.cpp runs efficiently** on the **same hardware** with **Q4_K_M** models.
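The AVX-512 finding is easy to confirm before a node is assigned to either engine. Below is a minimal sketch (Linux-only, using nothing beyond ``grep``, ``tr``, and ``/proc/cpuinfo``) that lists the SIMD extensions a host actually exposes:

.. code-block:: bash

   # Print the SIMD feature flags reported by the first CPU. An Ivy Bridge-era
   # Xeon v2 typically lists sse2 and avx but neither avx2 nor avx512*, which
   # is exactly what rules out vLLM's CPU backend on that part of the fleet.
   grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse2|avx2?|avx512.*)$' | sort -u

Nodes that report ``avx512f`` remain candidates for vLLM; everything else goes into the llama.cpp pool.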
----------

4. Why llama.cpp Is Our Production Choice
==========================================

.. admonition:: Decision

   **llama.cpp** is selected for **hybrid cloud LLM deployment** because:

   - **Runs everywhere**: Old CPUs, new GPUs, laptops, edge
   - **Single static binary**: No Python, no CUDA runtime hell
   - **GGUF format**: Share models with Ollama, local files, S3
   - **Built-in server**: OpenAI API + full web UI
   - **Thread & context control**: ``--threads``, ``--ctx-size``, ``--n-gpu-layers``
   - **Kubernetes-ready**: Tiny image, fast startup

Example: Production-Ready Server
---------------------------------

.. code-block:: bash

   # --n-gpu-layers 0 keeps inference CPU-only on the older nodes
   ./llama.cpp/server \
     --model /models/llama3.1-8b-instruct.Q4_K_M.gguf \
     --host 0.0.0.0 \
     --port 8080 \
     --threads 16 \
     --ctx-size 8192 \
     --n-gpu-layers 0 \
     --log-disable

Deploy via Docker:

.. code-block:: dockerfile

   FROM alpine:latest
   COPY llama.cpp/server /usr/bin/
   COPY models/*.gguf /models/
   EXPOSE 8080
   # bind to 0.0.0.0 so the exposed port is reachable from outside the container
   CMD ["server", "--model", "/models/llama3.1-8b-instruct.Q4_K_M.gguf", "--host", "0.0.0.0", "--port", "8080"]

----------

5. Migration Path: From Ollama → llama.cpp
===========================================

.. code-block:: bash

   # 1. Reuse Ollama's GGUF (the large sha256-* blob is the model weights)
   cp ~/.ollama/models/blobs/sha256-* /production/models/

   # 2. Deploy llama.cpp server
   kubectl apply -f llama-cpp-deployment.yaml

   # 3. Point clients to new endpoint
   export OPENAI_API_BASE=http://llama-cpp-prod:8080/v1

**Zero model reconversion. Zero downtime.**

----------

Summary
=======

.. list-table::
   :header-rows: 1
   :widths: 55 45

   * - Use Case
     - Recommended Tool
   * - Local dev / prototyping
     - **Ollama**
   * - Hybrid cloud, old hardware, scale
     - **llama.cpp**
   * - High-throughput GPU cluster
     - vLLM (if AVX-512 available)

**llama.cpp = the Swiss Army knife of LLM inference.**

----------
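As a postscript to the migration path above, a hedged smoke test for the new endpoint. The ``llama-cpp-prod`` host and port are the values used in step 3; the ``model`` value is illustrative, since the single-model ``server`` binary serves whatever GGUF it was started with:

.. code-block:: bash

   # Call the OpenAI-compatible chat endpoint exposed by the llama.cpp server
   # and check that a completion comes back.
   curl -s http://llama-cpp-prod:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "llama3.1-8b-instruct", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'

Any OpenAI-compatible client pointed at ``OPENAI_API_BASE`` exercises the same code path.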