==================================
Running Llama 3.1 with llama.cpp
==================================

.. contents:: Table of Contents
   :depth: 2
   :local:

----------

.. _llama-cpp:model-format-gguf:

1. Model Format: GGUF
=====================

``llama.cpp`` uses the **GGUF** (GPT-Generated Unified Format) model format. You can download a pre-quantized **Llama 3 8B** model in GGUF format directly from Hugging Face (the same steps apply to Llama 3.1 GGUF files):

.. code-block:: text

   https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF

Example file: ``Meta-Llama-3-8B.Q4_K_S.gguf`` (~4.7 GB, 4-bit quantization, a good quality/size tradeoff).

.. note::

   The ``Q4_K_S`` variant needs roughly 4.7 GB of RAM and runs efficiently on Intel i7 CPUs.

----------

.. _llama-cpp:compile-llama-cpp-for-intel-i7-cpu-only:

2. Compile llama.cpp for Intel i7 (CPU-only)
============================================

.. code-block:: bash

   # Step 1: Clone the repository
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp

   # Step 2: Build for CPU (Intel i7)
   # A plain build targets the host CPU; AVX2 is picked up automatically
   # when the CPU supports it, which most i7 generations do.
   make clean
   make -j$(nproc)

The binaries will be in the repository root:

- ``./llama-cli`` → interactive CLI
- ``./server`` → web server (OpenAI-compatible API + full web UI)

.. warning::

   **Do not use vLLM** on older Xeon v2 CPUs: they **lack AVX-512**, which vLLM requires.
   **llama.cpp is the better choice** here, since it runs efficiently with just AVX2 or even SSE.

----------

.. _llama-cpp:run-the-model-with-web-interface:

3. Run the Model with Web Interface
===================================

Place the downloaded GGUF file in a ``models/`` folder:

.. code-block:: bash

   mkdir -p models
   # Copy or symlink the model
   ln -s /path/to/Meta-Llama-3-8B.Q4_K_S.gguf models/

Start the server on port **8087** using **12 threads**:

.. code-block:: bash

   ./server \
     --model models/Meta-Llama-3-8B.Q4_K_S.gguf \
     --port 8087 \
     --threads 12 \
     --host 0.0.0.0

On a corporate network, make sure ``curl`` bypasses the HTTP proxy for local addresses:

.. code-block:: bash

   curl -X POST http://127.0.0.1:8087/v1/chat/completions \
     --noproxy 127.0.0.1,localhost \
     -H "Content-Type: application/json" \
     -d '{
       "model": "Meta-Llama-3-8B",
       "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Hello! How are you?"}
       ]
     }'

.. _llama-cpp:features:

Features
--------

- **Full web UI** at: http://localhost:8087
- **OpenAI-compatible API** at: http://localhost:8087/v1
- List models:

.. code-block:: bash

   curl http://localhost:8087/v1/models

**Response**:

.. code-block:: json

   {
     "object": "list",
     "data": [
       {
         "id": "models/Meta-Llama-3-8B.Q4_K_S.gguf",
         "object": "model",
         "created": 1762783003,
         "owned_by": "llamacpp",
         "meta": {
           "vocab_type": 2,
           "n_vocab": 128256,
           "n_ctx_train": 8192,
           "n_embd": 4096,
           "n_params": 8030261248,
           "size": 4684832768
         }
       }
     ]
   }

----------

.. _llama-cpp:compile-llama-cpp-with-nvidia-gpu-support-cuda:

4. Compile llama.cpp with NVIDIA GPU Support (CUDA)
====================================================

If you have an **NVIDIA GPU** (e.g., RTX 3060, RTX 4070, or A100), you can enable **CUDA acceleration**.

.. _llama-cpp:prerequisites:

Prerequisites
-------------

- NVIDIA driver (≥ 525)
- CUDA Toolkit (≥ 11.8, preferably 12.x)
- ``nvcc`` in ``$PATH``
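These prerequisites can be checked quickly from a shell before building (a minimal sketch; the expected versions are the ones listed above):

.. code-block:: bash

   # Driver version and visible GPUs, as reported by the NVIDIA driver
   nvidia-smi

   # CUDA toolkit version; also confirms that nvcc is on $PATH
   nvcc --version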
.. _llama-cpp:build-with-cuda:

Build with CUDA
---------------

.. code-block:: bash

   # Clean previous build
   make clean

   # Build with full CUDA support
   make -j$(nproc) \
     LLAMA_CUDA=1 \
     LLAMA_CUDA_DMMV=1 \
     LLAMA_CUDA_F16=1

   # Optional: specify the compute capability explicitly (e.g., for RTX 40xx)
   # make LLAMA_CUDA=1 CUDA_ARCH="-gencode arch=compute_89,code=sm_89"

.. _llama-cpp:run-with-gpu-offloading:

Run with GPU offloading
-----------------------

.. code-block:: bash

   # --n-gpu-layers 999 offloads ALL layers to the GPU
   ./server \
     --model models/Meta-Llama-3-8B.Q4_K_S.gguf \
     --port 8087 \
     --threads 8 \
     --n-gpu-layers 999 \
     --host 0.0.0.0

.. tip::

   Use ``nvidia-smi`` to monitor VRAM usage. For an 8B Q4 model (~4.7 GB), even a **6 GB GPU** can run it fully offloaded.

----------

.. _llama-cpp:summary:

Summary
=======

+------------------------+-----------------------------------------------+
| Feature                | Command / Note                                |
+========================+===============================================+
| Model Format           | **GGUF**                                      |
+------------------------+-----------------------------------------------+
| Download               | https://huggingface.co/QuantFactory/...GGUF   |
+------------------------+-----------------------------------------------+
| CPU Build (i7)         | ``make -j$(nproc)``                           |
+------------------------+-----------------------------------------------+
| GPU Build (CUDA)       | ``make LLAMA_CUDA=1``                         |
+------------------------+-----------------------------------------------+
| Run Server             | ``./server --model ... --port 8087``          |
+------------------------+-----------------------------------------------+
| Web UI                 | http://localhost:8087                         |
+------------------------+-----------------------------------------------+
| API                    | http://localhost:8087/v1                      |
+------------------------+-----------------------------------------------+

**llama.cpp = lightweight, CPU/GPU flexible, no AVX-512 needed → ideal replacement for vLLM on older hardware.**

----------
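As a final smoke test that does not require the server, the same GGUF file can be queried once from the terminal with ``llama-cli`` (a minimal sketch; the prompt, token count, and thread count are illustrative):

.. code-block:: bash

   ./llama-cli \
     -m models/Meta-Llama-3-8B.Q4_K_S.gguf \
     -p "Explain the GGUF model format in one sentence." \
     -n 128 \
     -t 12
   # On a CUDA build, add -ngl 999 to offload all layers to the GPU

This runs a single completion, prints it to the terminal, and reports timing statistics at the end, which is convenient for comparing thread counts or CPU vs. GPU builds.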