Running Llama 3.1 with llama.cpp
1. Model Format: GGUF
llama.cpp uses the GGUF (GPT-Generated Unified Format) model format.
You can download a pre-quantized Llama 3.1 8B model in GGUF format directly from Hugging Face:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF
Example file: Meta-Llama-3-8B.Q4_K_S.gguf (~4.7 GB, 4-bit quantization, excellent quality/size tradeoff).
Note
The Q4_K_S variant uses ~4.7 GB RAM and runs efficiently on Intel i7 CPUs.
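If you prefer the command line to the web download page, the Hugging Face CLI can fetch just that one file. A minimal sketch, assuming the huggingface_hub CLI is installed and that you want the file in a local models/ folder (both are assumptions, not part of the original setup):
# Install the Hugging Face CLI if needed (assumes pip is available)
pip install -U "huggingface_hub[cli]"
# Download only the Q4_K_S file into ./models
huggingface-cli download QuantFactory/Meta-Llama-3-8B-GGUF \
  Meta-Llama-3-8B.Q4_K_S.gguf --local-dir models/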
2. Compile llama.cpp for Intel i7 (CPU-only)
# Step 1: Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Step 2: Build for CPU (Intel i7)
# AVX2 is picked up automatically: the default build compiles for the
# host CPU's native instruction set, and virtually all i7 chips have AVX2
make clean
make -j$(nproc)
The binaries are placed in the repository root:
./llama-cli    → interactive CLI
./llama-server → web server (OpenAI-compatible API + full web UI)
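As a quick sanity check, llama-cli can run a short prompt directly against the GGUF file. A sketch assuming the model from step 1 already sits in models/ (path and prompt are illustrative):
# One-off generation: load the model, answer a short prompt, emit up to 64 tokens
./llama-cli -m models/Meta-Llama-3-8B.Q4_K_S.gguf \
  -p "Explain GGUF in one sentence." -n 64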
Warning
Do not use vLLM on older Xeon v2 CPUs — they lack AVX-512, which vLLM requires. llama.cpp is a better choice — it runs efficiently with just AVX2 or even SSE.
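To confirm what your own CPU supports before building, the flags reported by the kernel are enough (Linux is assumed here):
# List the relevant SIMD extensions this CPU advertises; avx2 is all llama.cpp needs
grep -o -w -E 'avx512f|avx2|sse4_2' /proc/cpuinfo | sort -u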
3. Run the Model with Web Interface
Place the downloaded GGUF file in a models/ folder:
mkdir -p models
# Copy or symlink the model
ln -s /path/to/Meta-Llama-3-8B.Q4_K_S.gguf models/
Start the server on port 8087 using 12 threads:
./llama-server \
  --model models/Meta-Llama-3-8B.Q4_K_S.gguf \
  --port 8087 \
  --threads 12 \
  --host 0.0.0.0
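Once the model has finished loading, recent llama.cpp server builds expose a /health endpoint that is convenient for scripting (endpoint name taken from current upstream behaviour; adjust if your build differs):
# Returns {"status":"ok"} once the model is loaded and the server accepts requests
curl --noproxy 127.0.0.1 http://127.0.0.1:8087/health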
In a corporate network, make sure the request bypasses the proxy:
curl -X POST http://127.0.0.1:8087/v1/chat/completions \
  --noproxy 127.0.0.1,localhost \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
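The same endpoint can also stream tokens as they are generated by adding "stream": true to the JSON body (OpenAI-style streaming, which the llama.cpp server is assumed to support in your build); curl's -N disables output buffering so chunks appear as they arrive:
curl -sN -X POST http://127.0.0.1:8087/v1/chat/completions \
  --noproxy 127.0.0.1,localhost \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B",
    "stream": true,
    "messages": [{"role": "user", "content": "Write a haiku about CPUs."}]
  }'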
Features
Full web UI at: http://localhost:8087
OpenAI-compatible API at: http://localhost:8087/v1
List models:
curl http://localhost:8087/v1/models
Response:
{
  "object": "list",
  "data": [
    {
      "id": "models/Meta-Llama-3-8B.Q4_K_S.gguf",
      "object": "model",
      "created": 1762783003,
      "owned_by": "llamacpp",
      "meta": {
        "vocab_type": 2,
        "n_vocab": 128256,
        "n_ctx_train": 8192,
        "n_embd": 4096,
        "n_params": 8030261248,
        "size": 4684832768
      }
    }
  ]
}
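For scripts that only need the model identifier, jq (assumed to be installed) can pull it straight out of that response:
# Print just the id of the first (and here only) loaded model
curl -s --noproxy 127.0.0.1,localhost http://localhost:8087/v1/models \
  | jq -r '.data[0].id'
# → models/Meta-Llama-3-8B.Q4_K_S.gguf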
4. Compile llama.cpp with NVIDIA GPU Support (CUDA)
If you have an NVIDIA GPU (e.g., RTX 3060, 4070, A100, etc.), enable CUDA acceleration:
Prerequisites
NVIDIA driver (≥ 525)
CUDA Toolkit (≥ 11.8, preferably 12.x)
nvcc in $PATH
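These can be checked quickly from the shell before building (nvidia-smi and nvcc must already be installed):
# Driver version and visible GPUs
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# CUDA toolkit / nvcc version (must be on $PATH)
nvcc --version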
Build with CUDA
# Clean previous build
make clean
# Build with CUDA support (FP16 CUDA kernels enabled)
make -j$(nproc) \
  LLAMA_CUDA=1 \
  LLAMA_CUDA_F16=1
# Optional: target a specific compute capability (by default nvcc builds
# for the GPU detected at compile time)
# make -j$(nproc) LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_89   # e.g. RTX 40xx
Run with GPU offloading
# --n-gpu-layers 999 offloads ALL layers to the GPU
./llama-server \
  --model models/Meta-Llama-3-8B.Q4_K_S.gguf \
  --port 8087 \
  --threads 8 \
  --n-gpu-layers 999 \
  --host 0.0.0.0
Tip
Use nvidia-smi to monitor VRAM usage.
For the 8B Q4 model (~4.7 GB), even a 6 GB GPU can usually run it fully offloaded, as long as the context size stays modest.
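A simple way to watch VRAM while the server answers requests (the 1-second refresh interval is arbitrary):
# Refresh used/total VRAM every second while generating
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1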
Summary
| Feature | Command / Note |
|---|---|
| Model Format | GGUF |
| Download | QuantFactory/Meta-Llama-3-8B-GGUF on Hugging Face |
| CPU Build (i7) | make -j$(nproc) |
| GPU Build (CUDA) | make -j$(nproc) LLAMA_CUDA=1 LLAMA_CUDA_F16=1 |
| Run Server | ./llama-server --model models/Meta-Llama-3-8B.Q4_K_S.gguf --port 8087 --threads 12 |
| Web UI | http://localhost:8087 |
| API | http://localhost:8087/v1 |
llama.cpp = lightweight, CPU/GPU flexible, no AVX-512 needed → ideal replacement for vLLM on older hardware.