# Running Multiple Models on llama.cpp Using Docker
This guide shows how to run multiple language models simultaneously with llama.cpp in Docker using Docker Compose. The example below defines two services: a lightweight reranker (Qwen3 Reranker 0.6B) and a general-purpose chat model (Llama 3.1).
## Example `docker-compose.yml`
```yaml
services:
  qwen-reranker:
    image: ghcr.io/ggerganov/llama.cpp:server   # CPU build of the llama.cpp server
    ports:
      - "8123:8080"
    volumes:
      - /home/naj/qwen3-reranker-0.6b:/models/qwen3-reranker-0.6b:ro
    environment:
      - MODEL=/models/qwen3-reranker-0.6b
    command: >
      --model /models/qwen3-reranker-0.6b/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 6
      --temp 0.0
      --reranking
    deploy:
      resources:
        limits:
          cpus: '6'
          memory: 10G
    shm_size: 4g
    restart: unless-stopped

  llama3.1-chat:
    image: ghcr.io/ggerganov/llama.cpp:server   # CPU build of the llama.cpp server
    ports:
      - "8124:8080"
    volumes:
      - /home/naj/llama3.1:/models/llama3.1:ro
    environment:
      - MODEL=/models/llama3.1
    command: >
      --model /models/llama3.1/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 10
      --temp 0.7
    deploy:
      resources:
        limits:
          cpus: '10'
          memory: 20G
    shm_size: 8g
    restart: unless-stopped
```
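Before bringing anything up, it can be useful to validate the file; `docker compose config` prints the fully resolved configuration and reports syntax errors:

```bash
# Validate the compose file and print the resolved configuration
docker compose config
```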
## Key Configuration Notes
- **Images:** Both services use the official CPU build of the llama.cpp server image (the `server` tag).
- **Ports:** The reranker is published on host port `8123` and the chat model on `8124`; both map to port `8080` inside their containers.
- **Volumes:** Model directories are mounted read-only (`:ro`) from the host.
- **Environment:** The `MODEL` variable is informational only; llama-server reads its model path from the `--model` flag in `command`.
- **Command flags:**
  - `--n-gpu-layers 0`: forces CPU-only inference.
  - `--ctx-size 8192`: sets the context window to 8192 tokens.
  - `--temp`: controls sampling randomness (`0.0` for deterministic reranking, `0.7` for chat).
  - `--reranking`: enables llama-server's reranking endpoint, so the Qwen3 service can answer rerank requests.
- **Resource limits:** CPU and memory are capped via `deploy.resources.limits` (see the verification snippet after this list).
- **Shared memory (`shm_size`):** increased to support larger contexts and batching.
- **Restart policy:** `unless-stopped` restarts containers after crashes and host reboots unless they were stopped manually.
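To confirm that the CPU and memory limits are actually enforced once the stack is up, one option (not part of the compose file itself) is to snapshot container resource usage with `docker stats`:

```bash
# One-shot snapshot of CPU and memory usage for this project's containers;
# the MEM USAGE / LIMIT column should show the 10G and 20G caps.
docker stats --no-stream $(docker compose ps -q)
```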
## Usage
Start the services:

```bash
docker compose up -d
```
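Loading the GGUF files can take a moment on first start. To watch progress and confirm a server is ready before sending traffic, one approach is to follow the service logs and probe llama-server's `/health` endpoint:

```bash
# Follow the reranker's startup logs until the model finishes loading
docker compose logs -f qwen-reranker

# llama-server answers on /health once it is ready to serve requests
curl http://localhost:8123/health
curl http://localhost:8124/health
```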
Access the models:

- Reranker: http://localhost:8123
- Chat: http://localhost:8124
Send requests using the OpenAI-compatible API or llama.cpp client tools.
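As a sketch of what requests look like, the chat service can be queried through the OpenAI-compatible `/v1/chat/completions` endpoint and the reranker through the `/v1/rerank` endpoint (served when llama-server runs with `--reranking`); the query and documents below are placeholders:

```bash
# Chat completion against the Llama 3.1 service (OpenAI-compatible endpoint)
curl http://localhost:8124/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain what a reranker does in one sentence."}
    ],
    "temperature": 0.7
  }'

# Rerank two candidate documents against a query with the Qwen3 reranker
curl http://localhost:8123/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I limit container memory in Docker Compose?",
    "documents": [
      "deploy.resources.limits caps CPU and memory for a service.",
      "The --temp flag controls sampling randomness in llama.cpp."
    ]
  }'
```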