Running Multiple Models on llama.cpp Using Docker

This guide demonstrates how to run multiple language models simultaneously using llama.cpp in Docker via docker-compose. The example below defines two services: a lightweight reranker model (Qwen3 0.6B) and a general-purpose chat model (Llama 3.1).
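
Before starting, each model must be available on the host as a GGUF file. The compose file below assumes a layout like the following; the host paths and the model.gguf filenames are placeholders for wherever your downloaded GGUF files actually live:

ls /home/naj/qwen3-reranker-0.6b /home/naj/llama3.1
/home/naj/qwen3-reranker-0.6b:
model.gguf
/home/naj/llama3.1:
model.gguf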

Example docker-compose.yml

services:
  qwen-reranker:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8123:8080"
    volumes:
      - /home/naj/qwen3-reranker-0.6b:/models/qwen3-reranker-0.6b:ro
    environment:
      - MODEL=/models/qwen3-reranker-0.6b
    command: >
      --model /models/qwen3-reranker-0.6b/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 6
      --temp 0.0
    deploy:
      resources:
        limits:
          cpus: '6'
          memory: 10G
    shm_size: 4g
    restart: unless-stopped

  llama3.1-chat:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8124:8080"
    volumes:
      - /home/naj/llama3.1:/models/llama3.1:ro
    environment:
      - MODEL=/models/llama3.1
    command: >
      --model /models/llama3.1/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 10
      --temp 0.7
    deploy:
      resources:
        limits:
          cpus: '10'
          memory: 20G
    shm_size: 8g
    restart: unless-stopped
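
Before bringing anything up, the file can be checked for indentation or schema mistakes with Compose's built-in config renderer:

# Renders the effective configuration and fails on YAML/schema errors
docker compose config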

Key Configuration Notes

  • Images: Both services use the official llama.cpp server image; the plain server tag is a CPU-only build (GPU variants such as server-cuda are published under separate tags).

  • Ports: each container runs llama-server on internal port 8080; a quick check that both endpoints respond is sketched after this list.
      - Reranker: host port 8123 → container port 8080
      - Chat model: host port 8124 → container port 8080

  • Volumes: Model directories are mounted read-only (:ro) from the host.

  • Environment: The MODEL variable is informational only; llama-server takes its model path from the --model flag in command, so this entry can be removed without changing behavior.

  • Command Flags:
      - --model: path to the GGUF file inside the container.
      - --n-gpu-layers 0: forces CPU-only inference (no layers offloaded to a GPU).
      - --ctx-size 8192: sets the context length.
      - --threads: number of CPU threads used for inference (6 for the reranker, 10 for chat).
      - --temp: controls sampling randomness (0.0 for deterministic reranking, 0.7 for chat).

  • Resource Limits: CPU and memory capped via deploy.resources.limits.

  • Shared Memory (shm_size): Increased to support larger contexts and batching.

  • Restart Policy: unless-stopped ensures containers restart on failure or reboot.
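
A quick way to confirm that both servers came up and that the resource caps are applied (a minimal sketch; /health is llama-server's built-in readiness endpoint):

# Each server reports readiness on its /health endpoint
curl -s http://localhost:8123/health   # reranker
curl -s http://localhost:8124/health   # chat model

# Check CPU and memory usage of the two containers against their limits
docker stats --no-stream $(docker compose ps -q)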

Usage

Start the services:

docker compose up -d
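
To confirm both containers are running and to watch the models load, use the standard Compose commands from the directory containing docker-compose.yml:

# List the two services with their state and published ports
docker compose ps

# Follow startup logs for either service
docker compose logs -f qwen-reranker
docker compose logs -f llama3.1-chat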

Access the models:

  • Reranker: http://localhost:8123

  • Chat: http://localhost:8124

Send requests using the OpenAI-compatible API or llama.cpp client tools.
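
For example, a chat completion against the Llama 3.1 service (a minimal sketch; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, and the model field can be any label since each container serves a single model):

curl -s http://localhost:8124/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [
          {"role": "user", "content": "Summarize what a reranker does in one sentence."}
        ],
        "temperature": 0.7
      }'

The reranker service answers through the same HTTP interface on port 8123.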