Running Multiple Models on llama.cpp Using Docker

This guide demonstrates how to run multiple language models simultaneously using llama.cpp in Docker via docker-compose. The example below defines two services: a lightweight reranker model (Qwen3 0.6B) and a general-purpose chat model (Llama 3.1).
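
Before starting, each model must be available on the host as a GGUF file. The compose file below assumes a layout like the following; the host paths and the model.gguf filenames are placeholders for wherever your downloaded GGUF files actually live:

ls /home/naj/qwen3-reranker-0.6b /home/naj/llama3.1
/home/naj/qwen3-reranker-0.6b:
model.gguf
/home/naj/llama3.1:
model.gguf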

Example docker-compose.yml

services:
  qwen-reranker:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8123:8080"
    volumes:
      - /home/naj/qwen3-reranker-0.6b:/models/qwen3-reranker-0.6b:ro
    environment:
      - MODEL=/models/qwen3-reranker-0.6b
    command: >
      --model /models/qwen3-reranker-0.6b/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 6
      --temp 0.0
    deploy:
      resources:
        limits:
          cpus: '6'
          memory: 10G
    shm_size: 4g
    restart: unless-stopped

  llama3.1-chat:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8124:8080"
    volumes:
      - /home/naj/llama3.1:/models/llama3.1:ro
    environment:
      - MODEL=/models/llama3.1
    command: >
      --model /models/llama3.1/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 0
      --ctx-size 8192
      --threads 10
      --temp 0.7
    deploy:
      resources:
        limits:
          cpus: '10'
          memory: 20G
    shm_size: 8g
    restart: unless-stopped
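
Before bringing anything up, the file can be checked for indentation or schema mistakes with Compose's built-in config renderer:

# Renders the effective configuration and fails on YAML/schema errors
docker compose config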

Key Configuration Notes

  • Images: Both services use the official llama.cpp server image; the plain server tag is a CPU-only build (GPU variants such as server-cuda are published under separate tags).

  • Ports: each container runs llama-server on internal port 8080; a quick check that both endpoints respond is sketched after this list.
      - Reranker: host port 8123 → container port 8080
      - Chat model: host port 8124 → container port 8080

  • Volumes: Model directories are mounted read-only (:ro) from the host.

  • Environment: The MODEL variable is informational only; llama-server takes its model path from the --model flag in command, so this entry can be removed without changing behavior.

  • Command Flags:
      - --model: path to the GGUF file inside the container.
      - --n-gpu-layers 0: forces CPU-only inference (no layers offloaded to a GPU).
      - --ctx-size 8192: sets the context length.
      - --threads: number of CPU threads used for inference (6 for the reranker, 10 for chat).
      - --temp: controls sampling randomness (0.0 for deterministic reranking, 0.7 for chat).

  • Resource Limits: CPU and memory capped via deploy.resources.limits.

  • Shared Memory (shm_size): Increased to support larger contexts and batching.

  • Restart Policy: unless-stopped ensures containers restart on failure or reboot.
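
A quick way to confirm that both servers came up and that the resource caps are applied (a minimal sketch; /health is llama-server's built-in readiness endpoint):

# Each server reports readiness on its /health endpoint
curl -s http://localhost:8123/health   # reranker
curl -s http://localhost:8124/health   # chat model

# Check CPU and memory usage of the two containers against their limits
docker stats --no-stream $(docker compose ps -q)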

Usage

Start the services:

docker compose up -d
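
To confirm both containers are running and to watch the models load, use the standard Compose commands from the directory containing docker-compose.yml:

# List the two services with their state and published ports
docker compose ps

# Follow startup logs for either service
docker compose logs -f qwen-reranker
docker compose logs -f llama3.1-chat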

Access the models:

  • Reranker: http://localhost:8123

  • Chat: http://localhost:8124

Send requests using the OpenAI-compatible API or llama.cpp client tools.
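
For example, a chat completion against the Llama 3.1 service (a minimal sketch; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, and the model field can be any label since each container serves a single model):

curl -s http://localhost:8124/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [
          {"role": "user", "content": "Summarize what a reranker does in one sentence."}
        ],
        "temperature": 0.7
      }'

The reranker service answers through the same HTTP interface on port 8123.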