
Serving vLLM Reranker Using Docker (CPU-Only)

To ensure reproducibility, portability, and isolation, vLLM can be deployed using Docker. This is especially useful in environments with restricted internet access (e.g., corporate networks behind proxies or firewalls), where Hugging Face Hub may be blocked or rate-limited.

In this setup, vLLM runs on CPU only (see "Why CPU-Only (No GPU)?" below). We therefore use the official CPU-optimized vLLM image built from: https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.cpu
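
If the vllm-cpu:latest tag used below is not already present locally, it can be built from the vLLM source tree. This is a minimal sketch; the exact build flags (build args, stage names) vary across vLLM versions:

git clone https://github.com/vllm-project/vllm.git
cd vllm
# some vLLM versions require selecting the server stage, e.g. --target vllm-openai
docker build -f docker/Dockerfile.cpu -t vllm-cpu:latest .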

Docker Compose Configuration (CPU Mode)

version: '3.8'
services:
  qwen-reranker:
    image: vllm-cpu:latest
    ports: ["8123:8000"]
    volumes:
      - /home/naj/qwen3-reranker-0.6b:/models/qwen3-reranker-0.6b:ro
    environment:
      VLLM_HF_OVERRIDES: |
        {
          "architectures": ["Qwen3ForSequenceClassification"],
          "classifier_from_token": ["no", "yes"],
          "is_original_qwen3_reranker": true
        }
    command: >
      /models/qwen3-reranker-0.6b
      --task score
      --dtype float32
      --port 8000
      --trust-remote-code
      --max-model-len 8192
    deploy:
      resources:
        limits:
          cpus: '10'
          memory: 16G
    shm_size: 4g
    restart: unless-stopped

Key Components Explained

  • image: a locally built vllm-cpu:latest (see the build step above); it contains no CUDA dependencies.
  • VLLM_HF_OVERRIDES: Qwen/Qwen3-Reranker-0.6B ships as a causal language model. These overrides instruct vLLM to load it as Qwen3ForSequenceClassification and derive the relevance score from the logits of the "no"/"yes" tokens, following vLLM's recipe for the original Qwen3 reranker checkpoints.
  • --task score: starts the server in scoring (cross-encoder) mode rather than text generation.
  • --dtype float32: full precision is the safe default on CPUs, where half-precision support is inconsistent.
  • --max-model-len 8192: caps the context window to bound memory use.
  • shm_size: 4g: provides shared memory for PyTorch worker communication.
  • deploy.resources.limits: caps the container at 10 CPUs / 16 GB so it cannot starve the host.

Why the Model Must Be Pre-Downloaded Locally

The container cannot download the model at runtime due to:

  1. Corporate Proxy / Firewall
    • Outbound traffic to huggingface.co is blocked or requires authentication.
  2. Hugging Face Hub Blocked
    • Git LFS and model downloads fail in restricted networks.
  3. vLLM Auto-Download Fails Offline
    • vLLM resolves model names through Hugging Face transformers, which attempts an online download whenever the weights are not found locally.

Solution: pre-download the model through a mirror, then mount the local directory into the container:

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-Reranker-0.6B --local-dir ./qwen3-reranker-0.6b
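
After the download completes, the local directory should hold the full checkpoint. A quick sanity check (typical file names for this model, not an exhaustive list):

ls ./qwen3-reranker-0.6b
# expect config.json, tokenizer.json, tokenizer_config.json, model.safetensors, ...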

Why CPU-Only (No GPU)?

Performance Note: CPU inference is slower (~1–3 sec per batch), but sufficient for development, prototyping, or low-throughput use cases.
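
If throughput matters, vLLM's CPU backend can be tuned through dedicated environment variables. A sketch of two documented knobs, with illustrative values, to add under environment: in the compose file:

      VLLM_CPU_KVCACHE_SPACE: "8"        # GiB of RAM reserved for the KV cache (illustrative)
      VLLM_CPU_OMP_THREADS_BIND: "0-9"   # pin OpenMP worker threads to the 10 allotted cores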

Start the Service

docker-compose up -d
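
Model loading on CPU takes a while; startup progress can be followed in the logs:

docker-compose logs -f qwen-reranker
# ready once a line like "Uvicorn running on http://0.0.0.0:8000" appears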

Verify Availability

curl http://localhost:8123/v1/models

Expected output confirms the model is loaded and ready.
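
An abridged sketch of the response shape (exact fields vary by vLLM version; the model id is the path passed on the command line):

{
  "object": "list",
  "data": [
    {
      "id": "/models/qwen3-reranker-0.6b",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}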

Integration with RAGFlow

Update RAGFlow config:

reranker:
  provider: vllm
  api_base: http://localhost:8123/v1
  model: /models/qwen3-reranker-0.6b
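
The reranker can also be exercised directly, independent of RAGFlow. Recent vLLM builds expose a score endpoint for cross-encoder models; the path and payload sketched here (/score with text_1/text_2) may differ slightly across vLLM versions:

curl http://localhost:8123/score \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/qwen3-reranker-0.6b",
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
  }'

Each entry in the returned data array carries a relevance score for the corresponding text_2 document; higher means more relevant.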

Benefits of This CPU + Docker Setup

  • Reproducible, portable, and isolated: the entire stack is pinned in one compose file, with no host-level Python or CUDA setup.
  • Offline-friendly: once the model is downloaded and mounted, no outbound access to Hugging Face is needed.
  • No GPU required: runs on commodity servers, VMs, or laptops.
  • Predictable footprint: the CPU/memory limits and restart policy keep the service well-behaved on shared hosts.

Ideal for local RAGFlow development and constrained production environments.