.. _vllm_docker:

Serving vLLM Reranker Using Docker (CPU-Only)
=============================================

To ensure **reproducibility**, **portability**, and **isolation**, **vLLM** can be deployed using **Docker**. This is especially useful in environments with restricted internet access (e.g., corporate networks behind proxies or firewalls), where **Hugging Face Hub** may be blocked or rate-limited.

In this setup, **vLLM runs on CPU only** because:

- the **laptop has no GPU**, and
- the **home server has an old NVIDIA GPU** that is not supported by vLLM’s CUDA requirements.

Thus, we use the **official CPU-optimized vLLM image** built from:
https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.cpu

Docker Compose Configuration (CPU Mode)
---------------------------------------

.. code-block:: yaml

   version: '3.8'

   services:
     qwen-reranker:
       image: vllm-cpu:latest
       ports: ["8123:8000"]
       volumes:
         - /home/naj/qwen3-reranker-0.6b:/models/qwen3-reranker-0.6b:ro
       environment:
         VLLM_HF_OVERRIDES: |
           {
             "architectures": ["Qwen3ForSequenceClassification"],
             "classifier_from_token": ["no", "yes"],
             "is_original_qwen3_reranker": true
           }
       command: >
         /models/qwen3-reranker-0.6b
         --task score
         --dtype float32
         --port 8000
         --trust-remote-code
         --max-model-len 8192
       deploy:
         resources:
           limits:
             cpus: '10'
             memory: 16G
       shm_size: 4g
       restart: unless-stopped

Key Components Explained
------------------------

- **``image: vllm-cpu:latest``**

  - Official vLLM CPU image (no CUDA dependencies).
  - Built from `vllm-project/vllm/docker/Dockerfile.cpu <https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.cpu>`_.
  - Uses the PyTorch CPU backend with optimized inference kernels.

- **``ports: ["8123:8000"]``**

  - Host port **8123** → container port **8000** (vLLM default).

- **``volumes``**

  - Mounts the **locally pre-downloaded model** in **read-only** mode.

- **``VLLM_HF_OVERRIDES``**

  - Required for **Qwen3-Reranker** because of its custom classification head and token handling.

- **``command``**

  - ``--task score``: Enables reranker scoring (outputs relevance logits).
  - ``--dtype float32``: Mandatory on CPU (no half-precision support).
  - ``--max-model-len 8192``: Supports long query+passage pairs.

- **Resource Limits**

  - ``cpus: '10'`` and ``memory: 16G`` prevent system overload.
  - ``shm_size: 4g`` ensures sufficient shared memory for batched inference.

Why the Model Must Be Pre-Downloaded Locally
--------------------------------------------

The container **cannot download the model at runtime** because:

1. **Corporate proxy / firewall**

   - Outbound traffic to ``huggingface.co`` is blocked or requires authentication.

2. **Hugging Face Hub blocked**

   - Git LFS and model downloads fail in restricted networks.

3. **vLLM auto-download fails offline**

   - vLLM uses ``transformers.AutoModel`` and attempts an online download if the model is not found locally.

**Solution: download via a mirror**

.. code-block:: bash

   HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-Reranker-0.6B --local-dir ./qwen3-reranker-0.6b

- For some models an access token is required (``HF_TOKEN=xxxxxxxx``); the token must be granted access to the specific model.
- Prefix the command with ``sudo`` if you are not running as root.
- Uses an **accessible mirror** (``hf-mirror.com``).
- Saves the model locally for volume mounting.

Why CPU-Only (No GPU)?
----------------------

- **Laptop**: Integrated graphics only (no discrete GPU).
- **Home server**: NVIDIA GPU too old (e.g., pre-Ampere), so it is **not supported** by vLLM’s CUDA 11.8+ / FlashAttention requirements.
- The **vLLM CPU image** provides full functionality without a GPU (a build sketch follows below).
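The Compose file references a local ``vllm-cpu:latest`` tag rather than an image pulled from a registry. If that tag is not already present on the host, one way to produce it is to build the upstream CPU Dockerfile yourself; the following is a minimal sketch, and the exact build flags and stage names may vary between vLLM releases:

.. code-block:: bash

   # Build the CPU-only vLLM image from the upstream Dockerfile.
   # Run this on a machine with internet access; if the target host is
   # offline, transfer the image afterwards with docker save / docker load.
   git clone https://github.com/vllm-project/vllm.git
   cd vllm
   docker build -f docker/Dockerfile.cpu -t vllm-cpu:latest .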
.. note::

   **Performance**: CPU inference is slower (roughly 1–3 seconds per batch), but sufficient for **development**, **prototyping**, or **low-throughput** use cases.

Start the Service
-----------------

.. code-block:: bash

   docker-compose up -d

Verify Availability
-------------------

.. code-block:: bash

   curl http://localhost:8123/v1/models

The response should list ``/models/qwen3-reranker-0.6b``, confirming the model is loaded and ready.

Integration with RAGFlow
------------------------

Update the RAGFlow config:

.. code-block:: yaml

   reranker:
     provider: vllm
     api_base: http://localhost:8123/v1
     model: /models/qwen3-reranker-0.6b

Benefits of This CPU + Docker Setup
-----------------------------------

- **Works on any machine** (laptop, old server, air-gapped systems)
- **No GPU required**
- **Offline-first** with a pre-downloaded model
- **Consistent environment** via Docker
- **Secure**: read-only model, isolated container
- **Scalable later**: switch to the GPU image when the hardware is upgraded

**Ideal for local RAGFlow development and constrained production environments.**
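Optional: Direct Scoring Test
-----------------------------

Beyond listing models, the reranker itself can be exercised with a scoring request. The sketch below assumes the vLLM build inside the image exposes the Score API; the endpoint path (``/score`` vs. ``/v1/score``) and the exact field names may differ between vLLM releases, and the query and passages are placeholder examples.

.. code-block:: bash

   # Score two candidate passages against one query; a higher score means
   # the passage is judged more relevant to the query.
   curl -s http://localhost:8123/score \
     -H "Content-Type: application/json" \
     -d '{
           "model": "/models/qwen3-reranker-0.6b",
           "text_1": "What is the capital of France?",
           "text_2": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
         }'

If the endpoint is available, the response contains one relevance score per passage, which is the signal RAGFlow uses to reorder retrieved chunks.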