.. _vllm_reranker:

Why vLLM is Used to Serve the Reranker Model
============================================

vLLM is a high-throughput, memory-efficient inference engine specifically designed for serving large language models (LLMs). In **Infiniflow RAGFlow**, the **reranker model**, responsible for fine-grained relevance scoring of retrieved document chunks, is served using **vLLM** to ensure low-latency, scalable, and production-ready performance.

Key Reasons for Using vLLM to Serve the Reranker
------------------------------------------------

1. **PagedAttention for Memory Efficiency**

   - vLLM uses **PagedAttention**, a novel attention mechanism that manages the KV cache in non-contiguous memory pages.
   - This dramatically reduces memory fragmentation and enables **higher batch sizes** and **longer sequence lengths** (up to 8192 tokens in this case), which is critical for processing query-chunk pairs during reranking.

2. **High Throughput & Low Latency**

   - Supports **continuous batching**, allowing dynamic batch formation as requests arrive.
   - Eliminates head-of-line blocking and maximizes GPU utilization, which is ideal for real-time reranking in interactive RAG pipelines.

3. **OpenAI-Compatible API**

   - Exposes a clean, standardized REST API compatible with OpenAI's format.
   - Enables seamless integration with RAGFlow's orchestration layer without custom inference code.

4. **Support for Cross-Encoder Rerankers**

   - Models like **Qwen3-Reranker-0.6B** are cross-encoders that take ``[query, passage]`` pairs as input.
   - vLLM efficiently handles the bidirectional attention these models require, delivering relevance scores via ``logits[0]`` (typically a binary relevant/irrelevant classification).

5. **Ollama Does Not Support Reranker Models (Yet)**

   - **Ollama** is excellent for local LLM inference and chat models, but it **currently lacks native support for reranker (cross-encoder) models**.
   - Rerankers require structured input formatting and logit extraction that Ollama's current API and model loading system do not accommodate.
   - vLLM, in contrast, supports a wide range of Hugging Face transformer models, including rerankers, with full access to model outputs and fine-grained control.

6. **Scalability Advantage Over Ollama**

   - When scaling to **multiple concurrent users** or **high-throughput workloads**, vLLM is significantly more robust than Ollama.
   - vLLM supports **distributed serving**, **tensor parallelism**, **GPU clustering**, and **dynamic batching at scale**.
   - Ollama is primarily designed for **single-user, local development** and does not scale efficiently in production environments.

Serving the Reranker Locally with vLLM
--------------------------------------

You can run the reranker model locally using vLLM with the following command:

.. code-block:: bash

   vllm serve /models/qwen3-reranker-0.6b \
       --port 8123 \
       --max-model-len 8192 \
       --dtype auto \
       --trust-remote-code

Once running, the model is accessible via the OpenAI-compatible endpoint:

**GET** ``http://localhost:8123/v1/models``
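As a quick sanity check, you can hit this endpoint with ``curl``. This is a minimal sketch; only the host, port, and path from the serve command above are assumed:

.. code-block:: bash

   # List the models served by the local vLLM instance
   curl http://localhost:8123/v1/models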
**Example Response**:

.. code-block:: json

   {
     "object": "list",
     "data": [
       {
         "id": "/models/qwen3-reranker-0.6b",
         "object": "model",
         "created": 1762258164,
         "owned_by": "vllm",
         "root": "/models/qwen3-reranker-0.6b",
         "parent": null,
         "max_model_len": 8192,
         "permission": [
           {
             "id": "modelperm-1a0d5938e30b4eeebb53d9e5c7d9599e",
             "object": "model_permission",
             "created": 1762258164,
             "allow_create_engine": false,
             "allow_sampling": true,
             "allow_logprobs": true,
             "allow_search_indices": false,
             "allow_view": true,
             "allow_fine_tuning": false,
             "organization": "*",
             "group": null,
             "is_blocking": false
           }
         ]
       }
     ]
   }

RAGFlow Integration
-------------------

RAGFlow configures the reranker endpoint in its settings:

.. code-block:: yaml

   reranker:
     provider: vllm
     api_base: http://localhost:8123/v1
     model: /models/qwen3-reranker-0.6b

During inference, RAGFlow sends batched ``[query, passage]`` pairs to the vLLM server, receives relevance scores, and reorders chunks before passing them to the chat model.

**Result**: fast, accurate, and scalable reranking powered by optimized LLM inference, in both development and production, where Ollama cannot currently follow and vLLM excels.
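For reference, the kind of request RAGFlow issues during reranking can be approximated with ``curl``. This is a minimal sketch that assumes a recent vLLM release exposing a ``/v1/rerank`` route for scoring models; the endpoint name, payload shape, and example texts are illustrative assumptions rather than part of the RAGFlow configuration above, and may differ between vLLM versions:

.. code-block:: bash

   # Sketch: score three candidate chunks against one query.
   # Assumes the served model works with vLLM's rerank/score API;
   # route and payload may vary depending on the vLLM version.
   curl http://localhost:8123/v1/rerank \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/models/qwen3-reranker-0.6b",
       "query": "What is PagedAttention?",
       "documents": [
         "PagedAttention manages the KV cache in non-contiguous pages.",
         "Ollama is designed for local, single-user inference.",
         "Continuous batching forms batches dynamically as requests arrive."
       ]
     }'

The response carries one relevance score per document, which is exactly what RAGFlow needs in order to reorder the retrieved chunks before they reach the chat model.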