RAGFlow GPU vs CPU: Full Explanation (2025 Edition)
***************************************************

Why Does RAGFlow Still Need a GPU Even When Using Ollama?
========================================================

You are absolutely right to ask this question.

Most people configure RAGFlow to use **external services** (Ollama, Xinference, vLLM, OpenAI, etc.) for:

- Embedding model
- Re-ranking model  
- Inference LLM (the actual answer generator)

Ollama runs these models **entirely on your GPU** — RAGFlow only sends HTTP requests.  
So why does the official documentation and community still strongly recommend the **ragflow-gpu** image (or setting ``DEVICE=gpu``)?

The answer is simple: **Deep Document Understanding (DeepDoc)** — the part that happens *before* any embedding or LLM call.

Complete RAGFlow Pipeline (with GPU usage marked)
=================================================

.. table::
   :widths: auto
   :align: center

   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | Step                      | Component                     | Runs inside RAGFlow?       | Uses GPU when?            |
   +===========================+===============================+============================+===========================+
   | 1. Document Upload        | DeepDoc parser                | Yes (core of RAGFlow)      | **YES** — heavily         |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 2. Chunking               | Text splitting                | Yes                        | No (pure CPU)             |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 3. Embedding              | Sentence-Transformers, BGE…   | External (Ollama, etc.)    | GPU via Ollama/vLLM       |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 4. Vector storage         | Elasticsearch / InfiniFlow DB | Yes                        | No                        |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 5. Retrieval              | Vector + keyword search       | Yes                        | No                        |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 6. Re-ranking             | Cross-encoder (optional)      | External or local          | GPU via external service  |
   +---------------------------+-------------------------------+----------------------------+---------------------------+
   | 7. Answer generation      | LLM (Llama 3, Qwen2, etc.)    | External (Ollama, etc.)    | GPU via Ollama/vLLM       |
   +---------------------------+-------------------------------+----------------------------+---------------------------+

→ The **only part that RAGFlow itself accelerates with GPU** is Step 1 — but it is by far the most compute-intensive for real-world documents.

What DeepDoc Actually Does (and Why GPU Makes It 5–20× Faster)
==============================================================

When you upload a PDF, scanned image, or complex report, DeepDoc performs these AI-heavy tasks:

1. **Layout Detection**  
   Detects columns, headers, footers, reading order using CNN-based models (LayoutLM-style).

2. **Table Structure Recognition (TSR)**  
   Identifies table boundaries, row/column spans, merged cells — extremely important for accurate retrieval.

3. **Formula & Math Recognition**  
   Converts LaTeX/math images into readable text.

4. **Enhanced OCR**  
   For scanned PDFs: runs deep-learning OCR models (not just Tesseract CPU).

5. **Visual Language Model Tasks** (charts, diagrams, screenshots)  
   Optionally calls lightweight VLMs (Qwen2-VL, LLaVA, etc.) to describe images inside the document.

All of these run **inside RAGFlow’s deepdoc module** using PyTorch + CUDA when ``DEVICE=gpu`` is enabled.

Real-World Performance Numbers
==============================

.. table::
   :widths: auto

   +----------------------------------+-----------------+-----------------+
   | Document Type                    | ragflow-cpu     | ragflow-gpu     |
   +==================================+=================+=================+
   | 50-page clean text PDF           | ~2–4 minutes    | ~2–4 minutes    |
   +----------------------------------+-----------------+-----------------+
   | 50-page scanned PDF (images)     | 30–90 minutes   | 4–10 minutes    |
   +----------------------------------+-----------------+-----------------+
   | 100-page financial report w/ tables | Often fails   | 8–15 minutes    |
   +----------------------------------+-----------------+-----------------+
   | PDF with charts & diagrams       | Very poor OCR   | Accurate + fast |
   +----------------------------------+-----------------+-----------------+

When Do You Actually Need ragflow-gpu?
=======================================

**YES – You need it if your documents contain any of the following:**
- Scanned pages (images instead of selectable text)
- Complex tables or financial reports
- Charts, graphs, screenshots
- Mixed layouts (multi-column, sidebars, footnotes)
- Handwritten notes or formulas

**NO – You can stay on ragflow-cpu if:**
- All documents are clean, born-digital text (Word → PDF, Markdown, etc.)
- You only do quick prototypes with a few simple files
- You have no NVIDIA GPU available

Recommended Setup in 2025 (Best of Both Worlds)
===============================================

.. code-block:: bash

   # .env file
   DEVICE=gpu                     # ← Enables DeepDoc GPU acceleration
   SVR_HTTP_PORT=80

   # Use Ollama (or vLLM) for embeddings + LLM
   EMBEDDING_MODEL=ollama/bge-large
   RERANK_MODEL=reranker  (this cannnot be served by ollama, vLLM or llama.cpp is needed)
   LLM_MODEL=ollama/llama3.1:70b

Result:
- DeepDoc parsing → blazing fast on your NVIDIA GPU (inside RAGFlow container)
- Embeddings + LLM → also blazing fast on the same GPU (inside Ollama container)
- No bottlenecks anywhere

Monitoring & Verification
=========================

During document upload:

.. code-block:: bash

   watch -n 1 nvidia-smi

You will see:
- RAGFlow container using 4–12 GB VRAM during parsing
- Ollama container using VRAM only when embedding or generatinging

Conclusion
==========

- **ragflow-cpu**  → fine for clean text only  
- **ragflow-gpu**  → mandatory for real-world unstructured documents

Even if Ollama handles your LLM and embeddings perfectly, **without ragflow-gpu, the very first step (understanding the document) will be painfully slow or inaccurate**.

Use ``DEVICE=gpu`` — it’s the difference between a toy and a true enterprise-grade RAG system.