MinerU and Its Use in RAGFlow
=============================

MinerU can be tested online at https://huggingface.co/spaces/opendatalab/MinerU

.. code-block:: bash

   # Build the image (about 34 GB in size).
   docker build -t mineru:latest -f Dockerfile .

   # Start a container. The API normally uses port 8000; port 8003 is used here instead.
   docker run --gpus all --shm-size 32g -p 30000:30000 -p 7860:7860 -p 8003:8003 --ipc=host -it mineru:latest /bin/bash

Introduction to MinerU
----------------------

MinerU is an open-source tool developed by OpenDataLab (Shanghai AI Laboratory) for converting complex PDF documents into machine-readable formats such as Markdown or JSON. It extracts text, tables (as HTML or LaTeX), mathematical formulas (as LaTeX), and images (with captions), and it preserves document structure, including headings, paragraphs, lists, and reading order in multi-column layouts.

Key features include:

- Removal of noise elements such as headers, footers, footnotes, and page numbers.
- Support for scanned PDFs via OCR (PaddleOCR, multilingual with over 80 languages).
- Handling of complex layouts, including scientific literature with symbols and equations.
- Multiple backends: pipeline (rule-based, CPU-friendly), VLM-based (vision-language models for higher accuracy, often GPU-accelerated), and hybrid modes.
- Built on PDF-Extract-Kit models for layout detection, table recognition, and formula parsing.
- Licensed under AGPL-3.0.

MinerU is particularly suited to preparing documents for LLM workflows such as Retrieval-Augmented Generation (RAG), because its structured, clean output helps reduce hallucinations in downstream tasks. Recent versions (e.g., MinerU 2.5) achieve state-of-the-art performance on benchmarks like OmniDocBench.

MinerU in RAGFlow
-----------------

RAGFlow is an open-source RAG engine focused on deep document understanding, supporting complex data ingestion for accurate question-answering with citations. MinerU integration was introduced in RAGFlow v0.22.0 (released October 2025), supporting MinerU >= 2.6.3.
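
The version requirement above can be checked with a simple tuple comparison. This is only a minimal sketch: ``parse_version`` and ``meets_minimum`` are hypothetical helpers, not part of RAGFlow or MinerU, and the code assumes plain dotted numeric versions such as ``2.6.3`` with no pre-release suffixes.

.. code-block:: python

   MIN_MINERU_VERSION = "2.6.3"  # minimum version stated in this document


   def parse_version(version):
       """Turn a dotted numeric version like '2.6.3' into a tuple of ints."""
       return tuple(int(part) for part in version.split("."))


   def meets_minimum(installed, minimum=MIN_MINERU_VERSION):
       """Return True if the installed version is at least the minimum."""
       return parse_version(installed) >= parse_version(minimum)


   if __name__ == "__main__":
       for candidate in ("2.5.4", "2.6.3", "2.10.0"):
           print(candidate, meets_minimum(candidate))

Comparing integer tuples rather than raw strings matters here: as strings, ``"2.10.0"`` would sort before ``"2.6.3"``.
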

RAGFlow acts solely as a **client** to MinerU:

- RAGFlow calls MinerU to parse uploaded PDFs.
- MinerU processes the file and outputs structured data (e.g., JSON/Markdown with images and tables).
- RAGFlow reads the output and proceeds with chunking, embedding, and indexing.

Configuration options:

- Enable via ``USE_MINERU=true`` in Docker/.env or manual environment variables.
- Select MinerU in the dataset configuration UI under “PDF parser” (for built-in pipelines) or in the Parser component (for custom pipelines).
- Remote MinerU API deployment is supported (e.g., via a vLLM backend for GPU offloading, decoupling it from RAGFlow’s CPU-only server).
- MinerU is available alongside other parsers such as DeepDoc (RAGFlow’s default parser), Naive (text-only), and Docling.

This integration leverages MinerU’s strong handling of complex PDFs (e.g., tables and formulas in academic/technical documents) to improve retrieval quality in RAGFlow-based applications.

Comparison with Existing PDF Ingestion Tools
--------------------------------------------

Common PDF ingestion tools for RAG include Unstructured.io, LlamaParse (LlamaIndex), Docling, Marker, and traditional libraries like PyMuPDF. As of early 2026, MinerU frequently ranks among the top open-source options in benchmarks for complex PDFs, especially scientific/technical ones with tables and formulas.

.. list-table:: Comparison of PDF Parsers for RAG (as of early 2026)
   :header-rows: 1

   * - Tool
     - Type
     - Key Strengths
     - Weaknesses
     - Best For
     - Open-Source
     - GPU Required
   * - **MinerU**
     - VLM/Rule-based
     - Excellent table/formula extraction (LaTeX/HTML), layout preservation, multilingual OCR, clean Markdown/JSON, SOTA on scientific PDFs.
     - Resource-intensive (VLM backend), AGPL-3.0 may require source disclosure for SaaS.
     - Academic/technical PDFs, precise structured RAG.
     - Yes
     - Yes (for best accuracy)
   * - **Unstructured.io**
     - Rule-based + partitions
     - Broad format support, fast partitioning, good integrations.
     - Weaker on complex tables/formulas vs VLM tools, needs post-processing.
     - General enterprise documents, multi-format.
     - Yes (core)
     - No
   * - **LlamaParse**
     - Cloud/API
     - Fast, superior table extraction, seamless LlamaIndex integration.
     - Proprietary/paid for advanced features, privacy concerns.
     - Quick high-quality parsing (cloud).
     - No
     - No (cloud)
   * - **Docling**
     - Modular (IBM)
     - Fast, local/offline, native Office format support, balanced accuracy.
     - Less strong on formulas/scientific content or complex tables than MinerU.
     - Local deployments, mixed document types.
     - Yes
     - No
   * - **Marker**
     - VLM-based
     - Fast PDF-to-Markdown, good OCR, offline.
     - Slightly behind MinerU on table/formula precision in recent benchmarks.
     - Offline Markdown conversion.
     - Yes
     - Optional

Summary of MinerU Advantages
----------------------------

- High performance in 2025–2026 benchmarks (e.g., top scores in table recognition, formula parsing, and layout accuracy on complex documents).
- Superior to Unstructured for structured scientific output; often comparable to or better than LlamaParse in open-source/local setups.
- In RAGFlow, it complements or outperforms the default DeepDoc parser for challenging PDFs requiring top-tier layout/table handling.

For most RAG pipelines in 2026, MinerU is a leading open-source choice for difficult PDFs, particularly when integrated into frameworks like RAGFlow.

Activating MinerU in RAGFlow
----------------------------

In the dataset, go to Configuration / Ingestion pipeline / PDF parser and select ``mineru-from-env-1`` (marked "Experimental").

Then adapt the ``.env`` settings:

.. code-block:: bash

   # Enable Mineru
   # Uncommenting these lines will automatically add MinerU to the model provider whenever possible.
   # More details see https://ragflow.io/docs/faq#how-to-use-mineru-to-parse-pdf-documents.
   MINERU_APISERVER=http://host.docker.internal:8003
   MINERU_DELETE_OUTPUT=0 # keep output directory
   MINERU_BACKEND=pipeline # or another backend you prefer
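
Before restarting RAGFlow, it can help to sanity-check these settings. The following is a minimal sketch, not RAGFlow's own loader: ``parse_env`` and ``check_mineru_settings`` are hypothetical helpers, the inline-comment handling is a simplification of real ``.env`` parsing, and the backend value is only checked for presence because the full list of accepted backends is not given here.

.. code-block:: python

   from urllib.parse import urlparse

   # Sample settings copied from the .env fragment in this section.
   SAMPLE_ENV = """\
   # Enable Mineru
   MINERU_APISERVER=http://host.docker.internal:8003
   MINERU_DELETE_OUTPUT=0 # keep output directory
   MINERU_BACKEND=pipeline # or another backend you prefer
   """


   def parse_env(text):
       """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
       settings = {}
       for raw in text.splitlines():
           line = raw.strip()
           if not line or line.startswith("#") or "=" not in line:
               continue
           key, _, value = line.partition("=")
           # Simplification (assumption): " #" starts an inline comment.
           value = value.split(" #", 1)[0].strip()
           settings[key.strip()] = value
       return settings


   def check_mineru_settings(settings):
       """Return a list of human-readable problems (empty if none found)."""
       problems = []
       url = urlparse(settings.get("MINERU_APISERVER", ""))
       if url.scheme not in ("http", "https") or not url.hostname:
           problems.append("MINERU_APISERVER must be an http(s) URL")
       if not settings.get("MINERU_BACKEND"):
           problems.append("MINERU_BACKEND is not set")
       return problems


   if __name__ == "__main__":
       env = parse_env(SAMPLE_ENV)
       print(env)
       print(check_mineru_settings(env) or "settings look plausible")

The check is purely syntactic. It does not contact the MinerU server, so a passing result only means the values are well-formed, not that the API at that address is reachable.
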