MinerU

MinerU and Its Use in RAGFlow

Online demo for testing: https://huggingface.co/spaces/opendatalab/MinerU

Build the image (34 GB in size!):

docker build -t mineru:latest -f Dockerfile .

Run the container (port 8000 is the usual choice, but I used 8003):

docker run --gpus all --shm-size 32g -p 30000:30000 -p 7860:7860 -p 8003:8003 --ipc=host -it mineru:latest /bin/bash

Introduction to MinerU

MinerU is an open-source tool developed by OpenDataLab (Shanghai AI Laboratory) for converting complex PDF documents into machine-readable formats such as Markdown or JSON. It extracts text, tables (as HTML or LaTeX), mathematical formulas (as LaTeX), and images (with captions), and it preserves document structure, including headings, paragraphs, lists, and the reading order of multi-column layouts.

Key features include:

MinerU is particularly suited to preparing documents for LLM workflows such as Retrieval-Augmented Generation (RAG), because its clean, structured output helps reduce hallucinations in downstream tasks. Recent versions (e.g., MinerU 2.5) report state-of-the-art results on benchmarks such as OmniDocBench.

MinerU in RAGFlow

RAGFlow is an open-source RAG engine focused on deep document understanding, supporting complex data ingestion for accurate question-answering with citations.

MinerU integration was introduced in RAGFlow v0.22.0 (released October 2025), supporting MinerU >= 2.6.3. RAGFlow acts solely as a client to MinerU:

Configuration options:

This integration leverages MinerU’s superior handling of complex PDFs (e.g., tables, formulas in academic/technical documents) to improve retrieval quality in RAGFlow-based applications.

Comparison with Existing PDF Ingestion Tools

Common PDF ingestion tools for RAG include Unstructured.io, LlamaParse (LlamaIndex), Docling, Marker, and traditional libraries like PyMuPDF. As of early 2026, MinerU frequently ranks among the top open-source options in benchmarks for complex PDFs, especially scientific/technical ones with tables and formulas.

Summary of MinerU Advantages

For most RAG pipelines in 2026, MinerU is a leading open-source choice for difficult PDFs, particularly when integrated into frameworks like RAGFlow.

Activating MinerU in RAGFlow

In the dataset, go to Configuration / Ingestion pipeline / PDF parser and select: mineru-from-env-1 Experimental

Adapt the .env settings:

# Enable MinerU
MINERU_APISERVER=http://host.docker.internal:8003
MINERU_DELETE_OUTPUT=0   # keep output directory
MINERU_BACKEND=pipeline  # or another backend you prefer
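
Before starting an ingestion run, it can help to confirm that the settings above parse as intended and that the MinerU port is reachable from wherever RAGFlow runs. The following is a minimal sketch under stated assumptions, not part of RAGFlow or MinerU: the helper names (parse_env, server_address, is_reachable) are hypothetical, the parser only handles simple KEY=VALUE lines with optional trailing "# ..." comments, and the reachability check is a plain TCP connect that says nothing about whether the MinerU API itself is healthy.

```python
import socket
from urllib.parse import urlparse

# Example settings copied from the notes above (values are the ones used there).
EXAMPLE_ENV = """\
# Enable MinerU
MINERU_APISERVER=http://host.docker.internal:8003
MINERU_DELETE_OUTPUT=0   # keep output directory
MINERU_BACKEND=pipeline  # or another backend you prefer
"""


def parse_env(text):
    """Parse simple KEY=VALUE lines (hypothetical helper, not RAGFlow's parser).

    Blank lines and full-line comments are skipped; a trailing comment
    introduced by whitespace followed by '#' is dropped from the value.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def server_address(settings):
    """Return (host, port) taken from MINERU_APISERVER."""
    url = urlparse(settings["MINERU_APISERVER"])
    port = url.port or (443 if url.scheme == "https" else 80)
    return url.hostname, port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    settings = parse_env(EXAMPLE_ENV)
    host, port = server_address(settings)
    print(settings)
    print(f"MinerU expected at {host}:{port}")
```

To test connectivity, call is_reachable(host, port) from inside the RAGFlow container, since host.docker.internal is resolved from the container's point of view and may not resolve on the host itself.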