Batch Processing and Metadata Management in Infiniflow RAGFlow

Infiniflow RAGFlow provides a RESTful API (/api/v1) for programmatic control over datasets and documents, making it well suited to batch-processing large volumes of documents, building automated ingestion pipelines, and enriching metadata.

This is essential in enterprise settings where thousands of PDFs, reports, or web pages need to be:

  • Ingested in bulk

  • Tagged with structured metadata (author, date, source, category, etc.)

  • Updated post-ingestion

  • Queried or filtered later via the RAG system

API Base URL

http://<RAGFLOW_HOST>/api/v1

Authentication

All requests require a Bearer token:

Authorization: Bearer ragflow-<your-token>

> Tip: Obtain a token via login or the API key management page in the RAGFlow UI.

Step 1: Retrieve Dataset and Document IDs

Before updating, you must know the target:

  • Dataset ID (e.g., f388c05e9df711f0a0fe0242ac170003)

  • Document ID (e.g., 4920227c9eb711f0bff40242ac170003)

List all datasets:

curl -H "Authorization: Bearer ragflow-..." \
     http://192.168.0.213/api/v1/datasets

List documents in a dataset:

curl -H "Authorization: Bearer ragflow-..." \
     http://192.168.0.213/api/v1/datasets/<dataset_id>/documents
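
The same lookups can be scripted. A minimal Python sketch, assuming the endpoints return the {"code": ..., "data": ...} envelope shown in Step 2 below (inspect the raw payload to confirm the exact field names on your RAGFlow version):

import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-..."}  # your API key

# List datasets; "data" is assumed to hold the dataset records.
resp = requests.get(f"{BASE_URL}/datasets", headers=HEADERS)
for ds in resp.json().get("data", []):
    print("dataset:", ds.get("id"), ds.get("name"))

# List documents in one dataset; print the raw payload to locate document IDs.
dataset_id = "f388c05e9df711f0a0fe0242ac170003"
resp = requests.get(f"{BASE_URL}/datasets/{dataset_id}/documents", headers=HEADERS)
print(resp.json())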

Step 2: Add Metadata to a Document (via PUT)

Use the PUT endpoint to update metadata of an existing document:

curl --request PUT \
     --url http://192.168.0.213/api/v1/datasets/f388c05e9df711f0a0fe0242ac170003/documents/4920227c9eb711f0bff40242ac170003 \
     --header 'Content-Type: application/json' \
     --header 'Authorization: Bearer ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm' \
     --data '{
       "meta_fields": {
         "author": "Example Author",
         "publish_date": "2025-01-01",
         "category": "AI Business Report",
         "url": "https://example.com/report.pdf"
       }
     }'

Request Breakdown:

  • Method: PUT

  • Path: /api/v1/datasets/<dataset_id>/documents/<document_id>

  • Content-Type: application/json

  • Body: a JSON object containing the "meta_fields" map

Response (on success):

{
  "code": 0,
  "message": "Success",
  "data": { "document_id": "4920227c9eb711f0bff40242ac170003" }
}
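
RAGFlow reports failures inside this JSON envelope (a non-zero code) as well as via HTTP status, so scripts should check both. A minimal helper, assuming the envelope shown above:

import requests

def check(resp: requests.Response) -> dict:
    """Raise if the HTTP call or the RAGFlow envelope signals an error."""
    resp.raise_for_status()
    body = resp.json()
    if body.get("code", 0) != 0:  # 0 means success, per the example above
        raise RuntimeError(f"RAGFlow error {body.get('code')}: {body.get('message')}")
    return body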

Use Case: Batch Metadata Enrichment

You can automate metadata tagging for thousands of documents with a short script:

import requests
import pandas as pd

BASE_URL = "http://192.168.0.213/api/v1"
TOKEN = "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Load a CSV with columns: dataset_id, document_id, author, publish_date, category, source_url
df = pd.read_csv("documents_metadata.csv")

for _, row in df.iterrows():
    dataset_id = row["dataset_id"]
    doc_id = row["document_id"]
    payload = {
        "meta_fields": {
            "author": row["author"],
            "publish_date": row["publish_date"],
            "category": row["category"],
            "url": row["source_url"],
        }
    }

    # json= serializes the payload and sets Content-Type: application/json
    resp = requests.put(
        f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
        headers=HEADERS,
        json=payload,
    )
    body = resp.json()
    print(doc_id, body.get("code"), body.get("message"))
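
The column names above are assumptions for illustration; a matching documents_metadata.csv could look like this (values taken from the examples earlier in this guide):

dataset_id,document_id,author,publish_date,category,source_url
f388c05e9df711f0a0fe0242ac170003,4920227c9eb711f0bff40242ac170003,Example Author,2025-01-01,AI Business Report,https://example.com/report.pdf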

Benefits:

  • Enrich RAG context with structured, queryable metadata

  • Enable filtering in UI or API (e.g., “Show reports from 2025 by Author X”)

  • Improve traceability and auditability

Other Batch-Capable Endpoints

| Endpoint | Purpose |
|----------|---------|
| POST /api/v1/datasets | Create a new dataset |
| POST /api/v1/datasets/{id}/documents | Upload new documents |
| DELETE /api/v1/datasets/{id}/documents | Delete documents (IDs in the request body) |
| GET /api/v1/datasets/{id}/documents | List documents (paging and keyword filters) |
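
Bulk upload can be scripted the same way. A minimal sketch, assuming the upload endpoint accepts multipart/form-data with a "file" field (check the HTTP API reference for your version); metadata is then attached with the PUT call from Step 2:

import requests
from pathlib import Path

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-..."}
dataset_id = "f388c05e9df711f0a0fe0242ac170003"

for pdf in Path("reports").glob("*.pdf"):
    # Assumption: multipart upload with a "file" form field.
    with pdf.open("rb") as fh:
        resp = requests.post(
            f"{BASE_URL}/datasets/{dataset_id}/documents",
            headers=HEADERS,
            files={"file": (pdf.name, fh)},
        )
    print(pdf.name, resp.json().get("code"))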

Best Practices

  1. Always use IDs — never rely on filenames

  2. Batch in chunks (e.g., a few hundred documents per run) and throttle requests to avoid overloading the server

  3. Validate metadata schema in RAGFlow settings first

  4. Log every response so failed updates can be retried (see the sketch after this list)

  5. Use dataset-level permissions for access control
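
For items 2 and 4, a retry-with-backoff wrapper is a good starting point. A sketch (timeouts, attempt counts, and backoff values are illustrative; tune them to your deployment):

import logging
import time

import requests

log = logging.getLogger("ragflow-batch")

def put_with_retry(url, headers, payload, attempts=3, backoff=2.0):
    """PUT with retries; logs every response so failed IDs can be replayed."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.put(url, headers=headers, json=payload, timeout=30)
            body = resp.json()
            log.info("PUT %s -> HTTP %s, code=%s", url, resp.status_code, body.get("code"))
            if resp.ok and body.get("code", 0) == 0:
                return body
        except requests.RequestException as exc:
            log.warning("PUT %s attempt %d failed: %s", url, attempt, exc)
        time.sleep(backoff * attempt)  # linear backoff between attempts
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")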

See also:

https://github.com/infiniflow/ragflow/blob/main/example/http/dataset_example.sh

Summary

RAGFlow’s API-first design enables:

  • Scalable batch ingestion

  • Rich metadata attachment

  • Full automation of document lifecycle

> Perfect for ETL pipelines, CMS integration, or enterprise knowledge base automation.

With this API, you can manage tens of thousands of documents with full metadata — all programmatically.