Batch Processing and Metadata Management in Infiniflow RAGFlow

Infiniflow RAGFlow provides a RESTful API (/api/v1) that enables programmatic control over datasets and documents, making it ideal for batch processing large volumes of documents, automated ingestion pipelines, and metadata enrichment.

This is essential in enterprise settings where thousands of PDFs, reports, or web pages need to be ingested, tagged with consistent metadata, and kept searchable without manual work.

API Base URL

http://<RAGFLOW_HOST>/api/v1

Authentication

All requests require a Bearer token:

Authorization: Bearer ragflow-<your-token>

Tip: Obtain a token via the login flow or the API key management page in the RAGFlow UI.
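
In Python, the same header can be attached to every call through a requests.Session. A minimal sketch, with the host and token placeholders above standing in for your own values:

import requests

# Reuse one session so the Bearer token is sent with every request.
session = requests.Session()
session.headers.update({"Authorization": "Bearer ragflow-<your-token>"})

resp = session.get("http://<RAGFLOW_HOST>/api/v1/datasets")
print(resp.status_code)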

Step 1: Retrieve Dataset and Document IDs

Before updating a document, you need its dataset ID and document ID:

List all datasets:

curl -H "Authorization: Bearer ragflow-..." \
     http://192.168.0.213/api/v1/datasets

List documents in a dataset:

curl -H "Authorization: Bearer ragflow-..." \
     http://192.168.0.213/api/v1/datasets/<dataset_id>/documents
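
The same lookups can be scripted. A minimal sketch in Python, assuming the list endpoints wrap their results in the code/data envelope shown later in this post and that each dataset object carries id and name fields (verify against your RAGFlow version):

import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-..."}

# List all datasets and print their IDs and names.
datasets = requests.get(f"{BASE_URL}/datasets", headers=HEADERS).json()
for ds in datasets.get("data", []):
    print("dataset:", ds.get("id"), ds.get("name"))

# List documents in one dataset (substitute a real dataset ID).
dataset_id = "<dataset_id>"
docs = requests.get(f"{BASE_URL}/datasets/{dataset_id}/documents", headers=HEADERS).json()
print(docs.get("data"))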

Step 2: Add Metadata to a Document (via PUT)

Use the PUT endpoint to update metadata of an existing document:

curl --request PUT \
     --url http://192.168.0.213/api/v1/datasets/f388c05e9df711f0a0fe0242ac170003/documents/4920227c9eb711f0bff40242ac170003 \
     --header 'Content-Type: application/json' \
     --header 'Authorization: Bearer ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm' \
     --data '{
       "meta_fields": {
         "author": "Example Author",
         "publish_date": "2025-01-01",
         "category": "AI Business Report",
         "url": "https://example.com/report.pdf"
       }
     }'

Request Breakdown:

The URL path identifies the target document: /datasets/<dataset_id>/documents/<document_id>, with both IDs taken from Step 1. The Authorization header carries the Bearer token, and the body is a JSON object whose meta_fields key holds the custom key/value metadata to attach to the document.

Response (on success):

{
  "code": 0,
  "message": "Success",
  "data": { "document_id": "4920227c9eb711f0bff40242ac170003" }
}
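
The code field indicates success (0). A small helper sketch that treats anything else as a failure (the function name is my own):

import requests

def check_ragflow_response(resp: requests.Response) -> dict:
    """Raise on transport or application errors, otherwise return the parsed body."""
    resp.raise_for_status()            # HTTP-level errors (4xx/5xx)
    body = resp.json()
    if body.get("code") != 0:          # application-level errors are reported in "code"
        raise RuntimeError(f"RAGFlow error: {body.get('message')}")
    return body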

Use Case: Batch Metadata Enrichment

You can automate metadata tagging for thousands of documents with a script like the following:

import requests
import pandas as pd

BASE_URL = "http://192.168.0.213/api/v1"
TOKEN = "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Load a CSV with one row per document: dataset_id, document_id, author,
# publish_date, category, source_url (an example file is shown below).
df = pd.read_csv("documents_metadata.csv")

for _, row in df.iterrows():
    dataset_id = row["dataset_id"]
    doc_id = row["document_id"]
    payload = {
        "meta_fields": {
            "author": row["author"],
            "publish_date": row["publish_date"],
            "category": row["category"],
            "url": row["source_url"]
        }
    }

    # Send the metadata as a JSON body; the json= argument sets the
    # Content-Type: application/json header automatically.
    resp = requests.put(
        f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
        headers=HEADERS,
        json=payload
    )
    print(doc_id, resp.json().get("message"))
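
The script expects one row per document; a minimal documents_metadata.csv matching the column names used above could look like this:

dataset_id,document_id,author,publish_date,category,source_url
f388c05e9df711f0a0fe0242ac170003,4920227c9eb711f0bff40242ac170003,Example Author,2025-01-01,AI Business Report,https://example.com/report.pdf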

Benefits: metadata stays consistent across the whole corpus, no manual tagging in the UI, and the enriched fields can later be used to filter retrieval.

Other Batch-Capable Endpoints

| Endpoint | Purpose |
| --- | --- |
| POST /api/v1/datasets | Create new dataset |
| POST /api/v1/datasets/{id}/documents | Upload new documents (with metadata) |
| DELETE /api/v1/datasets/{id}/documents/{doc_id} | Remove document |
| GET /api/v1/datasets/{id}/documents | List + filter by metadata |
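
For example, uploads and metadata tagging can be chained: push the file first, then attach meta_fields with the PUT from Step 2. A hedged sketch — the multipart field name "file" and the shape of the upload response ("data" as a list of created documents) are assumptions to verify against your RAGFlow version:

import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-..."}
dataset_id = "<dataset_id>"

# Upload a new document (multipart form upload, field name assumed to be "file").
with open("report.pdf", "rb") as f:
    up = requests.post(
        f"{BASE_URL}/datasets/{dataset_id}/documents",
        headers=HEADERS,
        files={"file": ("report.pdf", f, "application/pdf")}
    )
doc_id = up.json()["data"][0]["id"]   # assumed response shape

# Attach metadata to the freshly uploaded document.
requests.put(
    f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
    headers=HEADERS,
    json={"meta_fields": {"category": "AI Business Report"}}
)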

Best Practices

  1. Always use IDs — never rely on filenames
  2. Batch requests in chunks (e.g., 100 documents at a time) and throttle between chunks to avoid rate limits (see the sketch after this list)
  3. Validate metadata schema in RAGFlow settings first
  4. Log responses for retry logic
  5. Use dataset-level permissions for access control
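
Practices 2 and 4 can be combined in the batch script above: throttle between chunks and keep the IDs of failed updates for a retry pass. A minimal sketch — update_metadata is a hypothetical wrapper around the PUT from Step 2, and the batch size and delay are illustrative, not documented RAGFlow limits:

import time
import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-..."}
BATCH_SIZE = 100        # illustrative chunk size

def update_metadata(dataset_id: str, doc_id: str, meta: dict) -> bool:
    """Hypothetical wrapper: return True only if RAGFlow reports code == 0."""
    resp = requests.put(
        f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
        headers=HEADERS,
        json={"meta_fields": meta}
    )
    return resp.ok and resp.json().get("code") == 0

rows: list[tuple[str, str, dict]] = []   # (dataset_id, document_id, meta) rows from your CSV
failed = []

for start in range(0, len(rows), BATCH_SIZE):
    for dataset_id, doc_id, meta in rows[start:start + BATCH_SIZE]:
        if not update_metadata(dataset_id, doc_id, meta):
            failed.append(doc_id)        # log for a retry pass (practice 4)
    time.sleep(1)                        # simple throttle between chunks (practice 2)

print("failed:", failed)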

See also:

https://github.com/infiniflow/ragflow/blob/main/example/http/dataset_example.sh

Summary

RAGFlow’s API-first design enables programmatic dataset management, automated document ingestion, and metadata enrichment at scale.

Perfect for ETL pipelines, CMS integration, or enterprise knowledge base automation.

With this API, you can manage tens of thousands of documents with full metadata — all programmatically.