Batch Processing and Metadata Management in Infiniflow RAGFlow
Infiniflow RAGFlow provides a RESTful API (/api/v1) that enables programmatic control over datasets and documents, making it ideal for batch processing large volumes of documents, automated ingestion pipelines, and metadata enrichment.
This is essential in enterprise settings where thousands of PDFs, reports, or web pages need to be:
- Ingested in bulk
- Tagged with structured metadata (author, date, source, category, etc.)
- Updated post-ingestion
- Queried or filtered later via the RAG system
---
API Base URL
```
http://<RAGFLOW_HOST>/api/v1
```
Authentication
All requests require a Bearer token:
```
Authorization: Bearer ragflow-<your-token>
```
> Tip: Obtain a token via login or API key management in the RAGFlow UI.
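All examples in this guide reuse the same host and token. A minimal Python sketch of a reusable authenticated client (host and token are placeholders, not values from your deployment):

```python
import requests

BASE_URL = "http://192.168.0.213/api/v1"  # replace with your RAGFlow host
TOKEN = "ragflow-<your-token>"            # from the RAGFlow UI

# A session attaches the Bearer token to every request automatically.
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {TOKEN}"})

print(session.get(f"{BASE_URL}/datasets").json())
```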
---
Step 1: Retrieve Dataset and Document IDs
Before updating, you must know the target:
- Dataset ID (e.g., f388c05e9df711f0a0fe0242ac170003)
- Document ID (e.g., 4920227c9eb711f0bff40242ac170003)
List all datasets:
```bash
curl -H "Authorization: Bearer ragflow-..." \
  http://192.168.0.213/api/v1/datasets
```
List documents in a dataset:
```bash
curl -H "Authorization: Bearer ragflow-..." \
  http://192.168.0.213/api/v1/datasets/<dataset_id>/documents
```
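For a batch job you typically want every (dataset_id, document_id) pair up front. A sketch that walks all datasets and their documents, assuming the list endpoints accept `page`/`page_size` query parameters and return their payload under `data` (documents under `data.docs`) as in the current HTTP API reference; verify the field names against your RAGFlow version:

```python
import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-<your-token>"}

datasets = requests.get(f"{BASE_URL}/datasets", headers=HEADERS).json()["data"]
for ds in datasets:
    page = 1
    while True:
        body = requests.get(
            f"{BASE_URL}/datasets/{ds['id']}/documents",
            headers=HEADERS,
            params={"page": page, "page_size": 100},
        ).json()
        docs = body["data"].get("docs", [])
        if not docs:
            break
        for doc in docs:
            print(ds["id"], doc["id"], doc.get("name"))
        page += 1
```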
---
Step 2: Add Metadata to a Document (via PUT)
Use the PUT endpoint to update the metadata of an existing document:
```bash
curl --request PUT \
  --url http://192.168.0.213/api/v1/datasets/f388c05e9df711f0a0fe0242ac170003/documents/4920227c9eb711f0bff40242ac170003 \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm' \
  --data '{
    "meta_fields": {
      "author": "Example Author",
      "publish_date": "2025-01-01",
      "category": "AI Business Report",
      "url": "https://example.com/report.pdf"
    }
  }'
```
Request breakdown:
- Method: PUT
- Path: /api/v1/datasets/<dataset_id>/documents/<document_id>
- Content-Type: application/json
- Body: a JSON object with a "meta_fields" map of key-value pairs
Response (on success):
```json
{
  "code": 0,
  "message": "Success",
  "data": { "document_id": "4920227c9eb711f0bff40242ac170003" }
}
```
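Since every endpoint wraps its result in this `code`/`message` envelope, a small helper can fail fast before a batch run silently skips errors. A minimal sketch:

```python
def check(resp):
    """Raise unless both HTTP and the RAGFlow `code` field signal success."""
    resp.raise_for_status()
    body = resp.json()
    if body.get("code") != 0:
        raise RuntimeError(f"RAGFlow error {body.get('code')}: {body.get('message')}")
    return body.get("data")
```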
---
Use Case: Batch Metadata Enrichment
You can automate metadata tagging for thousands of documents with a script:
```python
import requests
import pandas as pd

BASE_URL = "http://192.168.0.213/api/v1"
TOKEN = "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm"
# Do not set Content-Type by hand; requests adds the right header for json= payloads.
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Load a CSV with one row per document: dataset_id, document_id, author,
# publish_date, category, source_url.
df = pd.read_csv("documents_metadata.csv")

for _, row in df.iterrows():
    payload = {
        "meta_fields": {
            "author": row["author"],
            "publish_date": row["publish_date"],
            "category": row["category"],
            "url": row["source_url"],
        }
    }
    resp = requests.put(
        f"{BASE_URL}/datasets/{row['dataset_id']}/documents/{row['document_id']}",
        headers=HEADERS,
        json=payload,  # sent as Content-Type: application/json
    )
    print(row["document_id"], resp.json().get("message"))
```
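For reference, the script above expects `documents_metadata.csv` to carry one row per document with these columns (the values below are illustrative):

```csv
dataset_id,document_id,author,publish_date,category,source_url
f388c05e9df711f0a0fe0242ac170003,4920227c9eb711f0bff40242ac170003,Example Author,2025-01-01,AI Business Report,https://example.com/report.pdf
```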
Benefits:
- Enrich RAG context with structured, queryable metadata
- Enable filtering in the UI or API (e.g., “Show reports from 2025 by Author X”)
- Improve traceability and auditability
---
Other Batch-Capable Endpoints
| Endpoint | Purpose |
|----------|---------|
| POST /api/v1/datasets | Create a new dataset |
| POST /api/v1/datasets/{id}/documents | Upload new documents (with metadata) |
| DELETE /api/v1/datasets/{id}/documents/{doc_id} | Remove a document |
| GET /api/v1/datasets/{id}/documents | List and filter by metadata |
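Uploading is a multipart file POST; whether metadata can ride along in the same request depends on your RAGFlow version, so a safe pattern is upload-then-tag. A sketch (the response shape, with `data` as a list of created document records, follows the current HTTP API reference; verify against your version):

```python
import requests

BASE_URL = "http://192.168.0.213/api/v1"
HEADERS = {"Authorization": "Bearer ragflow-<your-token>"}
dataset_id = "f388c05e9df711f0a0fe0242ac170003"

# Step 1: upload the file (requests sets the multipart boundary itself).
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/datasets/{dataset_id}/documents",
        headers=HEADERS,
        files={"file": f},
    )
doc = resp.json()["data"][0]

# Step 2: attach metadata to the freshly created document.
requests.put(
    f"{BASE_URL}/datasets/{dataset_id}/documents/{doc['id']}",
    headers=HEADERS,
    json={"meta_fields": {"source": "batch upload"}},
)
```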
---
Best Practices
- Always address datasets and documents by ID; never rely on filenames
- Batch in chunks (e.g., 100 documents per batch) and throttle requests to avoid rate limits (a chunking-and-retry sketch follows this list)
- Validate the metadata schema in RAGFlow settings first
- Log responses to support retry logic
- Use dataset-level permissions for access control
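A sketch of the chunking-and-retry pattern referenced above; `update_fn` is a hypothetical callable wrapping the PUT request from the batch script, not a RAGFlow API:

```python
import time

def update_with_retry(update_fn, item, retries=3, backoff=2.0):
    """Call update_fn(item), retrying with linear backoff on failure."""
    for attempt in range(1, retries + 1):
        try:
            return update_fn(item)
        except Exception as exc:
            print(f"attempt {attempt} failed for {item}: {exc}")
            time.sleep(backoff * attempt)
    raise RuntimeError(f"giving up on {item}")

def process_in_chunks(items, update_fn, chunk_size=100, pause=1.0):
    """Submit items in fixed-size chunks, pausing between chunks to throttle."""
    for i in range(0, len(items), chunk_size):
        for item in items[i : i + chunk_size]:
            update_with_retry(update_fn, item)
        time.sleep(pause)
```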
---
See also:
https://github.com/infiniflow/ragflow/blob/main/example/http/dataset_example.sh
Summary
RAGFlow’s API-first design enables:
- Scalable batch ingestion
- Rich metadata attachment
- Full automation of the document lifecycle
> Perfect for ETL pipelines, CMS integration, or enterprise knowledge base automation.
With this API, you can manage tens of thousands of documents with full metadata — all programmatically.