Batch Processing and Metadata Management in Infiniflow RAGFlow
Infiniflow RAGFlow provides a RESTful API (/api/v1) that enables programmatic control over datasets and documents, making it ideal for batch processing large volumes of documents, automated ingestion pipelines, and metadata enrichment.
This is essential in enterprise settings where thousands of PDFs, reports, or web pages need to be:
- Ingested in bulk
- Tagged with structured metadata (author, date, source, category, etc.)
- Updated post-ingestion
- Queried or filtered later via the RAG system
---
API Base URL
```
http://<RAGFLOW_HOST>/api/v1
```
Authentication
All requests require a Bearer token:
```
Authorization: Bearer ragflow-<your-token>
```
Tip: Obtain token via login or API key management in the RAGFlow UI.
---
Step 1: Retrieve Dataset and Document IDs
Before updating, you must know the target:
- Dataset ID (e.g., f388c05e9df711f0a0fe0242ac170003)
- Document ID (e.g., 4920227c9eb711f0bff40242ac170003)
List all datasets:

```shell
curl -H "Authorization: Bearer ragflow-..." \
  http://192.168.0.213/api/v1/datasets
```

List documents in a dataset:

```shell
curl -H "Authorization: Bearer ragflow-..." \
  http://192.168.0.213/api/v1/datasets/<dataset_id>/documents
```
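In Python, the dataset listing can be turned into a name-to-ID lookup so later calls never depend on filenames. A minimal sketch, assuming the response follows the `{"code": 0, "data": [...]}` shape used elsewhere in this article; `index_datasets` and the sample payload are illustrative, not part of the API:

```python
def index_datasets(response_json):
    """Map dataset name -> dataset ID from a list-datasets response."""
    if response_json.get("code") != 0:
        raise RuntimeError(f"API error: {response_json.get('message')}")
    return {d["name"]: d["id"] for d in response_json.get("data", [])}


# Illustrative response shape; field names assumed from the API's
# {"code": 0, "data": [...]} convention:
sample = {
    "code": 0,
    "data": [{"id": "f388c05e9df711f0a0fe0242ac170003", "name": "reports"}],
}
print(index_datasets(sample))  # {'reports': 'f388c05e9df711f0a0fe0242ac170003'}

# In practice, combine with the GET call shown above:
# name_to_id = index_datasets(
#     requests.get(f"{BASE_URL}/datasets", headers=HEADERS).json())
```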
---
Step 2: Add Metadata to a Document (via PUT)
Use the PUT endpoint to update metadata of an existing document:

```shell
curl --request PUT \
  --url http://192.168.0.213/api/v1/datasets/f388c05e9df711f0a0fe0242ac170003/documents/4920227c9eb711f0bff40242ac170003 \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm' \
  --data '{
    "meta_fields": {
      "author": "Example Author",
      "publish_date": "2025-01-01",
      "category": "AI Business Report",
      "url": "https://example.com/report.pdf"
    }
  }'
```
Request Breakdown:
- Method: PUT
- Path: /api/v1/datasets/<dataset_id>/documents/<document_id>
- Content-Type: application/json
- Body: JSON object with a "meta_fields" object
Response (on success):

```json
{
  "code": 0,
  "message": "Success",
  "data": { "document_id": "4920227c9eb711f0bff40242ac170003" }
}
```
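The same update can be issued from Python. As a sketch, the request pieces can be assembled with a small helper before handing them to an HTTP client; `build_update` is a hypothetical name, while the endpoint path and `meta_fields` body mirror the curl call above:

```python
import json


def build_update(base_url, token, dataset_id, doc_id, meta_fields):
    """Assemble URL, headers, and JSON body for a document metadata update."""
    url = f"{base_url}/datasets/{dataset_id}/documents/{doc_id}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"meta_fields": meta_fields})
    return url, headers, body


url, headers, body = build_update(
    "http://192.168.0.213/api/v1",
    "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm",
    "f388c05e9df711f0a0fe0242ac170003",
    "4920227c9eb711f0bff40242ac170003",
    {"author": "Example Author", "publish_date": "2025-01-01"},
)
print(url)
# Send with: requests.put(url, headers=headers, data=body)
```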
---
Use Case: Batch Metadata Enrichment
You can automate metadata tagging for thousands of documents with a script:

```python
import requests
import pandas as pd

BASE_URL = "http://192.168.0.213/api/v1"
TOKEN = "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Load a CSV with dataset_id, document_id, author, publish_date,
# category, and source_url columns.
df = pd.read_csv("documents_metadata.csv")

for _, row in df.iterrows():
    dataset_id = row["dataset_id"]
    doc_id = row["document_id"]
    payload = {
        "meta_fields": {
            "author": row["author"],
            "publish_date": row["publish_date"],
            "category": row["category"],
            "url": row["source_url"],
        }
    }
    # json= serializes the payload and sets Content-Type: application/json.
    resp = requests.put(
        f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
        headers=HEADERS,
        json=payload,
    )
    print(doc_id, resp.json().get("message"))
```
Benefits:
- Enrich RAG context with structured, queryable metadata
- Enable filtering in UI or API (e.g., “Show reports from 2025 by Author X”)
- Improve traceability and auditability
---
Other Batch-Capable Endpoints
| Endpoint | Purpose |
|----------|---------|
| POST /api/v1/datasets | Create new dataset |
| POST /api/v1/datasets/{id}/documents | Upload new documents (with metadata) |
| DELETE /api/v1/datasets/{id}/documents/{doc_id} | Remove document |
| GET /api/v1/datasets/{id}/documents | List + filter by metadata |
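For the upload endpoint in the table, RAGFlow's HTTP examples send the document as a multipart `file` form field. A hedged sketch, with the helper name `prepare_upload` being illustrative rather than part of the API:

```python
def prepare_upload(filename, data):
    """Build the multipart 'files' mapping for POST /datasets/{id}/documents.

    Assumes the upload endpoint expects the document as a 'file' form field,
    as in RAGFlow's HTTP examples.
    """
    return {"file": (filename, data)}


files = prepare_upload("report.pdf", b"%PDF-1.7 ...")
print(sorted(files))  # ['file']

# Send with, e.g.:
# requests.post(f"{BASE_URL}/datasets/{dataset_id}/documents",
#               headers={"Authorization": f"Bearer {TOKEN}"}, files=files)
```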
---
Best Practices
- Always use IDs — never rely on filenames
- Batch in chunks (e.g., 100 documents at a time) and throttle requests to avoid rate limits
- Validate metadata schema in RAGFlow settings first
- Log responses for retry logic
- Use dataset-level permissions for access control
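The chunking and retry advice above can be sketched as two small helpers (the names `chunked` and `with_retries` are illustrative):

```python
import time


def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def with_retries(fn, attempts=3, delay=0.5):
    """Call fn(); retry on exception with a fixed delay, re-raising at the end."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)


# 250 document IDs split into batches of 100:
batches = list(chunked(list(range(250)), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each per-document PUT can be wrapped as `with_retries(lambda: update(doc_id))`, with the response logged between batches.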
---
See also:
https://github.com/infiniflow/ragflow/blob/main/example/http/dataset_example.sh
Summary
RAGFlow’s API-first design enables:
- Scalable batch ingestion
- Rich metadata attachment
- Full automation of document lifecycle
Perfect for ETL pipelines, CMS integration, or enterprise knowledge base automation.
With this API, you can manage tens of thousands of documents with full metadata — all programmatically.