.. _ragflow_batch_api:

Batch Processing and Metadata Management in Infiniflow RAGFlow
==============================================================

**Infiniflow RAGFlow** provides a **RESTful API** (`/api/v1`) that enables **programmatic control** over datasets and documents, making it well suited to **batch processing large volumes of documents**, **automated ingestion pipelines**, and **metadata enrichment**. This matters in enterprise settings where thousands of PDFs, reports, or web pages need to be:

- Ingested in bulk
- Tagged with structured metadata (author, date, source, category, etc.)
- Updated post-ingestion
- Queried or filtered later via the RAG system

API Base URL
------------

.. code-block:: text

   http://<host>/api/v1

Authentication
--------------

All requests require a **Bearer token**:

.. code-block:: text

   Authorization: Bearer ragflow-<API_KEY>

.. tip::

   Obtain the token via login or the API key management page in the RAGFlow UI.

Step 1: Retrieve Dataset and Document IDs
-----------------------------------------

Before updating, you **must know** the target:

- **Dataset ID** (e.g., `f388c05e9df711f0a0fe0242ac170003`)
- **Document ID** (e.g., `4920227c9eb711f0bff40242ac170003`)

**List all datasets**:

.. code-block:: bash

   curl -H "Authorization: Bearer ragflow-..." \
        http://192.168.0.213/api/v1/datasets

**List documents in a dataset**:

.. code-block:: bash

   curl -H "Authorization: Bearer ragflow-..." \
        http://192.168.0.213/api/v1/datasets/<dataset_id>/documents

Step 2: Add Metadata to a Document (via PUT)
--------------------------------------------

Use the **PUT** endpoint to **update metadata** of an existing document:

.. code-block:: bash

   curl --request PUT \
        --url http://192.168.0.213/api/v1/datasets/f388c05e9df711f0a0fe0242ac170003/documents/4920227c9eb711f0bff40242ac170003 \
        --header 'Content-Type: application/json' \
        --header 'Authorization: Bearer ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm' \
        --data '{
          "meta_fields": {
            "author": "Example Author",
            "publish_date": "2025-01-01",
            "category": "AI Business Report",
            "url": "https://example.com/report.pdf"
          }
        }'

**Request Breakdown**:

- **Method**: `PUT`
- **Path**: `/api/v1/datasets/<dataset_id>/documents/<document_id>`
- **Content-Type**: `application/json`
- **Body**: a JSON object with a `"meta_fields"` object

**Response (on success)**:

.. code-block:: json

   {
     "code": 0,
     "message": "Success",
     "data": {
       "document_id": "4920227c9eb711f0bff40242ac170003"
     }
   }

Use Case: Batch Metadata Enrichment
-----------------------------------

You can **automate metadata tagging** for **thousands of documents** with a short script:

.. code-block:: python

   import pandas as pd
   import requests

   BASE_URL = "http://192.168.0.213/api/v1"
   TOKEN = "ragflow-QxNWIzMGNlOWRmMzExZjBhZjljMDI0Mm"
   HEADERS = {"Authorization": f"Bearer {TOKEN}"}

   # Example: load a CSV with doc_id, author, date, url...
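   # Assumed CSV layout (column names match the lookups below; adjust to
   # your own export):
   #   dataset_id,document_id,author,publish_date,category,source_url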
   df = pd.read_csv("documents_metadata.csv")

   for _, row in df.iterrows():
       dataset_id = row["dataset_id"]
       doc_id = row["document_id"]
       payload = {
           "meta_fields": {
               "author": row["author"],
               "publish_date": row["publish_date"],
               "category": row["category"],
               "url": row["source_url"],
           }
       }
       # json= serializes the payload and sets Content-Type: application/json
       resp = requests.put(
           f"{BASE_URL}/datasets/{dataset_id}/documents/{doc_id}",
           headers=HEADERS,
           json=payload,
       )
       print(doc_id, resp.json().get("message"))

**Benefits**:

- Enrich RAG context with **structured, queryable metadata**
- Enable **filtering** in the UI or API (e.g., “Show reports from 2025 by Author X”)
- Improve **traceability** and **auditability**

Other Batch-Capable Endpoints
-----------------------------

.. list-table::
   :header-rows: 1

   * - Endpoint
     - Purpose
   * - `POST /api/v1/datasets`
     - Create a new dataset
   * - `POST /api/v1/datasets/{id}/documents`
     - Upload new documents (with metadata)
   * - `DELETE /api/v1/datasets/{id}/documents/{doc_id}`
     - Remove a document
   * - `GET /api/v1/datasets/{id}/documents`
     - List documents and filter by metadata

Best Practices
--------------

1. **Always use IDs**, never filenames, to address documents
2. **Batch in chunks** (e.g., 100 documents per batch) and throttle requests to stay under rate limits
3. **Validate the metadata schema** in RAGFlow settings first
4. **Log responses** so failed updates can be retried (a retry sketch follows the summary below)
5. **Use dataset-level permissions** for access control

See also
--------

- https://github.com/infiniflow/ragflow/blob/main/example/http/dataset_example.sh

Summary
-------

RAGFlow's **API-first design** enables:

- **Scalable batch ingestion**
- **Rich metadata attachment**
- **Full automation** of the document lifecycle

   **Perfect for ETL pipelines, CMS integration, or enterprise knowledge base automation.**

With this API, you can manage **tens of thousands of documents** with full metadata, all programmatically.
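To put best practices 2 and 4 into code, the batch loop above can be wrapped with simple throttling and retries. The sketch below is illustrative, not an official RAGFlow client: the `update_with_retry` helper, the retry count, the backoff values, and the assumption that a non-zero `code` in the response body marks a retryable failure are all assumptions of this example.

.. code-block:: python

   import time

   import requests

   def update_with_retry(session: requests.Session, url: str, payload: dict,
                         retries: int = 3, backoff: float = 2.0) -> bool:
       """PUT the payload, retrying on network errors and non-zero response codes."""
       for attempt in range(1, retries + 1):
           try:
               resp = session.put(url, json=payload, timeout=30)
               body = resp.json()
               if resp.ok and body.get("code") == 0:
                   return True
               print(f"attempt {attempt}: HTTP {resp.status_code}, {body.get('message')}")
           except requests.RequestException as exc:
               print(f"attempt {attempt}: {exc}")
           time.sleep(backoff * attempt)  # grow the pause on each retry
       return False

With `HEADERS` from the batch script mounted on a session (`session = requests.Session(); session.headers.update(HEADERS)`), each row's update becomes a single call to `update_with_retry`, and document IDs that still fail after all retries can be written to a log file for a later re-run.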