Ingest - ToraDB

Python records

Pass dicts with a text field, or plain strings:

table.add([
    {"text": "Nikola Tesla filed AC patents in 1888", "year": "1888", "tag": "patent"},
    "A plain string becomes the document body",
])

Other columns are stored as metadata for filtering and analytics.
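To make the two accepted shapes concrete, here is a minimal sketch of how a record could be split into a document body and metadata. The helper `split_record` is hypothetical and is not part of ToraDB; it only mirrors the behavior described above (a `text` field or a plain string becomes the body, every other key is treated as metadata).

```python
from typing import Any, Dict, Tuple, Union

Record = Union[str, Dict[str, Any]]


def split_record(record: Record) -> Tuple[str, Dict[str, Any]]:
    """Return (text, metadata) for one input record.

    Hypothetical helper for illustration only; ToraDB's own
    normalization may differ.
    """
    if isinstance(record, str):
        # A plain string is the whole document body, with no metadata.
        return record, {}
    if "text" not in record:
        raise ValueError("dict records need a 'text' field")
    # Everything except "text" is kept as metadata.
    metadata = {key: value for key, value in record.items() if key != "text"}
    return record["text"], metadata


records = [
    {"text": "Nikola Tesla filed AC patents in 1888", "year": "1888", "tag": "patent"},
    "A plain string becomes the document body",
]
normalized = [split_record(r) for r in records]
```

Running a check like this before `table.add` is one way to catch dicts that are missing `text` early.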

Files

from toradb.ingest import add_file

count = add_file(table, "data/sample.txt", chunk_by="paragraph")

chunk_by="paragraph" — split on blank lines (default)
chunk_by="line" — one document per non-empty line
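The two modes can be approximated in plain Python to preview how a file would be split before ingesting it. This is a sketch under assumptions: `chunk_text` is a hypothetical helper, and details such as whitespace trimming may not match what `add_file` actually does.

```python
import re
from typing import List


def chunk_text(text: str, chunk_by: str = "paragraph") -> List[str]:
    """Split text into chunks (hypothetical preview helper, not ToraDB code)."""
    if chunk_by == "paragraph":
        # A blank line (optionally containing whitespace) separates paragraphs.
        parts = re.split(r"\n\s*\n", text)
    elif chunk_by == "line":
        parts = text.splitlines()
    else:
        raise ValueError(f"unsupported chunk_by: {chunk_by!r}")
    # Drop empty chunks and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n"
paragraphs = chunk_text(sample, "paragraph")
lines = chunk_text(sample, "line")
```

With this sample, paragraph mode yields two chunks and line mode yields three, because the blank line is skipped.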

Pandas

import pandas as pd
from toradb.ingest import add_dataframe

df = pd.DataFrame([
    {"text": "Nikola Tesla wireless experiments", "tag": "science"},
])
add_dataframe(table, df)

PyArrow (zero-copy path)

Requires pyarrow:

pip install pyarrow

import pyarrow as pa
from toradb.ingest import add_arrow

batch = pa.table({
    "text": ["Nikola Tesla wireless energy vision"],
    "tag": ["vision"],
})
add_arrow(table, batch)

When the Rust Table.add_arrow binding is available, ingestion uses the Arrow C Data interface directly.

Bulk load (100k+ rows)

For production-scale loads from Parquet or JSONL, prefer the Rust ingest CLI (no Python on the hot path):

cargo build -p toradb-cli --release
./target/release/toradb-ingest bulk \
  --db ./data/msmarco_1m \
  --table passages \
  --source parquet \
  --path ./downloads/shards \
  --drop-table

The toradb-ingest finish and resume subcommands rebuild segment BM25 sidecars without an active bulk session. Hugging Face downloads can stay in Python; point --source parquet at the exported shards. For smaller loads or HF streaming, use the Python bulk session:

import toradb
from toradb.ingest import add_arrow

db = toradb.local("./big_db")
db.begin_bulk_ingest("docs")
table = db.create_table("docs", mode="text")  # or db.table("docs")

for batch in arrow_batches:  # 100k–200k rows per batch is typical with `--fast-bulk`
    add_arrow(table, batch)

db.finish_bulk_ingest("docs", compact=True)
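The loop above assumes `arrow_batches` already exists. If the source is a plain row iterator, one possible way to produce fixed-size batches is sketched below. The helper `batched` is hypothetical and not part of ToraDB; each yielded list would still need converting to Arrow (for example with `pa.Table.from_pylist`) before being passed to `add_arrow`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # islice consumes the next batch_size items without loading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250k rows at 100k per batch gives two full batches and one partial batch.
sizes = [len(batch) for batch in batched(range(250_000), 100_000)]
```

Because the helper pulls rows lazily, only one batch is held in memory at a time.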

Call begin_bulk_ingest before the first add / add_arrow on that table.
Always call finish_bulk_ingest when done — it builds deferred segment BM25 sidecars, merges table indexes, and reloads texts for search.
reindex_bm25=True is only needed if BM25 was skipped during finish (for example, with --no-reindex on the demo ingest script).
While finish runs, progress is written to {table}/indexes/build_status.json. The Python SDK exposes db.index_build_status(table) without loading the full corpus.
After a crash during finish, call db.resume_index_build("docs") to continue idempotently from Parquet + build_manifest.json.
Crash window: batches flushed during bulk ingest are on disk (Parquet + manifest). If the process dies before finish_bulk_ingest completes, search may be incomplete until you call finish_bulk_ingest or resume_index_build. With fast bulk (defer_wal_fsync), WAL entries may stay buffered until the batch fsync at finish.
Text-only tables skip dense index rebuilds automatically even outside bulk mode.
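One way to keep the begin/finish pairing from the notes above hard to forget is a small context manager. This is a sketch, not a ToraDB API: `bulk_session` is a hypothetical wrapper that relies only on the `begin_bulk_ingest` and `finish_bulk_ingest` calls shown earlier. It deliberately runs finish only when the body succeeds; if the body raises, the exception propagates and recovery (finish_bulk_ingest or resume_index_build) is left to the caller.

```python
from contextlib import contextmanager


@contextmanager
def bulk_session(db, table_name, compact=True):
    """Pair begin_bulk_ingest with finish_bulk_ingest (hypothetical wrapper).

    `db` is any object exposing the two bulk methods shown in this section.
    """
    db.begin_bulk_ingest(table_name)
    # If the body raises, the exception surfaces here and finish is skipped,
    # so the caller can decide how to recover.
    yield
    db.finish_bulk_ingest(table_name, compact=compact)


# Intended usage, assuming `db`, `table`, and `arrow_batches` as above:
#
# with bulk_session(db, "docs"):
#     for batch in arrow_batches:
#         add_arrow(table, batch)
```

Whether finish should also run after a partial failure depends on the workload, so treat the success-only behavior here as one design choice among several.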

API during indexing

The demo API reads build_status.json directly so /api/health stays available while finish runs in another process. /api/search returns 503 with index_building until state is ready.
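A reader that works the same way can be sketched in a few lines. Several details here are assumptions: that the status file lives under the database root at {table}/indexes/build_status.json, that it is a JSON object with a `state` key, and that the value `"ready"` marks a finished build (inferred from the sentence above, not from a documented schema). The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_build_state(db_root: str, table: str) -> Optional[str]:
    """Return the `state` value from build_status.json, or None if unavailable.

    The path layout and the `state` key are assumptions for illustration.
    """
    status_path = Path(db_root) / table / "indexes" / "build_status.json"
    try:
        with status_path.open(encoding="utf-8") as handle:
            status = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or partially written file: report "unknown" rather than fail.
        return None
    if not isinstance(status, dict):
        return None
    return status.get("state")


def is_ready(db_root: str, table: str) -> bool:
    """True only when the assumed state value is "ready"."""
    return read_build_state(db_root, table) == "ready"
```

Reading the file directly keeps the check cheap, since it never opens the table or loads the corpus.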

Tips

Batch large loads, then reindex if search quality lags.
Hybrid tables must include embedding vectors of the declared dimension for dense search.
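For the dimension requirement, a pre-flight check can flag bad rows before they reach a hybrid table. This is a sketch under assumptions: the field name `embedding` and the helper `find_bad_embeddings` are hypothetical, so substitute whatever column your table declares.

```python
from typing import Any, Dict, List, Sequence


def find_bad_embeddings(
    records: Sequence[Dict[str, Any]], dim: int, field: str = "embedding"
) -> List[int]:
    """Return indexes of records whose vector is missing or has the wrong length.

    Hypothetical helper; the `embedding` field name is an assumption.
    """
    bad = []
    for index, record in enumerate(records):
        vector = record.get(field)
        if vector is None or len(vector) != dim:
            bad.append(index)
    return bad


rows = [
    {"text": "a", "embedding": [0.1, 0.2, 0.3]},
    {"text": "b", "embedding": [0.1, 0.2]},  # wrong length
    {"text": "c"},  # no vector at all
]
bad_rows = find_bad_embeddings(rows, dim=3)
```

Here the second and third rows are reported, so they can be fixed or dropped before ingestion.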

See Ingest API.
