Python records
Pass dicts with a text field, or plain strings:
Files
- chunk_by="paragraph": split on blank lines (default)
- chunk_by="line": one document per non-empty line
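To make the two chunking modes concrete, here is a minimal standalone sketch of how such splitting could work. The function name chunk_text is hypothetical and this is not the library's implementation; its handling of whitespace and edge cases may differ.

```python
import re


def chunk_text(text: str, chunk_by: str = "paragraph") -> list[str]:
    """Split raw file text into documents (illustrative, not the library's code)."""
    if chunk_by == "paragraph":
        # A paragraph break is a newline, optional whitespace, then another newline.
        parts = re.split(r"\n\s*\n", text)
    elif chunk_by == "line":
        parts = text.splitlines()
    else:
        raise ValueError(f"unsupported chunk_by: {chunk_by!r}")
    # Drop empty pieces so blank input never produces empty documents.
    return [p.strip() for p in parts if p.strip()]


sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n"
print(chunk_text(sample))          # two documents
print(chunk_text(sample, "line"))  # three documents
```

Paragraph mode keeps multi-line passages together, which usually suits prose; line mode fits logs or one-record-per-line exports.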
Pandas
PyArrow (zero-copy path)
Requires pyarrow:
When the Table.add_arrow binding is available, ingestion uses the Arrow C Data interface directly.
Bulk load (100k+ rows)
For production-scale loads from Parquet or JSONL, prefer the Rust ingest CLI (no Python on the hot path):
toradb-ingest finish / resume rebuild segment BM25 sidecars without an active bulk session. Hugging Face downloads can stay in Python; point --source parquet at exported shards.
For smaller loads or HF streaming, use the Python bulk session:
- Call begin_bulk_ingest before the first add / add_arrow on that table.
- Always call finish_bulk_ingest when done. It builds deferred segment BM25 sidecars, merges table indexes, and reloads texts for search.
- reindex_bm25=True is only needed if you skipped BM25 during finish (--no-reindex on the demo ingest script).
- While finish runs, progress is written to {table}/indexes/build_status.json. The Python SDK exposes db.index_build_status(table) without loading the full corpus.
- After a crash during finish, call db.resume_index_build("docs") to continue idempotently from Parquet + build_manifest.json.
- Crash window: batches flushed during bulk ingest are on disk (Parquet + manifest). If the process dies before finish_bulk_ingest completes, search may be incomplete until you call finish_bulk_ingest or resume_index_build. WAL entries may be buffered until batch fsync at finish when using fast bulk (defer_wal_fsync).
- Text-only tables skip dense index rebuilds automatically even outside bulk mode.
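One way to honor the "always call finish" rule is to wrap the session in a context manager. The sketch below is hypothetical: it assumes begin_bulk_ingest and finish_bulk_ingest are methods on the database handle that take the table name, which may not match the real signatures. RecordingDB is a stand-in that only records calls so the example runs without the SDK.

```python
from contextlib import contextmanager


@contextmanager
def bulk_session(db, table: str):
    """Begin bulk ingest on entry and always finish on exit.

    Assumes db.begin_bulk_ingest(table) and db.finish_bulk_ingest(table);
    the real signatures may differ.
    """
    db.begin_bulk_ingest(table)
    try:
        yield db
    finally:
        # Runs on success and on a Python exception, so deferred index
        # builds are not skipped for batches that were already flushed.
        db.finish_bulk_ingest(table)


class RecordingDB:
    """Hypothetical stand-in that records calls; replace with a real handle."""

    def __init__(self):
        self.calls = []

    def begin_bulk_ingest(self, table):
        self.calls.append(("begin", table))

    def add(self, table, records):  # hypothetical signature
        self.calls.append(("add", table, len(records)))

    def finish_bulk_ingest(self, table):
        self.calls.append(("finish", table))


db = RecordingDB()
with bulk_session(db, "docs") as session:
    session.add("docs", ["first", "second"])
print(db.calls)
```

A context manager covers ordinary exceptions only. If the process is killed, the finally block never runs, which is the crash window the list describes and the case resume_index_build is meant for.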
API during indexing
The demo API reads build_status.json directly so /api/health stays available while finish runs in another process. /api/search returns 503 with index_building until state is ready.
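The gating behaviour described here can be sketched with the standard library alone. This is not the demo API's code. The helper name search_gate, the JSON layout (a top-level "state" key), the response body shape, and the handling of a missing or half-written file are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def search_gate(table_dir: Path) -> tuple[int, dict]:
    """Return (http_status, body) for a search request based on build_status.json."""
    status_path = table_dir / "indexes" / "build_status.json"
    try:
        status = json.loads(status_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        # Assumption: no status file means no build is in progress.
        return 200, {}
    except json.JSONDecodeError:
        # Assumption: an unreadable file means a build is rewriting it.
        return 503, {"error": "index_building"}
    if status.get("state") != "ready":
        return 503, {"error": "index_building", "state": status.get("state")}
    return 200, {}


with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp) / "docs"
    (table_dir / "indexes").mkdir(parents=True)
    status_file = table_dir / "indexes" / "build_status.json"
    status_file.write_text(json.dumps({"state": "building"}))
    print(search_gate(table_dir)[0])  # 503 while the build is running
    status_file.write_text(json.dumps({"state": "ready"}))
    print(search_gate(table_dir)[0])  # 200 once the state is ready
```

Reading the small status file rather than opening the table keeps the check cheap, so it can run on every request while another process does the heavy index build.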
Tips
- Batch large loads, then reindex if search quality lags.
- Hybrid tables must include embedding vectors of the declared dimension for dense search.
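A pre-ingest check can catch dimension mismatches before they reach the table. The sketch below assumes records are dicts carrying the vector under an "embedding" key. The helper name check_embedding_dim is hypothetical and not part of the SDK.

```python
from typing import Iterable, Mapping


def check_embedding_dim(records: Iterable[Mapping], dim: int) -> list[int]:
    """Return the positions of records whose embedding is missing or mis-sized."""
    bad = []
    for i, rec in enumerate(records):
        emb = rec.get("embedding")
        if emb is None or len(emb) != dim:
            bad.append(i)
    return bad


records = [
    {"text": "ok", "embedding": [0.1, 0.2, 0.3]},
    {"text": "too short", "embedding": [0.1, 0.2]},
    {"text": "missing"},
]
print(check_embedding_dim(records, dim=3))  # [1, 2]
```

Running this per batch before adding records gives a precise list of offending rows.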
