Search strategies

table.search(query, ...) runs a retrieval query against a table. The strategy argument and the other keyword arguments listed below control how the search runs.

Basic search

results = articles.search("Nikola Tesla alternating current", top_k=3)

Strategy reference

`strategy`	Behavior
(default)	Hybrid-style: sparse + dense when vectors exist
`"sparse"`, `"bm25"`, `"text"`	Lexical BM25 only
`"dense"`, `"vector"`, `"hnsw"`	Dense ANN (HNSW); needs embeddings or proxy vector
`"diskann"`	On-disk ANN graph
`"hyde"`	HYDE expansion path
`"crag"`	CRAG filtering path
`"distributed"`	Parallel segment scan
`"graph"`, `"hybrid"`	Enables graph expansion (see below)
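The default row describes a hybrid of sparse and dense retrieval, and the provenance example further down this page refers to RRF fusion. As a rough illustration of what reciprocal rank fusion does in general, the sketch below merges two ranked ID lists. It is not ToraDB's implementation: the function name rrf_merge, the constant k=60, and the sample IDs are assumptions made for this example.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    where rank starts at 1. k=60 is a commonly used constant, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidate lists from a lexical and a dense retriever.
bm25_ids = [42, 77, 13]
dense_ids = [42, 13, 99]
merged = rrf_merge([bm25_ids, dense_ids])
print([doc_id for doc_id, _ in merged])
```

Documents that appear in both lists (42 and 13 here) rise above documents found by only one retriever, which is the usual reason to fuse sparse and dense candidates.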

Parameters

Parameter	Default	Description
`top_k`	`20`	Max results returned
`offset`	`0`	Skip first N hits (pagination)
`explain`	`False`	Run the search with provenance; returns a structured DAG of candidate flow
`graph_expand`	`False`	Graph neighbor expansion
`depth`	`2`	Graph expansion depth
`query_vector`	`None`	Query embedding for dense / DiskANN
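Since top_k and offset together give pagination, a small loop can walk through result pages. The sketch below is hedged: iter_pages and FakeTable are hypothetical names, and it assumes the object returned by search() supports len(), which this page does not confirm. FakeTable only stands in for a real table so the example runs on its own.

```python
def iter_pages(table, query, page_size=20, max_pages=None):
    """Yield successive result pages using top_k and offset.

    Assumes the object returned by table.search() supports len(); that is
    an assumption of this sketch, not documented behavior.
    """
    page = 0
    while max_pages is None or page < max_pages:
        hits = table.search(query, top_k=page_size, offset=page * page_size)
        if len(hits) == 0:
            break
        yield hits
        if len(hits) < page_size:
            break  # short page: nothing left to fetch
        page += 1


class FakeTable:
    """Stand-in for a real table so the sketch runs without ToraDB."""

    def __init__(self, rows):
        self.rows = rows

    def search(self, query, top_k=20, offset=0):
        return self.rows[offset:offset + top_k]


pages = list(iter_pages(FakeTable(list(range(5))), "tesla", page_size=2))
print(pages)
```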

Retrieval provenance (`explain=True`)

ToraDB is the only retrieval database that shows its work. When explain=True, every search returns a structured provenance trace alongside results — showing exactly which documents were considered and why each was kept or dropped at every tier.

import json

results = docs.search("why did the 2008 financial crisis happen", top_k=5, explain=True)

prov = json.loads(results.provenance)

# What BM25 found in Tier 1
print(prov["tier1"]["bm25_candidates"])
# [{"id": 42, "score": 0.91}, {"id": 77, "score": 0.88}, ...]

# What RRF fusion produced in Tier 2
print(prov["tier2"]["rrf_merged"])
# [{"id": 42, "score": 0.94}, ...]

# What was dropped and why
print(prov["tier2"]["drops"])
# [{"id": 77, "stage": "crag_filter", "reason": "crag median filter"}]

# End-to-end latency
print(prov["total_latency_ms"])
# 3.4

The provenance record is also appended to <table>/_search_log.ndjson on disk for cross-query analysis. See Retrieval provenance for the full schema and use cases.
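Because the log is newline-delimited JSON, cross-query analysis can start with a plain line-by-line read. The following is a minimal sketch, assuming each log line is one JSON object shaped like the provenance example above (total_latency_ms and tier2.drops); the real schema is described under Retrieval provenance and may differ. The function name summarize_search_log and the sample records are made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_search_log(path):
    """Return (mean latency in ms, Counter of drop stages) for an NDJSON log."""
    latencies = []
    drop_stages = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "total_latency_ms" in record:
                latencies.append(record["total_latency_ms"])
            for drop in record.get("tier2", {}).get("drops", []):
                drop_stages[drop.get("stage", "unknown")] += 1
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return mean_latency, drop_stages


# Two made-up records written to a temporary file for demonstration.
sample = [
    {"total_latency_ms": 3.0, "tier2": {"drops": [{"id": 77, "stage": "crag_filter"}]}},
    {"total_latency_ms": 5.0, "tier2": {"drops": []}},
]
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "_search_log.ndjson")
    with open(log_path, "w", encoding="utf-8") as fh:
        for rec in sample:
            fh.write(json.dumps(rec) + "\n")
    mean_ms, stages = summarize_search_log(log_path)

print(mean_ms, dict(stages))
```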

Examples

Metadata-style query (field filters in query string):

articles.search("year:1888", top_k=5)

Hybrid with provenance and graph expansion:

import json

results = articles.search(
    "Nikola Tesla wireless",
    top_k=5,
    strategy="hybrid",
    explain=True,
    graph_expand=True,
    depth=2,
)
# The provenance trace is a JSON string; parse it before inspecting tiers.
prov = json.loads(results.provenance)
print(f"BM25 candidates: {len(prov['tier1']['bm25_candidates'])}")
print(f"Drops at tier2: {len(prov['tier2']['drops'])}")

Dense search with explicit vector:

papers.search(
    "Tesla coil resonant",
    top_k=2,
    strategy="dense",
    query_vector=[0.9, 0.1, 0.0, 0.0],
)

DiskANN (build the graph on a table with enough rows, typically 32 or more):

ann.search(
    "vector doc",
    top_k=3,
    strategy="diskann",
    query_vector=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    explain=True,
)

Performance (large segment-only tables)

After bulk ingest, the first search may mmap many segment indexes (cold start). Mitigations:

Use the serving profile settings: set TORADB_CACHE_AUTO=1, or set TORADB_CACHE_INDEX_BYTES high enough to hold the hot segment sidecars (see Production serving profiles).
Enable TORADB_WARMUP_ON_START=1 so the API warms indexes in the background.
Run toradb-ingest resume after ingest so TBM3 block-max sidecars, lexicons, and bm25.route.bin exist (required once after upgrading index format).
Use explain=True to inspect the provenance trace and identify which tier is slowest.
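To check whether a mitigation actually helps, it can be useful to compare the first (cold) search with later (warm) ones. The sketch below times repeated calls with the standard library only. time_searches and FakeSearch are hypothetical names; FakeSearch merely simulates a slow first call so the example runs without a ToraDB table.

```python
import statistics
import time


def time_searches(search_fn, query, repeats=5):
    """Return (first_call_ms, median_of_later_calls_ms) for repeated searches."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    later = timings[1:]
    return timings[0], (statistics.median(later) if later else None)


class FakeSearch:
    """Stand-in callable whose first call is slow, mimicking a cold start."""

    def __init__(self):
        self.warm = False

    def __call__(self, query):
        if not self.warm:
            time.sleep(0.05)  # simulated cold-start cost
            self.warm = True
        return []


cold_ms, warm_ms = time_searches(FakeSearch(), "Nikola Tesla")
print(f"cold: {cold_ms:.1f} ms, warm median: {warm_ms:.3f} ms")
```

With a real table, the callable could be something like lambda q: articles.search(q, top_k=3), run once before and once after changing the cache or warmup settings.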

Stream search

from toradb.table import stream_search

for batch in stream_search(articles, "Nikola Tesla", batch_size=2):
    print(batch.to_pandas())
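When only the first few batches are needed, the loop can be capped with itertools.islice. This is a sketch under an assumption: that stream_search returns a lazy iterator, which the for loop above suggests but this page does not state. first_batches and fake_stream are hypothetical helpers; fake_stream stands in for stream_search so the example is self-contained.

```python
from itertools import islice


def first_batches(batches, limit):
    """Return at most `limit` batches from an iterable of batches."""
    return list(islice(batches, limit))


def fake_stream(rows, batch_size):
    """Stand-in generator that yields rows in fixed-size batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


head = first_batches(fake_stream(["a", "b", "c", "d", "e"], batch_size=2), limit=2)
print(head)
```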

See Table API for the full signature.