A method guide for deciding when to train your own retrieval encoder. Pins down what an encoder serves (relevance), how success is measured, and the levers that move nDCG, latency, and index size. Instantiates the formulation across web QA, legal, code, and biomedical retrieval. Cites GradCache, NV-Retriever, Gecko, GOR, Matryoshka, EmbeddingGemma, E5, and BGE.
Committed 2026-05-30: target nDCG@10 above a threshold fixed before building, within p95 latency, self-hosted cost, and index-size budgets.
The gap must be real and measurable on your own judged set, not a public benchmark. Measure the best off-the-shelf option on your own evaluation first; most “we need a custom model” instincts evaporate once the API model already clears the bar.
Each option cell is ranked 1 (best on that row) to 3 (worst); read down a column to characterize an option, or across a row to compare options on one metric.
| Decision metric | Lexical match (BM25) | Off-the-shelf API or open model | Train an in-house encoder |
|---|---|---|---|
| In-domain relevance (nDCG@10) | 3 · strong when query and document share wording; blind to synonyms and paraphrase | 2 · good on general language; degrades on niche domains the model never saw | 1 · highest on your domain once tuned to your own relevance labels |
| Domain vocabulary, jargon | 3 · exact term only; misses abbreviations and domain synonyms | 2 · general coverage; maps rare domain terms and acronyms wrong | 1 · learns domain terms directly from your training pairs |
| Query, document asymmetry (NL to code, claim to evidence) | 3 · fails; there is little shared surface form to match on | 2 · handles NL questions well; weaker on code or claim-evidence links | 1 · trainable on exactly your (query, document) structure |
| p95 latency | 1 · very low; an inverted-index lookup | 3 · network round-trip (API) or full model load (open weights) | 2 · small local model; latency you control via size and dimension |
| Cost at scale | 1 · negligible | 3 · per-call cost (API) or steady infrastructure cost (self-hosted open model) | 2 · near zero at inference once self-hosted; training cost is up front |
| Privacy, on-prem | 1 · fully local | 3 · API sends your data out; an open model can run local | 2 · fully on-prem, but you operate it |
| Index size, memory | 2 · small postings list | 3 · fixed (often large) dimension; no control | 1 · controllable via Matryoshka truncation and quantization |
| Setup and maintenance | 1 · trivial; no model to own | 2 · low (managed API) to moderate (self-hosted) | 3 · you own the data, the training run, and drift over time |
No column ranks 1 on every row, which is the point. A bi-encoder followed by a cross-encoder reranker sits on top of either of the last two columns when you need top-tier precision and can afford rerank latency over a retrieved candidate set.
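The two-stage arrangement described above can be sketched as follows. This is a minimal illustration, not a production design: `token_overlap` and `jaccard` are hypothetical toy scorers standing in for a bi-encoder and a cross-encoder, and the function and parameter names are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def token_overlap(query: str, doc: str) -> float:
    # Toy stand-in for a cheap first-stage scorer (hypothetical).
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    # Toy stand-in for an expensive second-stage scorer (hypothetical).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 10,
) -> List[str]:
    # Stage 1: score the whole corpus cheaply and keep a candidate set.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: spend the expensive scorer only on those candidates.
    reranked = sorted(
        candidates, key=lambda doc: expensive_score(query, doc), reverse=True
    )
    return reranked[:top_k]


corpus = [
    "how to parse a date string in python",
    "date palm care guide",
    "parse a date string",
    "unrelated text",
]
best = retrieve_then_rerank(
    "parse a date string", corpus, token_overlap, jaccard, n_candidates=2, top_k=1
)
```

Rerank latency grows with `n_candidates`, not with corpus size, which is why the reranker is affordable only over a retrieved candidate set.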
Define these for your domain before building. The method does not change across domains; these do. The relevance criterion is the load-bearing one; in every column, similar is not relevant.
| Axis | Web, QA | Legal | Code search | Biomedical | Your domain |
|---|---|---|---|---|---|
| Anchor (retrievable unit) | passage | statute section, case paragraph | function or snippet | paper abstract, passage | (fill in) |
| Query (form of the need) | NL question | legal issue, fact pattern | NL intent (“parse a date”) | clinical, research question | (fill in) |
| Relevance criterion | document potentially answers the query | controlling or persuasive authority on the issue | snippet implements or correctly uses the described functionality | passage reports evidence on the same entity, intervention, outcome | (fill in) |
| Hard negative (“close but wrong”) | on-topic passage that never answers | on-point case from the wrong jurisdiction | code with similar names doing something else | passage about a colliding gene symbol, acronym | (fill in) |
A near-duplicate, an off-jurisdiction case, a same-named function, or an acronym collision is topically close yet wrong. Training keys on exactly that distinction, so the hard-negative row decides how sharp the model becomes.
Objective (maximize): nDCG@10 (also Recall@k, MRR).
Constraints (stay under): p95 retrieval latency; $/1k queries; index size, memory.
What counts as a relevant judgment follows directly from your relevance criterion above and feeds nDCG. Ranking quality shows up at three stages: during training, where the contrastive (InfoNCE) loss stands in as a proxy for nDCG; in offline evaluation of the embedding space; and at deployment. Say which stage a reported number comes from.
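For reference, nDCG@k can be computed as in the sketch below. It assumes the linear-gain form of DCG (some evaluation toolkits use the exponential gain 2^rel − 1 instead), and the function names are illustrative rather than taken from any particular library.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    # Gain at 0-based position `rank` is discounted by log2(rank + 2),
    # so the first result is undiscounted.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_gains: Sequence[float], all_gains: Sequence[float], k: int = 10
) -> float:
    # ranked_gains: judged gain of each returned document, in ranked order.
    # all_gains: gains of every judged document for the query, including
    # relevant ones the system failed to return.
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

Passing `all_gains` separately matters: normalizing only against the returned documents would hide relevant documents that were never retrieved.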
| Metric | Before (baseline: off-the-shelf embedding) | After (target) |
|---|---|---|
| nDCG@10 (domain eval) | baseline value on your judged set | committed threshold |
| p95 latency | network round-trip (API) | local inference (commonly 5 to 10x lower) |
| $/1k queries | API per-call cost | approximately $0 self-hosted |
| index size | fp32 x full dim | int8 or binary x truncated dim (approximately an order of magnitude smaller) |
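The index-size row is plain arithmetic, sketched below. The corpus size and the 768 and 256 dimensions are hypothetical values chosen for illustration, and the function counts raw vector storage only, not ANN graph or metadata overhead.

```python
def index_bytes(n_vectors: int, dim: int, bits_per_value: int) -> int:
    # Raw vector storage only; index-structure overhead is not counted.
    return n_vectors * dim * bits_per_value // 8


N = 10_000_000  # hypothetical corpus size

baseline = index_bytes(N, dim=768, bits_per_value=32)   # fp32 x full dim
int8_256 = index_bytes(N, dim=256, bits_per_value=8)    # int8 x truncated dim
binary_768 = index_bytes(N, dim=768, bits_per_value=1)  # binary x full dim

print(f"fp32 x 768:   {baseline / 1e9:.2f} GB")
print(f"int8 x 256:   {int8_256 / 1e9:.2f} GB ({baseline / int8_256:.0f}x smaller)")
print(f"binary x 768: {binary_768 / 1e9:.2f} GB ({baseline / binary_768:.0f}x smaller)")
```

Under these assumed numbers the int8, truncated index is 12x smaller than the fp32 baseline and the binary index is 32x smaller, consistent with the "approximately an order of magnitude" figure in the table.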
| Input | Before (baseline retrieval) | After (domain-trained encoder) |
|---|---|---|
| Query uses a domain acronym | API model splits it into subwords, scores it unrelated, returns generic results | retrieves the document that defines and uses the acronym, ranked first |
| Paraphrased query with no term overlap | lexical baseline returns near-misses that repeat surface words | retrieves the passage that actually answers the rephrased need |
| Restated-question passage vs an answer passage | ranks the restatement top because it is most similar | ranks the answering passage top because it is relevant |
Each step has an MVP form (the cheapest path to a first measurement) and a Full form (the production build). Build the MVP end to end before deepening any single step.
Citation: Daft, Eventual Inc. · github.com/Eventual-Inc/Daft.
The training corpus is built before the trainer ever runs. A streaming, query-optimized data engine such as Daft chews through corpora far larger than RAM, runs heavy per-row work (tokenizing, embedding, LLM calls) as batched UDFs, and reads straight from S3, Parquet, or Hugging Face, scaling from a laptop to a cluster without code changes. The job here is to load, clean, deduplicate, and filter the raw, permissively licensed corpus into retrievable units.
MVP: a few thousand documents already on hand. Full: the full permissive corpus, deduplicated and quality-filtered as a streaming pipeline.
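For the MVP scale, the clean, deduplicate, and filter steps fit in a plain-Python generator, sketched below. This is a stand-in for illustration and does not use the Daft API; the `min_chars` threshold is an assumed value, and deduplication here is exact-match only (near-duplicate detection would need something like MinHash).

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants hash identically.
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs: Iterable[str], min_chars: int = 200) -> Iterator[str]:
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # quality filter: drop very short documents
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        # Yield the original casing with whitespace collapsed.
        yield re.sub(r"\s+", " ", doc).strip()
```

Because it is a generator, documents stream through one at a time; only the set of hashes is held in memory.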
Citation: E5 / CCPairs, Wang et al., arXiv 2212.03533 (preprint, not peer-reviewed).
For text embeddings the data and the pair-construction recipe matter more than the model architecture: E5 used a vanilla BERT backbone but a carefully consistency-filtered pair set (CCPairs) and beat models with far more parameters. Construct positives from document structure (title to abstract, heading to section, citation context) and filter for consistency.
MVP: heuristic title -> abstract pairs plus ~200 hand-checked examples. Full: heuristic backbone, synthetic queries (next step), and positive relabeling so the labeled positive is the passage that actually answers, not just the seed.
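The heuristic title-to-abstract backbone can be sketched as below. The `title` and `abstract` field names and the word-count thresholds are assumptions for illustration; this covers only the structural heuristic, not the consistency filtering that E5 applies on top.

```python
from typing import Dict, Iterable, List, Tuple


def title_abstract_pairs(
    docs: Iterable[Dict[str, str]],
    min_title_words: int = 3,
    min_abstract_words: int = 30,
) -> List[Tuple[str, str]]:
    pairs = []
    for doc in docs:
        title = (doc.get("title") or "").strip()
        abstract = (doc.get("abstract") or "").strip()
        # Very short titles ("Introduction", "Notes") make uninformative queries.
        if len(title.split()) < min_title_words:
            continue
        # Very short abstracts rarely contain enough content to be a positive.
        if len(abstract.split()) < min_abstract_words:
            continue
        pairs.append((title, abstract))
    return pairs
```

The roughly 200 hand-checked examples mentioned above are the sanity check on whatever thresholds you pick here.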
Citation: NV-Retriever, Moreira et al. (NVIDIA), arXiv 2407.15831 (preprint, not peer-reviewed).
In-batch negatives are mostly easy: random documents the model already scores low, so the gradient is tiny. The strongest signal comes from hard negatives, documents that look relevant but are not. Mine them by retrieving each query’s top passages with a teacher model. The catch is that those top hits are riddled with false negatives (actually-relevant passages that just are not labeled), and training on them poisons the model. NV-Retriever’s fix is positive-aware filtering: use the known positive’s score as an anchor and discard any candidate scoring above ~95% of it (the TopK-PercPos rule), which drove NV-Retriever-v1 to the top of the MTEB/BEIR retrieval leaderboard.
MVP: in-batch negatives only. Full: positive-aware mining, skip the very top neighbors, keep ~3 to 4 per query; optionally add LLM-generated counterfactual negatives.
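A minimal sketch of the positive-aware filter follows, assuming teacher scores are positive similarities (for example cosine scores in (0, 1]); with negative scores a ratio threshold would not behave as intended. The function name and the `skip_top` and `keep` parameters are illustrative, not NV-Retriever's actual interface.

```python
from typing import List, Tuple


def mine_hard_negatives(
    candidates: List[Tuple[str, float]],
    positive_id: str,
    positive_score: float,
    max_ratio: float = 0.95,
    skip_top: int = 0,
    keep: int = 4,
) -> List[str]:
    # Anything scoring at or above max_ratio * positive_score is treated as
    # a likely false negative and discarded.
    threshold = max_ratio * positive_score
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    eligible = [
        doc_id
        for doc_id, score in ranked
        if doc_id != positive_id and score < threshold
    ]
    # Optionally skip the hardest survivors, then keep a few per query.
    return eligible[skip_top : skip_top + keep]


teacher_hits = [
    ("d1", 0.79), ("d2", 0.75), ("d3", 0.60),
    ("pos", 0.80), ("d4", 0.77), ("d5", 0.40),
]
negatives = mine_hard_negatives(teacher_hits, "pos", positive_score=0.80, keep=2)
```

In this toy example `d1` and `d4` score above 95% of the positive's 0.80 and are dropped as suspected false negatives; the hardest remaining candidates are kept.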
Citation: Cached MNRL / GradCache, Gao et al., RepL4NLP 2021 · arXiv 2101.06983.
Train so relevant (query, document) pairs have high similarity, with MultipleNegativesRankingLoss (InfoNCE), a batch-wide softmax over in-batch negatives:
\[\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(s(q_i, d_i) / \tau)}{\sum_{j=1}^{B} \exp(s(q_i, d_j) / \tau)}\]
where $s(\cdot, \cdot)$ is the cosine or dot-product similarity, $\tau$ is the temperature, and $B$ is the batch size. Larger batches mean more negatives per step and better embeddings, but naive autograd caps batch size at GPU memory. GradCache decouples the loss (which needs only the tiny embedding vectors) from the encoder activations: embed without a graph, compute gradients on the embeddings, then re-embed one sub-batch at a time, giving an exact gradient at O(1) activation memory and effective batch sizes in the thousands on a single GPU.
MVP: small base model, bi-encoder, InfoNCE, 1 epoch. Full: cached contrastive training for a large effective batch, an isotropy regularizer (see Implementation Variance below), and a Matryoshka wrapper; optionally a decoder-to-encoder backbone with distillation.
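The InfoNCE loss in this step can be written out directly from a precomputed similarity matrix, as in the dependency-free sketch below. It shows the loss value only, not GradCache's gradient caching, and the default temperature of 0.05 is an assumed value rather than one taken from the cited work.

```python
import math
from typing import Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.05) -> float:
    # sim[i][j] = s(q_i, d_j); the diagonal holds the positive pairs and every
    # other entry in row i is an in-batch negative for query i.
    batch_size = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log(exp(logit_ii) / sum_j exp(logit_ij))
        total += log_denominator - logits[i]
    return total / batch_size
```

When all similarities are equal the loss is log B, the value for a model that cannot tell positives from negatives; it approaches 0 as the diagonal dominates each row.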
Evaluate on the held-out test set against the off-the-shelf baseline on the metrics above. The headline result is a data ablation: heuristic pairs, then plus synthetic, then plus hard negatives, so the lift is attributed to the data decisions that produced it rather than to the model alone.
The same target is approached two ways. Research contributes the individual levers, one paper per mechanism; Industry ships systems that combine several levers into one recipe. Read research for what each lever does and industry for how they are stacked in practice.
batch-size → nDCG · Citation: Cached MNRL / GradCache, Gao et al., RepL4NLP 2021 · arXiv 2101.06983.
More in-batch negatives produce better embeddings, and GradCache makes large effective batches feasible on one GPU by holding activation memory constant while keeping the gradient exact. This is the foundation the other levers build on.
nDCG (separation) · Citation: NV-Retriever, Moreira et al. (NVIDIA), arXiv 2407.15831 (preprint, not peer-reviewed).
Hard negatives sharpen the decision boundary far more than piling on easy ones, and let you train strong models with smaller batches. The decisive detail is positive-aware filtering to avoid training on false negatives.
recall @ low resource · Citation: Gecko, Lee et al. (Google), arXiv 2403.20327 (preprint, not peer-reviewed); E5-mistral, Wang et al., arXiv 2401.00368 (preprint, not peer-reviewed).
Quality comes from a pipeline, not a single prompt: generate queries anchored to real documents, diversified by a task taxonomy and attribute or persona conditioning; relabel the true positive by retrieval rather than trusting the seed (Gecko); then filter hard with round-trip consistency and reranker or judge scoring. E5-mistral reached state of the art trained almost entirely on synthetic pairs, showing the data can be the method.
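The round-trip consistency filter mentioned above can be sketched as follows: a synthetic query survives only if a retriever ranks the document it was generated from near the top. `rank_docs` is a hypothetical callable returning document ids best-first, and `top_k=3` is an assumed cutoff.

```python
from typing import Callable, List, Sequence, Tuple


def round_trip_filter(
    synthetic: Sequence[Tuple[str, str]],
    rank_docs: Callable[[str], List[str]],
    top_k: int = 3,
) -> List[Tuple[str, str]]:
    # synthetic: (generated_query, source_doc_id) pairs.
    kept = []
    for query, source_id in synthetic:
        # Keep the pair only if retrieval leads back to the source document.
        if source_id in rank_docs(query)[:top_k]:
            kept.append((query, source_id))
    return kept


# Toy retriever backed by a lookup table, for illustration only.
toy_rankings = {
    "what is ndcg": ["d1", "d2"],
    "weather today": ["d9", "d8", "d7", "d2"],
}
kept = round_trip_filter(
    [("what is ndcg", "d1"), ("weather today", "d2")],
    lambda q: toy_rankings[q],
)
```

Gecko-style relabeling goes one step further: instead of dropping the pair when the source document is not ranked first, it promotes the top-ranked passage to be the positive.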
nDCG + quantization-robustness · Citation: Global Orthogonal Regularization, Zhang et al., ICCV 2017 · arXiv 1708.06320.
Contrastive training never explicitly tells the model to use the whole space, so embeddings often clump into a narrow cone: wasted dimensions, hubness, and fragility under compression. GOR pushes non-matching pairs to behave like random points on the unit sphere (inner-product mean approximately 0, second moment approximately $1/d$), giving a fully-used space and embeddings that survive aggressive quantization.
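The two moment conditions translate into a small penalty term, sketched below in plain Python. It assumes unit-normalized embeddings and follows the form $M_1^2 + \max(0, M_2 - 1/d)$, where $M_1$ and $M_2$ are the mean and second moment of the inner products over non-matching pairs; the function name is illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def gor_penalty(non_matching_pairs: List[Tuple[Vector, Vector]]) -> float:
    # Each pair holds two unit-normalized embeddings that should NOT match.
    d = len(non_matching_pairs[0][0])
    dots = [sum(a * b for a, b in zip(u, v)) for u, v in non_matching_pairs]
    m1 = sum(dots) / len(dots)                 # mean inner product, target 0
    m2 = sum(x * x for x in dots) / len(dots)  # second moment, target <= 1/d
    return m1 ** 2 + max(0.0, m2 - 1.0 / d)
```

In training this term would be added to the contrastive loss with a small weight; the weight is a hyperparameter this sketch does not fix.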
latency / index-size · Citation: Matryoshka Representation Learning, Kusupati et al., NeurIPS 2022 · arXiv 2205.13147.
Matryoshka training makes leading prefixes of the vector usable standalone, so you pick the dimension per budget for free at inference. Combined with int8, binary, or PQ quantization it compresses the index by an order of magnitude, which is what makes billion-vector retrieval affordable.
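At inference, using a prefix amounts to slicing, renormalizing, and optionally quantizing, as in the sketch below. It assumes the model was trained with a Matryoshka objective (truncating an ordinary embedding this way generally costs much more quality), and the fixed int8 scale of 127 assumes unit-norm inputs; real systems often calibrate the scale on data.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    # Keep the leading `dim` coordinates, then restore unit length so that
    # dot products remain cosine similarities.
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def quantize_int8(vec: Sequence[float], scale: float = 127.0) -> List[int]:
    # Unit-norm vectors have coordinates in [-1, 1]; map them to [-127, 127].
    return [max(-127, min(127, round(x * scale))) for x in vec]


def binarize(vec: Sequence[float]) -> List[int]:
    # One bit per dimension: the sign of each coordinate.
    return [1 if x > 0 else 0 for x in vec]


short = truncate_and_normalize([3.0, 4.0, 12.0], dim=2)
codes = quantize_int8(short)
```

Renormalizing after truncation is the step most often forgotten; without it, shorter prefixes produce systematically smaller similarity scores.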
nDCG ceiling · Citation: EmbeddingGemma, Schechter Vera et al. (Google), arXiv 2509.20354 (preprint, not peer-reviewed).
Modern LLMs are decoder-only with a causal mask, but an embedding wants to see the whole text at once. The conversion un-masks attention to bidirectional, mean-pools token states into one vector, and contrastively fine-tunes, lifting the achievable quality ceiling by starting from a far stronger backbone.
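The mean-pooling step can be sketched on plain lists as below. It assumes per-token hidden states and an attention mask with 1 for real tokens and 0 for padding; the bidirectional un-masking and the contrastive fine-tuning happen inside the model and training loop and are not shown.

```python
from typing import List, Sequence


def mean_pool(
    token_states: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    # Average only over real tokens; padding positions must not dilute the vector.
    dim = len(token_states[0])
    total = [0.0] * dim
    count = 0
    for state, keep in zip(token_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(state):
                total[i] += value
    return [t / count for t in total] if count else total
```

In the example `mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])`, the third position is padding and is ignored, so the pooled vector is the average of the first two states.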
nDCG @ size · Citation: EmbeddingGemma, Schechter Vera et al. (Google), arXiv 2509.20354 (preprint, not peer-reviewed).
The capstone recipe: decoder-to-encoder conversion plus Matryoshka, GOR-style spread-out regularization, quantization-robustness, model souping, and distillation, topping MTEB in its size class. It stacks nearly every research lever above into one production model.
recall · Citation: BGE (C-Pack), Xiao et al., arXiv 2309.07597 (preprint, not peer-reviewed).
A three-stage pipeline: masked-autoencoder pretraining, then approximately 100M consistency-filtered weak pairs with in-batch negatives, then labeled triplets with mined hard negatives and task instructions.
data quality → nDCG · Citation: E5 / CCPairs, Wang et al., arXiv 2212.03533 (preprint, not peer-reviewed).
A vanilla BERT backbone trained on carefully consistency-filtered web pairs (CCPairs) beat embedding models with far more parameters, the canonical demonstration that data quality outweighs model scale.
nDCG (separation) · Citation: NV-Retriever, Moreira et al. (NVIDIA), arXiv 2407.15831 (preprint, not peer-reviewed).
Productionizes positive-aware hard-negative mining at scale (the TopK-PercPos rule with a threshold of approximately 0.95), which took NV-Retriever-v1 to the top of the MTEB/BEIR retrieval leaderboard.
latency / storage · Citation: OpenAI, Cohere, Voyage (product documentation).
Commercial embedding APIs ship Matryoshka-truncatable vectors so callers trade dimension count for storage and latency without retraining, moving a research lever directly into a product knob.