How to compress a sentence-transformer into a static lookup table that runs on CPU in three lines of NumPy. A re-implementation guide covering the full pipeline: vocab prep, per-token forward pass, PCA whitening, SIF/Zipf weighting, optional vocab quantization, and dtype quantization, with a ~80-line reference implementation.
A static-embedding distillation algorithm, for ML engineers who want to re-implement it.
A sentence-transformer $f_\theta$ maps a token sequence $\mathbf{x} = (x_1,\dots,x_n)$ to a vector
\[\mathbf{e}(\mathbf{x}) \;=\; \text{Pool}\big(f_\theta(\mathbf{x})\big) \in \mathbb{R}^d.\]
The model is contextual: each token’s hidden state depends on its neighbours through attention. Self-attention costs $\mathcal{O}(n^2 d)$ per layer, so encoding a sentence costs $\mathcal{O}(L n^2 d)$ across $L$ layers, and you need a GPU to make it pleasant.
Model2Vec’s claim: for many downstream tasks (classification, retrieval, clustering), a sentence embedding is well-approximated by a weighted average of context-free token embeddings:
\[\hat{\mathbf{e}}(\mathbf{x}) \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{E}[x_i],\]
where $\mathbf{E} \in \mathbb{R}^{|V| \times d'}$ is a static lookup table, one vector per vocabulary entry. Inference becomes gather + mean, i.e. $\mathcal{O}(nd')$: no attention, no torch, CPU-friendly.
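As a minimal sketch of gather + mean, assuming a toy random table in place of a distilled one (the sizes and token ids are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4)).astype(np.float32)  # toy table: |V| = 10, d' = 4

def embed(token_ids, table):
    """Gather the row for each token id and average them."""
    return table[np.asarray(token_ids)].mean(axis=0)

sentence = [3, 7, 7, 1]        # hypothetical token ids; repeats count twice
e_hat = embed(sentence, E)     # shape (4,)
```

Nothing here depends on sentence length beyond the gather, which is where the $\mathcal{O}(nd')$ cost comes from.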
Three observations make this work better than you’d naively expect:
Mental model. We’re asking the teacher: “if this single token were a sentence on its own, what would your output be?” We do that once per token, cache the answer, and average at query time. PCA + SIF clean up the cache.
Start from the teacher’s tokenizer $T$ with vocabulary $V_T$. We produce a cleaned vocabulary $V$ on which we’ll generate static vectors.
- Remove unused tokens (those matching `[unused\d+]`, placeholder slots BERT-style models reserve).
- If the tokenizer exposes `adds_prefix_space`, set it to True. This makes "cat" and " cat" tokenize to the same id, so word-initial and word-internal occurrences share a vector. Surprisingly important.

Implementation note. You don’t need skeletoken to reproduce this; any tokenizer library that lets you mutate the vocab will do. The substantive transformations are: regex-filter, set prefix-space flag, de-dup.
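The regex-filter and de-dup transformations can be sketched on a plain list of token strings. This is an illustration only: `clean_vocab` is a hypothetical helper, and the prefix-space flag is left out because how it is set depends on the tokenizer library in use.

```python
import re

UNUSED = re.compile(r"^\[unused\d+\]$")

def clean_vocab(tokens):
    """Drop [unusedN] placeholders and exact duplicates, keeping first-seen order."""
    seen, kept = set(), []
    for t in tokens:
        if UNUSED.match(t) or t in seen:
            continue
        seen.add(t)
        kept.append(t)
    return kept

raw = ["[PAD]", "[unused0]", "[unused12]", "cat", "dog", "cat", "##s"]
cleaned = clean_vocab(raw)
```

Keeping first-seen order matters later, because the Zipf step reads frequency rank off the token ordering.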
For each token $v \in V$, construct the single-token input $\mathbf{x}_v$ (just the token id, no special tokens). Run the teacher and pool:
\[\mathbf{E}_v \;=\; \text{Pool}\big(f_\theta(\mathbf{x}_v)\big) \in \mathbb{R}^d.\]
Stack into $\mathbf{E} \in \mathbb{R}^{|V| \times d}$.
Pooling choice depends on the teacher’s training objective:
| Mode | Formula | Use when |
|---|---|---|
| MEAN | $\frac{1}{n}\sum_i h_i$ | BERT-style encoder trained with mean-pooling (most sentence-transformers). |
| FIRST | $h_{[\text{CLS}]}$ | BERT with CLS-token training. |
| LAST | $h_n$ (last non-pad) | Decoder-style LLMs (Qwen-embed, LLM2Vec). |
| POOLER | `pooler_output` | Models that expose a trained pooler head. |
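
The first three pooling modes can be sketched in NumPy on a batch of hidden states. This assumes right-padded batches and an attention mask of 0/1 values; `pool` is a hypothetical helper, and POOLER is omitted because it needs the model's own head.

```python
import numpy as np

def pool(hidden, mask, mode="mean"):
    """hidden: (B, n, d) hidden states; mask: (B, n), 1 for real tokens, 0 for padding."""
    mask = mask.astype(hidden.dtype)
    if mode == "mean":
        summed = (hidden * mask[:, :, None]).sum(axis=1)
        return summed / np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    if mode == "first":
        return hidden[:, 0]
    if mode == "last":
        last = mask.sum(axis=1).astype(int) - 1   # index of the last non-pad token
        return hidden[np.arange(hidden.shape[0]), last]
    raise ValueError(f"unknown pooling mode: {mode}")

h = np.arange(12, dtype=np.float32).reshape(1, 3, 4)  # one sequence, 3 positions, d = 4
m = np.array([[1, 1, 0]])                              # third position is padding
```

With this toy input, MEAN averages the first two positions, FIRST returns position 0, and LAST returns position 1.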
Practical tip: batch by length-sorting and pad; you’ll save 5–10× compute vs. naïve batching because token lengths in $V$ are highly skewed.
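One way to do the length-sorted batching, sketched on plain Python lists. The pad id of 0 and the helper name are assumptions for illustration; a real pipeline would use the tokenizer's pad id and also build an attention mask.

```python
def length_sorted_batches(sequences, batch_size, pad_id=0):
    """Yield (original_indices, padded_batch), grouping sequences of similar length."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(sequences[i]) for i in idx)
        padded = [sequences[i] + [pad_id] * (width - len(sequences[i])) for i in idx]
        yield idx, padded

seqs = [[5, 6, 7, 8], [1], [2, 3], [4]]
batches = list(length_sorted_batches(seqs, batch_size=2))
```

The returned indices let you scatter the pooled outputs back into their original row order.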
Compute the principal components of $\mathbf{E}$ and project:
\[\mathbf{E}' \;=\; (\mathbf{E} - \boldsymbol{\mu})\,\mathbf{W}_{\text{PCA}}, \quad \mathbf{W}_{\text{PCA}} \in \mathbb{R}^{d \times d'}.\]
Two cases:
Why PCA is the biggest quality lever. The teacher’s hidden space has a “dominant direction”: most tokens cluster along a single principal axis (the “rogue dimension” phenomenon). Subtracting the mean and rotating to the PC basis spreads the cloud out so that cosine distances are informative again.
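The effect is easy to see on synthetic data. The sketch below uses a plain SVD rather than a PCA library, and a made-up cloud (a large shared offset plus small noise) standing in for real teacher outputs.

```python
import numpy as np

def pca_project(E, d_out):
    """Center E and project onto its top d_out principal components."""
    mu = E.mean(axis=0, keepdims=True)
    centered = E - mu
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:d_out].T                      # (d, d_out)
    return centered @ W, mu, W

def mean_cosine(X):
    """Average cosine similarity over all distinct pairs of rows."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = len(X)
    return (S.sum() - np.trace(S)) / (n * (n - 1))

rng = np.random.default_rng(0)
E = 5.0 + 0.1 * rng.normal(size=(200, 16))   # toy anisotropic cloud
E_proj, mu, W = pca_project(E, d_out=8)
before, after = mean_cosine(E), mean_cosine(E_proj)
```

Before projection almost every pair of rows has cosine similarity close to 1; afterwards the average is close to 0, so similarity scores can actually separate tokens.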
We don’t have a corpus to estimate token frequencies, but the tokenizer’s token id ordering already approximates frequency rank for BPE/WordPiece (more frequent → earlier merge → lower id). So we approximate the distribution by Zipf’s law on rank $r = 1, \dots, |V|$:
\[p_r \;=\; \frac{1/(r+1)}{\sum_{k=1}^{|V|} 1/(k+1)}.\]
Then apply Arora et al.’s SIF (smooth inverse frequency) weight with hyperparameter $a$ (default $10^{-4}$):
\[w_r \;=\; \frac{a}{a + p_r}.\]
Multiply each row: $\mathbf{E}'_v \leftarrow w_{\text{rank}(v)} \cdot \mathbf{E}'_v$.
Detail worth implementing carefully. The original Model2Vec uses `inv_rank = 1 / np.arange(2, n+2)` (so the first token gets weight $1/2$, not $1$). This avoids the singularity that $1/r$ would hit if ranks were taken from zero-based token ids, and gives the head of the distribution a gentler slope. Don’t use $1/r$ directly.
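Put together, the weight vector is a few lines of NumPy. The function name and the vocabulary size of 30,000 are illustrative assumptions.

```python
import numpy as np

def sif_weights(vocab_size, a=1e-4):
    """Zipf-approximated SIF weights, one per token, ordered by token rank."""
    inv_rank = 1.0 / np.arange(2, vocab_size + 2)   # 1/2, 1/3, ..., 1/(|V|+1)
    p = inv_rank / inv_rank.sum()                    # normalized pseudo-frequencies
    return a / (a + p)

w = sif_weights(30_000)
```

The weights increase monotonically with rank: head tokens (assumed frequent) are scaled down hard, while tail tokens keep a weight close to 1.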
When $|V|$ is large (say, 250k for multilingual models), the embedding table dominates model size. Replace it with $K \ll |V|$ centroids via KMeans:
Two implementation subtleties:
The final stored objects are: centroid matrix $\mathbf{C} \in \mathbb{R}^{K \times d'}$, token-to-centroid map $\pi \in \mathbb{Z}^{|V|}$, per-token weight $\mathbf{w} \in \mathbb{R}^{|V|}$.
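A rough sketch of how those three objects could be produced, using a hand-written Lloyd's loop as a stand-in for a library KMeans. The toy table, the random weights, and the choice to cluster the raw rows are all assumptions; the reference library may cluster a differently preprocessed matrix.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; a stand-in for a library KMeans."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every row to every centroid, shape (|V|, k).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        pi = d2.argmin(axis=1)
        for j in range(k):
            members = X[pi == j]
            if len(members):              # leave empty clusters where they are
                C[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return C, d2.argmin(axis=1)

rng = np.random.default_rng(0)
E = rng.normal(size=(500, 8)).astype(np.float32)         # toy table, |V| = 500
w = rng.uniform(0.1, 1.0, size=500).astype(np.float32)   # stand-in per-token weights

C, pi = kmeans(E, k=16)
approx_row = w[42] * C[pi[42]]   # what a lookup of token 42 returns after quantization
```

Keeping $\mathbf{w}$ outside the centroids lets tokens that share a centroid still contribute with different weights.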
Cast the final embedding matrix from float32 to a smaller dtype. Default fp16 halves memory at essentially no quality cost; int8 (with per-row min/max scaling) is also supported. This is the last step because PCA and clustering both want fp32 numerics.
| Dtype | Bytes / weight | Quality impact |
|---|---|---|
| float32 | 4 | Baseline |
| float16 | 2 | Negligible |
| int8 | 1 | Small (<1% on MTEB) |
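
Both casts can be sketched in NumPy. The int8 scheme here is one straightforward reading of "per-row min/max scaling" (an affine map of each row's range onto the int8 range); the reference library's exact scheme may differ, and the helper names are hypothetical.

```python
import numpy as np

def quantize_int8(E):
    """Per-row affine quantization: map [row_min, row_max] onto [-128, 127]."""
    lo = E.min(axis=1, keepdims=True)
    hi = E.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 255.0
    q = np.round((E - lo) / scale) - 128
    return q.astype(np.int8), scale.astype(np.float32), lo.astype(np.float32)

def dequantize_int8(q, scale, lo):
    """Invert quantize_int8 up to rounding error."""
    return (q.astype(np.float32) + 128.0) * scale + lo

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 32)).astype(np.float32)
E16 = E.astype(np.float16)            # fp16: a plain cast
q, scale, lo = quantize_int8(E)       # int8: codes plus per-row scale and offset
E8 = dequantize_int8(q, scale, lo)
```

The int8 variant stores two extra floats per row, which is negligible next to the savings on the row itself.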
The whole point: at query time, encoding a sentence $\mathbf{x}=(x_1,\dots,x_n)$ is
\[\hat{\mathbf{e}}(\mathbf{x}) \;=\; \mathcal{N}\!\Bigg(\frac{1}{n}\sum_{i=1}^{n} w_{x_i} \cdot \mathbf{E}_{\pi(x_i)}\Bigg),\]
where $\mathcal{N}(\cdot)$ is optional L2-normalization, $\pi$ is the identity if no vocab quantization is used, and $w_{x_i}=1$ if SIF weights were folded into the rows during distillation.
That’s three lines of NumPy. No torch, no transformers, just the tokenizer (one HF dependency) and a matrix.
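For the general case with vocab quantization, the three lines could look like this. The tiny centroid matrix, map, and weights are invented so the result can be checked by hand.

```python
import numpy as np

def encode_ids(ids, C, pi, w):
    rows = w[ids][:, None] * C[pi[ids]]      # gather centroids, apply per-token weights
    v = rows.mean(axis=0)                    # average over the sentence
    return v / (np.linalg.norm(v) + 1e-12)   # optional L2 normalization

C = np.array([[1.0, 0.0], [0.0, 1.0]])   # K = 2 centroids, d' = 2
pi = np.array([0, 0, 1, 1])              # |V| = 4 tokens mapped to centroids
w = np.array([1.0, 0.5, 1.0, 2.0])       # per-token weights
ids = np.array([0, 1, 3])                # hypothetical tokenized sentence
v = encode_ids(ids, C, pi, w)            # -> [0.6, 0.8]
```

Without vocab quantization, `pi` is `np.arange(|V|)` and `C` is the table itself, so the first line reduces to a plain row gather.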
Distillation, end-to-end. Error handling is stripped for clarity.
```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer


def distill(model_name: str, pca_dims: int = 256, sif_a: float = 1e-4,
            quantize_to=np.float16) -> dict:
    # --- Step 1: vocab prep ---
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModel.from_pretrained(model_name).eval()
    vocab = tok.get_vocab()                      # {token_str: id}
    tokens = sorted(vocab, key=vocab.get)        # ordered by id
    keep = [t for t in tokens if not t.startswith("[unused")]
    ids = [vocab[t] for t in keep]

    # --- Step 2: one forward pass per token ---
    # Length-sort and batch; here we use trivial batches for clarity.
    E = np.zeros((len(ids), model.config.hidden_size), dtype=np.float32)
    with torch.inference_mode():
        for i in range(0, len(ids), 256):
            batch = ids[i:i+256]
            x = torch.tensor([[tid] for tid in batch])   # shape (B, 1)
            mask = torch.ones_like(x)
            out = model(input_ids=x, attention_mask=mask).last_hidden_state
            # mean pool (B,1,d) -> (B,d)
            E[i:i+len(batch)] = out.mean(dim=1).float().numpy()

    # --- Step 3: PCA ---
    pca = PCA(n_components=pca_dims, svd_solver="full")
    E = pca.fit_transform(E)                     # (|V|, d')

    # --- Step 4: SIF / Zipf weights ---
    rank = np.arange(2, len(E) + 2)              # avoid 1/0 at r=0
    inv = 1.0 / rank
    p = inv / inv.sum()
    w = sif_a / (sif_a + p)                      # (|V|,)
    E = E * w[:, None]                           # fold into rows

    # --- Step 6: dtype quantize (skipping Step 5 for brevity) ---
    E = E.astype(quantize_to)
    # Rows follow `keep`, not raw token ids, so record the id -> row mapping.
    id_to_row = {tid: row for row, tid in enumerate(ids)}
    return {"embedding": E, "tokens": keep, "tokenizer": tok,
            "id_to_row": id_to_row}


def encode(static: dict, sentences: list[str]) -> np.ndarray:
    tok, E = static["tokenizer"], static["embedding"].astype(np.float32)
    id_to_row = static["id_to_row"]
    out = []
    for s in sentences:
        ids = tok(s, add_special_tokens=False)["input_ids"]
        # Map tokenizer ids to table rows; ids filtered out in Step 1 are skipped.
        rows = [id_to_row[t] for t in ids if t in id_to_row]
        v = E[rows].mean(axis=0) if rows else np.zeros(E.shape[1], dtype=np.float32)
        out.append(v / (np.linalg.norm(v) + 1e-12))
    return np.stack(out)
```
That’s it. ~50 lines for distill, ~10 for encode. To match the reference library, add:
jinaai/jina-embeddings-v2-base-code). Pass a code-specific vocabulary (identifiers, operators) to bias toward whole-token lookups.
What Model2Vec is not good at. Anything that requires word-order or syntactic information: NLI, certain reranking tasks, long-range coreference. Bag-of-(weighted)-embeddings throws away order by construction. If your downstream task is order-sensitive, this technique caps your quality.
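The order-blindness is easy to demonstrate with a toy table. The ids standing for "dog", "bites", and "man" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))        # toy static table

def embed(ids):
    """Mean of the looked-up rows; the order of ids cannot matter."""
    return E[ids].mean(axis=0)

dog_bites_man = [0, 1, 2]          # hypothetical ids for "dog", "bites", "man"
man_bites_dog = [2, 1, 0]
same = np.allclose(embed(dog_bites_man), embed(man_bites_dog))
```

Both sentences map to the same vector, so no downstream classifier can tell them apart.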
Reference: github.com/MinishLab/model2vec · SIF: Arora, Liang, Ma 2017 · Anisotropy: Ethayarajh 2019