Eleven years of geometric thinking about embedding spaces, distilled into eight papers. From Mikolov's linear analogies (2013) to Bricken et al.'s sparse autoencoders (2023): discover, fix, diagnose, exploit. The throughline is one conjecture refined over a decade: high-level concepts are encoded as directions in representation space.
A reading path: from linear analogies to sparse dictionaries.
The eight papers, in chronological order:
The throughline of this literature is a single conjecture, refined over a decade: high-level concepts are encoded as directions in a model’s representation space. Each paper either provides evidence for the conjecture, diagnoses a way it appears broken, fixes the breakage, or exploits the corrected geometry.
Reading these papers in order isn’t just history; it’s the cleanest pedagogy. Each one is a response to a question raised by the previous, and the methods compound: ABTT lives inside SIF, SIF’s intuition lives inside Ethayarajh’s measurements, all of it lives inside Zou et al.’s RepE pipeline, and SAEs generalize the whole apparatus from “find one direction” to “find a basis.”
Mikolov, Chen, Corrado, Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. Mikolov, Yih, Zweig. Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.
Word vectors learned by simple log-linear models encode semantic and syntactic relationships as approximately constant translations in $\mathbb{R}^d$. Analogies reduce to vector arithmetic.
Two architectures:
Both train a shallow log-linear model on billions of words; embedding dimensionality is a hyperparameter (typically 100–1000).
Given an analogy “a is to a* as b is to ?”, solve:
\[b^* \;=\; \underset{w \in V}{\arg\max}\; \cos\big(\mathbf{v}_w,\; \mathbf{v}_{a^*} - \mathbf{v}_a + \mathbf{v}_b\big).\]

So $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}$ lands near $\mathbf{v}_{\text{queen}}$. Tense-of-verb, country-to-capital, comparative-to-superlative, and plural-to-singular relations are all approximately constant offsets.
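The argmax above can be sketched in a few lines of NumPy. This is a minimal illustration, not the papers' evaluation code: the 2-D toy vectors are invented for the example, and excluding the three query words from the candidates is a common convention assumed here rather than something the formula itself states.

```python
import numpy as np

def solve_analogy(vectors, a, a_star, b):
    """Return the word w maximizing cos(v_w, v_{a*} - v_a + v_b)."""
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    target = vectors[a_star] - vectors[a] + vectors[b]
    target = target / np.linalg.norm(target)

    scores = matrix @ target  # cosine similarity to every vocabulary word
    # Assumption: the query words themselves are not allowed as answers.
    for w in (a, a_star, b):
        scores[words.index(w)] = -np.inf
    return words[int(np.argmax(scores))]

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}
analogy_answer = solve_analogy(toy_vectors, "man", "woman", "king")
print(analogy_answer)
```

With real embeddings the offset is only approximately constant, so the nearest neighbor is not guaranteed to be the intended word.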
“the vector from man to woman is approximately equal to the vector from king to queen”
This is the founding empirical claim of the entire field: directions in embedding space carry meaning. Every subsequent paper either exploits this, finds where it breaks, or formalizes it. Without the analogy phenomenon, none of the post-hoc corrections (SIF, ABTT) or modern interventions (steering, SAE features) would have a target.
Arora, Liang, Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017.
A frequency-weighted mean of word vectors, with its dominant principal component removed, beats supervised RNN/LSTM sentence encoders on textual-similarity tasks. No training required.
Given a corpus with unigram probabilities $p(w)$, vectors $\mathbf{v}_w$, and SIF parameter $a$ (typically $10^{-3}$ to $10^{-4}$):
For each sentence $s$, compute the weighted average:
\[\mathbf{v}_s \;=\; \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \cdot \mathbf{v}_w.\]

Compute the first principal component $\mathbf{u}$ of the resulting set of sentence vectors, then project it out:

\[\mathbf{v}_s \;\leftarrow\; \mathbf{v}_s - \mathbf{u}\mathbf{u}^\top \mathbf{v}_s.\]

The method is derived from a latent-discourse generative model: each word is emitted given a slowly-varying “discourse vector” $\mathbf{c}_s$ plus a smoothing term that produces stopwords irrespective of context. The MLE for $\mathbf{c}_s$ under this model is, to a first-order approximation and up to scaling, the SIF weighted average; the first principal component captures the “common discourse” shared across all sentences (function words, syntactic noise) and is subtracted out.
The SIF weighting improves performance by about 10–30% on textual-similarity tasks (STS 2012–2015) and beats supervised RNN/LSTM baselines. The weight $a/(a+p(w))$ down-weights stopwords; the PC1 subtraction removes a syntactic-frequency dimension shared by all sentence vectors.
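The two SIF steps (weighted average, then removal of the dominant component) can be sketched as follows. The vocabulary, probabilities, and random vectors are made up for illustration, and taking $\mathbf{u}$ as the first right singular vector of the uncentered sentence matrix is an assumption of this sketch.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_prob, a=1e-3):
    """Weighted-average sentence vectors with their first component removed."""
    rows = []
    for sent in sentences:
        # Weight a / (a + p(w)) down-weights frequent words.
        weighted = [a / (a + word_prob[w]) * vectors[w] for w in sent]
        rows.append(np.mean(weighted, axis=0))
    X = np.stack(rows)  # shape (num_sentences, d)

    # First right singular vector of the sentence matrix (unit norm).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]

    # v_s <- v_s - u u^T v_s, applied to every row at once.
    return X - np.outer(X @ u, u), u

# Hypothetical toy corpus.
rng = np.random.default_rng(0)
sif_vocab = ["the", "cat", "sat", "dog", "ran"]
sif_vectors = {w: rng.normal(size=4) for w in sif_vocab}
sif_prob = {"the": 0.5, "cat": 0.1, "sat": 0.1, "dog": 0.2, "ran": 0.1}
sif_sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

sif_vecs, sif_u = sif_embeddings(sif_sentences, sif_vectors, sif_prob)
```

After the projection, every sentence vector is orthogonal to `sif_u`, which is the property the second equation is meant to guarantee.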
Connection to Model2Vec. The SIF weight is precisely the weighting Model2Vec applies to its static token embedding table (Step 4). Model2Vec also applies PCA (Step 3) but keeps the top components rather than removing them. This is a deliberate divergence: Model2Vec is correcting embeddings of single tokens, not of sentences, so the “common discourse” interpretation doesn’t transfer directly.
Mu, Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. arXiv:1702.01417, ICLR 2018.
Off-the-shelf word embeddings (word2vec, GloVe) have (a) a large common mean and (b) a few dominant directions that correlate with token frequency, not meaning. Removing them makes embeddings more isotropic and uniformly stronger on downstream tasks.
Given embeddings $\{\mathbf{v}(w)\}_{w \in V} \subset \mathbb{R}^d$:
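1. Center. Compute $\boldsymbol{\mu} = \frac{1}{|V|}\sum_w \mathbf{v}(w)$ and set $\tilde{\mathbf{v}}(w) = \mathbf{v}(w) - \boldsymbol{\mu}$.
2. Compute the principal components $\mathbf{u}_1, \dots, \mathbf{u}_d$ of the centered vectors $\tilde{\mathbf{v}}(w)$.
3. Remove the top $D$ components: $\mathbf{v}'(w) = \tilde{\mathbf{v}}(w) - \sum_{i=1}^{D} \big(\mathbf{u}_i^\top \tilde{\mathbf{v}}(w)\big)\, \mathbf{u}_i$.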
Recommended $D \approx d/100$ (so $D=3$ for 300-dim GloVe). Crucially: keep everything except the top few PCs, the opposite of standard dimensionality reduction.
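A minimal NumPy sketch of this postprocessing, assuming the embeddings are stacked as rows of a matrix and that the dominant directions are taken from an SVD of the centered matrix. The random data with an artificial shared offset is purely illustrative.

```python
import numpy as np

def all_but_the_top(embeddings, n_remove):
    """Center the embeddings, then project out the top n_remove principal directions."""
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu

    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_remove]  # shape (D, d), orthonormal rows

    processed = centered - (centered @ top.T) @ top
    return processed, top

# Hypothetical toy embeddings: 50 "words" in 10 dimensions with a large common mean.
rng = np.random.default_rng(0)
abtt_raw = rng.normal(size=(50, 10)) + 5.0
abtt_processed, abtt_top = all_but_the_top(abtt_raw, n_remove=2)
```

Note that the output keeps the original dimensionality; only the variance along the removed directions is zeroed out.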
“a simple, and yet counter-intuitive, postprocessing technique” that eliminates “the common mean vector and a small set of dominating directions.”
Consistent gains across word similarity, concept categorization, analogy, semantic textual similarity, and text classification, across multiple languages, with no retraining.
ABTT is the canonical fix for anisotropy in static embeddings. The next question is obvious: do contextual embeddings, produced by BERT, ELMo, GPT-2, have the same pathology?
Ethayarajh. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. arXiv:1909.00512, EMNLP 2019.
Contextualized representations are highly anisotropic: they occupy a narrow cone in vector space, and the anisotropy increases sharply in upper layers, especially for GPT-2.
| Quantity | Definition | Intuition |
|---|---|---|
| Anisotropy$(\ell)$ | $\mathbb{E}[\cos(h_i^\ell, h_j^\ell)]$ for random $(i,j)$ from different contexts | Should be $\approx 0$ in isotropic space; the higher, the worse. |
| SelfSim$(w)$ | Mean cosine between embeddings of word $w$ in different contexts, baseline-corrected | How stable is $w$’s identity across contexts? |
| IntraSim$(s)$ | Mean cosine between different words in sentence $s$, baseline-corrected | How much do words in the same sentence look alike? |
| Model | Layer | Anisotropy baseline |
|---|---|---|
| BERT-base | 1 → 12 | ~0.2 → ~0.45 |
| ELMo | top | ~0.80 |
| GPT-2 small | 12 (last) | ~0.99 |
A baseline of 0.99 means any two random word embeddings have cosine similarity near 1. The entire embedding cloud lives inside a needle-thin cone in $\mathbb{R}^d$.
“In all layers of BERT, ELMo, and GPT-2, the representations of all words are anisotropic.” Less than 5% of variance in a word’s contextualized representations is explained by a static (mean) embedding.
When reporting similarities, always subtract the anisotropy baseline:
\[\cos_{\text{adj}}(a, b) \;=\; \cos(a, b) - \text{Anisotropy}(\ell).\]

Otherwise your “similarity scores” are mostly measuring the cone, not the signal.
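The baseline and the adjusted cosine can be estimated along these lines. This sketch averages over all distinct pairs in a sample matrix rather than sampling words from different contexts, and it simulates a "cone" by adding a shared offset to Gaussian vectors; both are simplifying assumptions.

```python
import numpy as np

def anisotropy_baseline(H):
    """Mean cosine similarity over all distinct pairs of rows in H."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    cos = Hn @ Hn.T
    off_diagonal = cos[~np.eye(len(H), dtype=bool)]
    return float(off_diagonal.mean())

def adjusted_cosine(a, b, baseline):
    """Raw cosine minus the layer's anisotropy baseline."""
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos - baseline

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(200, 64))
cone = isotropic + 10.0  # hypothetical shared offset pushes every vector the same way

baseline_iso = anisotropy_baseline(isotropic)
baseline_cone = anisotropy_baseline(cone)
adjusted = adjusted_cosine(cone[0], cone[1], baseline_cone)
```

In the cone case the raw cosine between two unrelated vectors is close to 1, while the adjusted value sits near 0.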
Timkey, van Schijndel. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. arXiv:2109.04404, EMNLP 2021.
Anisotropy isn’t diffuse; it’s localized. A handful of “rogue” dimensions (often 1–3) with very large magnitude and high variance dominate cosine similarity between hidden states, swamping every other axis.
For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, write the unnormalized similarity as a sum of per-dimension contributions:
\[\mathbf{x} \cdot \mathbf{y} \;=\; \sum_{k=1}^d x_k y_k.\]

Empirically, in BERT/GPT-2 hidden states, one or two terms in this sum account for the majority of the total. Removing them drops cosine values toward zero, exactly where they should be for unrelated tokens.
Per-dimension standardization (z-scoring) before computing similarities:
\[x'_k \;=\; \frac{x_k - \mu_k}{\sigma_k}, \qquad \cos_{\text{std}}(a, b) \;=\; \cos(a', b').\]

This neutralizes rogue dimensions’ magnitude advantage without changing the model.
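The per-dimension decomposition and the z-scoring fix can be illustrated on synthetic data. The "rogue" dimension here (index 7, with a large mean) is planted by hand; real hidden states would come from a model, which this sketch does not load.

```python
import numpy as np

def cosine_contributions(x, y):
    """Per-dimension terms x_k * y_k / (|x| |y|); they sum to cos(x, y)."""
    return x * y / (np.linalg.norm(x) * np.linalg.norm(y))

def standardize(H):
    """Z-score each dimension using statistics over the sample of hidden states."""
    return (H - H.mean(axis=0)) / H.std(axis=0)

def mean_pairwise_cosine(H):
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    cos = Hn @ Hn.T
    return float(cos[~np.eye(len(H), dtype=bool)].mean())

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 32))
states[:, 7] = 40.0 + 3.0 * rng.normal(size=500)  # hypothetical rogue dimension

contrib = cosine_contributions(states[0], states[1])
rogue_index = int(np.argmax(contrib))

raw_similarity = mean_pairwise_cosine(states)
std_similarity = mean_pairwise_cosine(standardize(states))
```

Before standardization a single coordinate carries almost the whole cosine; afterwards the average similarity between unrelated vectors falls back toward zero.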
“a small number of rogue dimensions, often just 1–3, dominate these measures”, and there is “a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model.”
Rogue dimensions matter for similarity metrics (cosine, dot product), but they are not the dimensions most important to the model’s downstream behavior. Ablating them barely changes model outputs. This dissociation foreshadows Park, Choe, Veitch: the “probing direction” and the “intervention direction” can be different things, and you need a principled framework to relate them.
Park, Choe, Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv:2311.03658, ICML 2024.
“Concept = direction” can be made precise. Three a priori distinct notions of linear representation (as a subspace, as a probe/measurement, and as an intervention) coincide under a particular non-Euclidean inner product called the causal inner product.
In Euclidean geometry, there’s no reason these should be the same direction: they’re conceptually different objects (a 1-D subspace lives in the representation space, a probe is a dual vector, an intervention is a tangent direction). Empirically, they often disagree.
The paper constructs an inner product $\langle \cdot, \cdot \rangle_C$ on representation space that:
Concept directions are constructed from counterfactual pairs (e.g., male↔female embeddings), and the causal inner product is derived from the model’s own statistics (essentially a whitening of the unembedding covariance).
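One way to read "a whitening of the unembedding covariance" is sketched below. It assumes the inner product takes the form $\langle \mathbf{g}_1, \mathbf{g}_2 \rangle_C = \mathbf{g}_1^\top \mathrm{Cov}(\gamma)^{-1} \mathbf{g}_2$, with the covariance taken over the rows of the unembedding matrix; that form, the function names, and the random stand-in matrix are assumptions of this example rather than details confirmed by the text above.

```python
import numpy as np

def causal_inner_product(g1, g2, unembeddings):
    """g1^T Cov^{-1} g2, with Cov estimated over the unembedding rows (assumed form)."""
    cov = np.cov(unembeddings, rowvar=False)
    return float(g1 @ np.linalg.solve(cov, g2))

def whitening_map(unembeddings):
    """Symmetric matrix A = Cov^{-1/2}; Euclidean dot products after A match the form above."""
    cov = np.cov(unembeddings, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

# Hypothetical stand-in for an unembedding matrix: 200 "tokens", 6 dimensions,
# with very different scales per dimension.
rng = np.random.default_rng(0)
unembed = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 1.0, 1.0, 0.5, 0.1])

# Counterfactual-pair style directions (differences of unembedding rows).
g1 = unembed[0] - unembed[1]
g2 = unembed[2] - unembed[3]

A = whitening_map(unembed)
euclid_after_whitening = float((A @ g1) @ (A @ g2))
causal = causal_inner_product(g1, g2, unembed)
```

The point of the sketch is only that a non-Euclidean inner product of this kind is equivalent to an ordinary dot product after a linear change of coordinates.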
“high-level concepts are represented linearly as directions in some representation space”, and the causal inner product “respects language structure in a sense we make precise.”
On LLaMA-2, concept directions constructed under the causal inner product are simultaneously good probes and good interventions. Directions constructed under the default Euclidean inner product are not.
The conceptual upshot. The “rogue dimensions” mismatch from Timkey and van Schijndel isn’t a bug; it’s evidence that the right geometry isn’t Euclidean. Park et al. give a constructive recipe for the right geometry. This is the theoretical license for everything that follows.
Zou, Phan, Chen, Campbell, Guo, Pang, Hendrycks et al. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
Treat population-level neural activations, not individual neurons or circuits, as the primary unit of analysis. High-level concepts (honesty, power-seeking, fairness, emotion) correspond to linear directions in activation space that can be both read (for interpretability) and written (for control).
The core recipe for finding a concept direction:
Read (probing): project a new hidden state onto $\mathbf{v}$ to score it for the concept:
\[\text{score}(\mathbf{h}) \;=\; \mathbf{v}^\top \mathbf{h}.\]

Write (steering): at inference, intervene at the chosen layer with

\[\mathbf{h} \;\leftarrow\; \mathbf{h} + \alpha \cdot \mathbf{v},\]

where $\alpha$ is a signed scalar: positive injects the concept, negative suppresses it.
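The read and write operations are simple enough to sketch on synthetic activations. The direction-finding step here (first singular vector of paired differences between positive and negative stimuli, sign-aligned so positive stimuli score higher) is one plausible stand-in and is an assumption of this example; the planted concept axis and noise levels are invented.

```python
import numpy as np

def concept_direction(h_pos, h_neg):
    """Unit direction from paired activation differences (assumed recipe)."""
    diffs = h_pos - h_neg
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # Orient v so that positive stimuli score higher than negative ones.
    if (diffs @ v).mean() < 0:
        v = -v
    return v

def read_score(h, v):
    """Probe: projection of a hidden state onto the concept direction."""
    return float(v @ h)

def steer(h, v, alpha):
    """Intervention: shift the hidden state along the concept direction."""
    return h + alpha * v

# Hypothetical activations: a shared context plus/minus a planted concept axis.
rng = np.random.default_rng(0)
true_axis = np.zeros(16)
true_axis[0] = 1.0
context = rng.normal(size=(100, 16))
h_pos = context + 3.0 * true_axis + 0.1 * rng.normal(size=(100, 16))
h_neg = context - 3.0 * true_axis + 0.1 * rng.normal(size=(100, 16))

v = concept_direction(h_pos, h_neg)
h = context[0]
score_before = read_score(h, v)
score_after = read_score(steer(h, v, alpha=2.0), v)
```

Because `v` has unit norm, steering by `alpha` raises the read score by exactly `alpha`, which makes the link between the probe and the intervention explicit in this toy setting.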
On LLaMA-2-13B, LAT-based honesty steering improves TruthfulQA MC1 from 35.7% → 65.6%, a near-doubling, with no fine-tuning. The same recipe controls morality, emotion, power-seeking, fairness, and bias, and detects lying with high AUC.
“a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more” can be addressed by reading and editing representations.
RepE is the engineering manifesto for the linear representation hypothesis. It demonstrates that you can get useful behavior changes with a one-line intervention: no gradient updates, no RLHF, no curated dataset. The cost: each $\mathbf{v}$ requires labeled contrastive prompts, so the method scales linearly with the number of supervised concepts you want.
Bricken, Templeton, Batson, Chen, Jermyn, Conerly, Turner, Anil, Denison, Askell, Grosse, McCandlish, Kaplan, Amodei, Wattenberg, Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. transformer-circuits.pub, October 2023.
Individual neurons in transformers are polysemantic: each neuron responds to mixtures of unrelated inputs. The reason is superposition: the model packs more features than it has dimensions by representing them as overlapping linear combinations of neurons. A sparse autoencoder (SAE) trained on activations can recover an over-complete, sparse, approximately monosemantic feature basis.
Given activations $\mathbf{x} \in \mathbb{R}^d$ from some layer (e.g., the 512-dim MLP output of a 1-layer transformer):
\[\begin{aligned} \mathbf{f}(\mathbf{x}) &= \text{ReLU}\big(\mathbf{W}_e (\mathbf{x} - \mathbf{b}_d) + \mathbf{b}_e\big) \in \mathbb{R}^F \\ \hat{\mathbf{x}} &= \mathbf{W}_d \mathbf{f}(\mathbf{x}) + \mathbf{b}_d \\ \mathcal{L} &= \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 \;+\; \lambda \|\mathbf{f}(\mathbf{x})\|_1 \end{aligned}\]

Key design choices:
Each column $\mathbf{W}_d[:, k]$ is a learned “feature direction” in the original activation space. The corresponding scalar $\mathbf{f}(\mathbf{x})_k$ is the feature’s activation. Bricken et al. find that for the right choice of $\lambda$ and $F$, individual features cleanly correspond to interpretable concepts:
In their A/1 run (4,096 features on a 512-dim 1-layer transformer, an 8× overcomplete dictionary), a random sample of 162 features is evaluated and the vast majority are judged interpretable.
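The encoder, decoder, and loss defined above translate directly into a forward pass. This sketch uses small random weights and toy sizes (8 input dimensions, 64 features, keeping the 8× ratio) and omits training entirely; the initialization scale and the value of $\lambda$ are arbitrary choices for illustration.

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d):
    """Encode x into non-negative feature activations f, then reconstruct x_hat."""
    f = np.maximum(0.0, W_e @ (x - b_d) + b_e)  # ReLU encoder, shape (F,)
    x_hat = W_d @ f + b_d                        # linear decoder, shape (d,)
    return f, x_hat

def sae_loss(x, f, x_hat, lam):
    """Squared reconstruction error plus an L1 penalty on the feature activations."""
    return float(np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f)))

rng = np.random.default_rng(0)
d, F = 8, 64  # toy sizes; the run described above uses d=512, F=4096
W_e = rng.normal(scale=0.1, size=(F, d))
b_e = np.zeros(F)
W_d = rng.normal(scale=0.1, size=(d, F))
b_d = np.zeros(d)

x = rng.normal(size=d)
f, x_hat = sae_forward(x, W_e, b_e, W_d, b_d)
loss = sae_loss(x, f, x_hat, lam=1e-3)

# Column k of W_d is the direction of feature k in activation space.
feature_direction_0 = W_d[:, 0]
```

With untrained weights the reconstruction is poor; the sketch only fixes the shapes and the roles of each term in the objective.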
Training two SAEs independently on the same model recovers the same features in both, strong evidence that these are properties of the model, not artifacts of the SAE.
“decompose the activations of a one-layer transformer into features that are individually interpretable”, superposition causes “neurons to be polysemantic, activating in response to mixtures of unrelated inputs.”
Everything from Mikolov to Zou et al. was about supervised direction-finding: you know what concept you’re looking for, and you craft data to surface it. SAEs do this unsupervised and at scale: train one model, get thousands of feature directions, each interpretable by inspection. The “linear representation hypothesis” finally has a constructive recipe that scales.
If you read these eight papers as one argument, it goes like this:
| Operation | Where it appears |
|---|---|
| Mean subtraction | SIF, ABTT, Ethayarajh’s baseline correction, Timkey’s standardization |
| PCA / SVD of activations | SIF, ABTT, RepE’s LAT, SAE initialization |
| Pairwise contrastive prompts | Mikolov’s analogy tests, Park’s counterfactual pairs, RepE’s $S^+ / S^-$ stimuli |
| “Direction = concept” | All eight |
sae library, the sae_lens package) train in a few hours on a single GPU for small models. Train one on GPT-2 small. Inspect features.

All page references and formulas verified against arXiv abstracts and OpenReview entries; the SAE details are standard and consistent with the public Anthropic write-up.