Representation Analysis, A Reading Path

Eleven years of geometric thinking about embedding spaces, distilled into eight papers. From Mikolov's linear analogies (2013) to Bricken et al.'s sparse autoencoders (2023), the program runs in four phases: discover, fix, diagnose, exploit. The throughline is one conjecture refined over a decade: high-level concepts are encoded as directions in representation space.

A reading path: from linear analogies to sparse dictionaries.

The eight papers, in chronological order:

Mikolov et al., Linear analogies (2013)
Arora, Liang, Ma, SIF sentence embeddings (2017)
Mu & Viswanath, All-but-the-Top (2018)
Ethayarajh, Geometry of contextual embeddings (2019)
Timkey & van Schijndel, Rogue dimensions (2021)
Park, Choe, Veitch, Linear representation hypothesis (2023)
Zou et al., Representation engineering (2023)
Bricken et al., Sparse autoencoders / monosemanticity (2023)

The arc

The throughline of this literature is a single conjecture, refined over a decade: high-level concepts are encoded as directions in a model’s representation space. Each paper either provides evidence for the conjecture, diagnoses a way it appears broken, fixes the breakage, or exploits the corrected geometry.

The four phases of the program: discover (concepts as directions), fix (post-hoc geometric correction), diagnose (the same problems in contextual models), exploit (read and write concepts directly).

Reading these papers in order isn’t just history; it’s the cleanest pedagogy. Each one is a response to a question raised by the previous, and the methods compound: ABTT lives inside SIF, SIF’s intuition lives inside Ethayarajh’s measurements, all of it lives inside Zou et al.’s RepE pipeline, and SAEs generalize the whole apparatus from “find one direction” to “find a basis.”

Mikolov 2013: Linear analogies

Mikolov, Chen, Corrado, Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. Mikolov, Yih, Zweig. Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.

Thesis

Word vectors learned by simple log-linear models encode semantic and syntactic relationships as approximately constant translations in $\mathbb{R}^d$. Analogies reduce to vector arithmetic.

Method

Two architectures:

CBOW: predict the center word from a window of context words.
Skip-gram: predict context words from the center word.

Both train a shallow log-linear model on billions of words; embedding dimensionality is a hyperparameter (typically 100–1000).

The famous trick

Given an analogy “a is to a* as b is to ?”, solve:

\[b^* \;=\; \underset{w \in V}{\arg\max}\; \cos\big(\mathbf{v}_w,\; \mathbf{v}_{a^*} - \mathbf{v}_a + \mathbf{v}_b\big).\]

So $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}$ lands near $\mathbf{v}_{\text{queen}}$. Tense-of-verb, country-to-capital, comparative-to-superlative, and plural-to-singular relations all appear as approximately constant offsets.

“the vector from man to woman is approximately equal to the vector from king to queen”
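The arg max above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical 2-D toy vectors; `solve_analogy` and `toy_vectors` are names made up here, not taken from the papers. It also assumes the common evaluation convention of excluding the three query words from the candidate set, which the formula as written does not state.

```python
import numpy as np

def solve_analogy(vectors, a, a_star, b):
    """Return the word w maximizing cos(v_w, v_{a*} - v_a + v_b).

    Assumption: the query words a, a*, b are excluded from the candidates.
    """
    target = vectors[a_star] - vectors[a] + vectors[b]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue
        score = float(vec @ target / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

# "man is to woman as king is to ?"
answer = solve_analogy(toy_vectors, "man", "woman", "king")
print(answer)
```

In this toy space the offset from man to woman is exactly the offset from king to queen, so the target vector coincides with the queen vector. Real embeddings only satisfy this approximately.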

Why it matters for what follows

This is the founding empirical claim of the entire field: directions in embedding space carry meaning. Every subsequent paper either exploits this, finds where it breaks, or formalizes it. Without the analogy phenomenon, none of the post-hoc corrections (SIF, ABTT) or modern interventions (steering, SAE features) would have a target.

Arora, Liang, Ma 2017: SIF

Arora, Liang, Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017.

Thesis

A frequency-weighted mean of word vectors, with its dominant principal component removed, beats supervised RNN/LSTM sentence encoders on textual-similarity tasks. No training required.

Method (SIF: Smooth Inverse Frequency)

Given a corpus with unigram probabilities $p(w)$, vectors $\mathbf{v}_w$, and SIF parameter $a$ (typically $10^{-3}$ to $10^{-4}$):

For each sentence $s$, compute the weighted average:
\[\mathbf{v}_s \;=\; \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \cdot \mathbf{v}_w.\]
Stack all $\{\mathbf{v}_s\}$ into a matrix and compute its first principal component $\mathbf{u}$.
Project it out:
\[\mathbf{v}_s \;\leftarrow\; \mathbf{v}_s - \mathbf{u}\mathbf{u}^\top \mathbf{v}_s.\]
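The three steps above can be sketched as follows. This is a minimal illustration, not the authors' reference code: the function name `sif_embeddings`, the toy vocabulary, and the toy probabilities are all hypothetical, and the first principal component is taken here as the top right singular vector of the uncentered sentence matrix, which is an assumption about how "first principal component" is computed.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_prob, a=1e-3):
    """Return (sentence embeddings with the top component removed, the component u)."""
    rows = []
    for sent in sentences:
        # Step 1: SIF-weighted average, weight a / (a + p(w)) per word.
        weights = np.array([a / (a + word_prob[w]) for w in sent])
        vecs = np.stack([vectors[w] for w in sent])
        rows.append((weights[:, None] * vecs).sum(axis=0) / len(sent))
    X = np.stack(rows)
    # Step 2: dominant direction of the stacked sentence vectors
    # (assumption: uncentered SVD).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # Step 3: project u out of every sentence vector.
    return X - np.outer(X @ u, u), u

# Hypothetical toy data.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran", "fast"]
toy_vectors = {w: rng.normal(size=4) for w in vocab}
toy_prob = {"the": 0.5, "cat": 0.1, "sat": 0.1, "dog": 0.1, "ran": 0.1, "fast": 0.1}
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"], ["the", "cat", "ran"]]

emb, u = sif_embeddings(toy_sentences, toy_vectors, toy_prob)
print(emb.shape)
```

With $a = 10^{-3}$, a word with $p(w) = 0.5$ gets weight of roughly 0.002 while a word with $p(w) = 0.1$ gets roughly 0.01, so frequent words contribute far less to the average.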

Theoretical justification

Derived from a latent-discourse generative model: each word is emitted given a slowly-varying “discourse vector” $\mathbf{c}_s$ plus a smoothing term that produces stopwords irrespective of context. The MLE for $\mathbf{c}_s$ under this model is approximately (up to scaling) the SIF weighted average; the first principal component captures the “common discourse” shared across all sentences (function words, syntactic noise) and is subtracted out.

Headline finding

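The weighting improves performance by about 10–30% over unweighted averaging on textual-similarity tasks (STS 2012–2015), and the full method beats supervised RNN/LSTM baselines. The weight $a/(a+p(w))$ down-weights stopwords; the PC1 subtraction removes a syntactic-frequency dimension shared by all sentence vectors.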

Connection to Model2Vec. The SIF weight is precisely the weighting Model2Vec applies to its static token embedding table (Step 4). Model2Vec also applies PCA (Step 3) but keeps the top components rather than removing them. The divergence is deliberate: Model2Vec is correcting embeddings of single tokens, not of sentences, so the “common discourse” interpretation doesn’t transfer directly.

Mu & Viswanath 2018: All-but-the-Top

Mu, Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. arXiv:1702.01417, ICLR 2018.

Thesis

Off-the-shelf word embeddings (word2vec, GloVe) have (a) a large common mean and (b) a few dominant directions that correlate with token frequency, not meaning. Removing them makes embeddings more isotropic and uniformly stronger on downstream tasks.

Method (ABTT)

Given embeddings $\{\mathbf{v}(w)\}_{w \in V} \subset \mathbb{R}^d$:

Center. Compute $\boldsymbol{\mu} = \frac{1}{|V|}\sum_w \mathbf{v}(w)$ and set $\tilde{\mathbf{v}}(w) = \mathbf{v}(w) - \boldsymbol{\mu}$.
PCA. Compute the top $D$ principal components $\mathbf{u}_1, \dots, \mathbf{u}_D$ of $\{\tilde{\mathbf{v}}(w)\}$.
Project out. $\mathbf{v}'(w) = \tilde{\mathbf{v}}(w) - \sum_{i=1}^{D} (\mathbf{u}_i^\top \tilde{\mathbf{v}}(w))\, \mathbf{u}_i$.

Recommended $D \approx d/100$ (so $D=3$ for 300-dim GloVe). Crucially: keep everything except the top few PCs, the opposite of standard dimensionality reduction.

ABTT removes the common mean and a small number of dominating directions. The resulting cloud is roughly spherical around the origin.
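The center / PCA / project-out recipe can be sketched in NumPy as below. The function name `all_but_the_top` and the synthetic embedding matrix are hypothetical; the principal components are obtained from an SVD of the centered matrix, which is one standard way to compute them.

```python
import numpy as np

def all_but_the_top(E, D):
    """Remove the common mean and the top-D principal directions from E (|V| x d)."""
    mu = E.mean(axis=0)
    centered = E - mu                      # Center
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:D]                             # PCA: top-D directions, shape (D, d)
    return centered - centered @ U.T @ U   # Project out

# Hypothetical embeddings with a large common offset.
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 10)) + 5.0
processed = all_but_the_top(E, D=2)
print(processed.shape)
```

The output keeps the original dimensionality $d$; only the variance along the removed directions is zeroed, which is why this is the opposite of PCA-based dimensionality reduction.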

“a simple, and yet counter-intuitive, postprocessing technique” that eliminates “the common mean vector and a small set of dominating directions.”

Headline finding

Consistent gains across word similarity, concept categorization, analogy, semantic textual similarity, and text classification, across multiple languages, with no retraining.

The conceptual handoff

ABTT is the canonical fix for anisotropy in static embeddings. The next question is obvious: do contextual embeddings (those produced by BERT, ELMo, and GPT-2) have the same pathology?

Ethayarajh 2019: Geometry of contextual embeddings

Ethayarajh. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. arXiv:1909.00512, EMNLP 2019.

Thesis

Contextualized representations are highly anisotropic: they occupy a narrow cone in vector space, and the anisotropy increases sharply in upper layers, especially for GPT-2.

Three measurements

Quantity	Definition	Intuition
Anisotropy$(\ell)$	$\mathbb{E}[\cos(h_i^\ell, h_j^\ell)]$ for random $(i,j)$ from different contexts	Should be $\approx 0$ in isotropic space; the higher, the worse.
SelfSim$(w)$	Mean cosine between embeddings of word $w$ in different contexts, baseline-corrected	How stable is $w$’s identity across contexts?
IntraSim$(s)$	Mean cosine between different words in sentence $s$, baseline-corrected	How much do words in the same sentence look alike?

Headline numbers

Model	Layer	Anisotropy baseline
BERT-base	1 → 12	~0.2 → ~0.45
ELMo	top	~0.80
GPT-2 small	12 (last)	~0.99

A baseline of 0.99 means any two random word embeddings have cosine similarity near 1. The entire embedding cloud lives inside a needle-thin cone in $\mathbb{R}^d$.

“In all layers of BERT, ELMo, and GPT-2, the representations of all words are anisotropic.” Less than 5% of variance in a word’s contextualized representations is explained by a static (mean) embedding.

The pragmatic recommendation

When reporting similarities, always subtract the anisotropy baseline:

\[\cos_{\text{adj}}(a, b) \;=\; \cos(a, b) - \text{Anisotropy}(\ell).\]

Otherwise your “similarity scores” are mostly measuring the cone, not the signal.
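A rough sketch of the baseline correction, under stated assumptions: the anisotropy of a layer is estimated as the mean cosine over randomly sampled pairs of hidden states, and the synthetic "cone" below (isotropic noise plus a large shared offset) stands in for real model activations. The names `anisotropy_baseline` and `adjusted_cosine` are illustrative.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def anisotropy_baseline(H, n_pairs=2000, seed=0):
    """Mean cosine between randomly paired rows of H (n_tokens x d)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(H), size=n_pairs)
    j = rng.integers(0, len(H), size=n_pairs)
    keep = i != j                          # skip self-pairs
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return float(np.mean(np.sum(Hn[i[keep]] * Hn[j[keep]], axis=1)))

def adjusted_cosine(a, b, baseline):
    """cos_adj(a, b) = cos(a, b) - Anisotropy(layer)."""
    return cosine(a, b) - baseline

# Synthetic stand-ins for hidden states (not real model activations).
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 16))
cone = isotropic + 10.0                    # shared offset creates a narrow cone

base_iso = anisotropy_baseline(isotropic)
base_cone = anisotropy_baseline(cone)
adj = adjusted_cosine(cone[0], cone[1], base_cone)
print(base_iso, base_cone, adj)
```

In the synthetic cone, two unrelated vectors have a raw cosine close to 1, while the adjusted value is close to 0, which is the point of the correction.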

Timkey & van Schijndel 2021: Rogue dimensions

Timkey, van Schijndel. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. arXiv:2109.04404, EMNLP 2021.

Thesis

Anisotropy isn’t diffuse; it’s localized. A handful of “rogue” dimensions (often 1–3) with very large magnitude and high variance dominate cosine similarity between hidden states, swamping every other axis.

Decomposing cosine

For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, write the unnormalized similarity as a sum of per-dimension contributions:

\[\mathbf{x} \cdot \mathbf{y} \;=\; \sum_{k=1}^d x_k y_k.\]

Empirically, in BERT/GPT-2 hidden states, one or two terms in this sum account for the majority of the total. Removing them drops cosine values toward zero, exactly where they should be for unrelated tokens.

A few rogue dimensions have variance an order of magnitude larger than the rest. These dimensions dominate dot products and hence cosine similarities.
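To make the decomposition concrete, here is a small sketch that splits a cosine similarity into per-dimension contributions $x_k y_k / (\|\mathbf{x}\|\,\|\mathbf{y}\|)$. The two vectors are hypothetical, with dimension 0 playing the role of a rogue dimension.

```python
import numpy as np

def cosine_contributions(x, y):
    """Per-dimension terms x_k * y_k / (||x|| ||y||); they sum to cos(x, y)."""
    return x * y / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical vectors: dimension 0 is large in both, the rest are small.
x = np.array([50.0, 1.0, -1.0, 0.5])
y = np.array([48.0, 1.0, 1.0, -0.5])

contrib = cosine_contributions(x, y)
cos_full = contrib.sum()
# Cosine recomputed after dropping the rogue dimension.
cos_without_rogue = cosine_contributions(x[1:], y[1:]).sum()
print(contrib, cos_full, cos_without_rogue)
```

Here a single term carries almost the entire cosine, and the similarity recomputed without it falls close to zero.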

The fix

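Per-dimension standardization (z-scoring) before computing similarities, with $\mu_k$ and $\sigma_k$ the mean and standard deviation of dimension $k$ over a reference sample of hidden states: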

\[x'_k \;=\; \frac{x_k - \mu_k}{\sigma_k}, \qquad \cos_{\text{std}}(a, b) \;=\; \cos(a', b').\]

This neutralizes rogue dimensions’ magnitude advantage without changing the model.
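A minimal sketch of the standardization step on synthetic data. The hidden states are simulated (one dimension is given a large mean and variance to act as a rogue dimension), and `standardize` and `mean_pairwise_cosine` are hypothetical helper names.

```python
import numpy as np

def standardize(H, eps=1e-8):
    """Z-score each dimension of H (n_tokens x d) using statistics of H itself."""
    mu = H.mean(axis=0)
    sigma = H.std(axis=0)
    return (H - mu) / (sigma + eps)

def mean_pairwise_cosine(H):
    """Average cosine over all distinct pairs of rows."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    S = Hn @ Hn.T
    n = len(H)
    return float((S.sum() - n) / (n * (n - 1)))   # drop the diagonal of ones

# Simulated hidden states with one rogue dimension (index 0).
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))
H[:, 0] = 30.0 + 3.0 * rng.normal(size=200)

Z = standardize(H)
before = mean_pairwise_cosine(H)
after = mean_pairwise_cosine(Z)
print(before, after)
```

In practice the means and standard deviations would be estimated once on a reference corpus and then reused; computing them on the same batch, as here, is a simplification.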

“a small number of rogue dimensions, often just 1–3, dominate these measures”, and there is “a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model.”

The crucial subtlety

Rogue dimensions matter for similarity metrics (cosine, dot product), but they are not the dimensions most important to the model’s downstream behavior. Ablating them barely changes model outputs. This dissociation foreshadows Park, Choe, Veitch: the “probing direction” and the “intervention direction” can be different things, and you need a principled framework to relate them.

Park, Choe, Veitch 2023: Linear representation hypothesis

Park, Choe, Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv:2311.03658, ICML 2024.

Thesis

“Concept = direction” can be made precise. Three a priori distinct notions of linear representation (as a subspace, as a probe/measurement, and as an intervention) coincide under a particular non-Euclidean inner product called the causal inner product.

The three notions that need unifying

Subspace: “The concept of gender is represented in a 1-D subspace spanned by $\mathbf{v}_{\text{gender}}$.”
Probe / measurement: “A linear classifier on $\mathbf{v}_{\text{gender}}$ predicts male vs. female.”
Intervention / steering: “Adding $\alpha \mathbf{v}_{\text{gender}}$ to the residual stream changes the model’s outputs toward the male/female pole.”

In Euclidean geometry, there’s no reason these should be the same direction: they’re conceptually different objects (a 1-D subspace lives in the representation space, a probe is a dual vector, an intervention is a tangent direction). Empirically, they often disagree.

The causal inner product

The paper constructs an inner product $\langle \cdot, \cdot \rangle_C$ on representation space that:

Distinguishes the unembedding (output / token) space from the context (residual stream) space.
Defines orthogonality with respect to causal separability: two concepts are orthogonal iff intervening on one doesn’t change the other.
Makes probe directions and steering directions dual under the same inner product.

Concept directions are constructed from counterfactual pairs (e.g., male↔female embeddings), and the causal inner product is derived from the model’s own statistics (essentially a whitening of the unembedding covariance).
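The "whitening" reading can be illustrated with a short sketch. It assumes the inner product takes the form $\langle \mathbf{a}, \mathbf{b} \rangle_C = \mathbf{a}^\top \mathrm{Cov}(\gamma)^{-1} \mathbf{b}$, with the covariance estimated over the rows of an unembedding matrix; the matrix below is synthetic, and the function names are hypothetical, so treat this as an interpretation of the description above rather than the paper's code.

```python
import numpy as np

def causal_inner_product_matrix(unembeddings):
    """M = Cov(gamma)^{-1}, estimated from the rows of a (vocab x d) matrix."""
    cov = np.cov(unembeddings, rowvar=False)
    return np.linalg.inv(cov)

def causal_inner(a, b, M):
    return float(a @ M @ b)

def matrix_sqrt(M):
    """Symmetric square root of a symmetric positive-definite matrix."""
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.sqrt(evals)) @ evecs.T

# Synthetic "unembedding" matrix with correlated dimensions.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0, 0.0],
                [0.0, 1.0, 0.3, 0.0],
                [0.0, 0.0, 1.5, 0.2],
                [0.0, 0.0, 0.0, 0.5]])
G = rng.normal(size=(1000, 4)) @ mix

M = causal_inner_product_matrix(G)
M_half = matrix_sqrt(M)
G_white = (G - G.mean(axis=0)) @ M_half    # whitened unembeddings

a, b = G[0], G[1]
lhs = causal_inner(a, b, M)                # inner product under M
rhs = float((M_half @ a) @ (M_half @ b))   # Euclidean product after whitening
print(lhs, rhs)
```

The two printed numbers agree: taking the inner product under $M$ is the same as whitening first and then using the ordinary dot product, and the whitened unembeddings have identity covariance.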

“high-level concepts are represented linearly as directions in some representation space”, and the causal inner product “respects language structure in a sense we make precise.”

Headline finding

On LLaMA-2, concept directions constructed under the causal inner product are simultaneously good probes and good interventions. Directions constructed under the default Euclidean inner product are not.

The conceptual upshot. The “rogue dimensions” mismatch from Timkey isn’t a bug; it’s evidence that the right geometry isn’t Euclidean. Park et al. give a constructive recipe for the right geometry. This is the theoretical license for everything that follows.

Zou et al. 2023: Representation engineering

Zou, Phan, Chen, Campbell, Guo, Ren, Pan, et al. (senior author Hendrycks). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.

Thesis

Treat population-level neural activations, not individual neurons or circuits, as the primary unit of analysis. High-level concepts (honesty, power-seeking, fairness, emotion) correspond to linear directions in activation space that can be both read (for interpretability) and written (for control).

LAT: Linear Artificial Tomography

The core recipe for finding a concept direction:

Construct paired stimuli: prompts framed to elicit the concept ($S^+$) vs. prompts framed not to ($S^-$). E.g., for “honesty”: $S^+ = $ “Tell the truth: …”, $S^- = $ “Lie about: …”.
Collect hidden states $\mathbf{h}_i^+, \mathbf{h}_i^-$ at a chosen layer for each pair.
Compute the difference matrix $\Delta = \{\mathbf{h}_i^+ - \mathbf{h}_i^-\}$.
Take the top principal component of $\Delta$: that’s the concept direction $\mathbf{v}$.
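The four steps can be sketched on synthetic data as follows. Everything here is illustrative: the hidden states are simulated rather than collected from a model, `lat_direction` is a hypothetical name, and the top component is taken from an uncentered SVD of $\Delta$ (an assumption made so that a consistent mean difference is kept rather than subtracted away).

```python
import numpy as np

def lat_direction(h_pos, h_neg):
    """Top singular direction of the paired differences, oriented toward S+."""
    delta = h_pos - h_neg                          # (n_pairs, d)
    _, _, vt = np.linalg.svd(delta, full_matrices=False)
    v = vt[0]                                      # unit-norm direction
    if (delta @ v).mean() < 0:                     # fix the arbitrary SVD sign
        v = -v
    return v

# Simulated paired hidden states: each pair shares a base state, and the
# positive member is shifted along a planted concept direction.
rng = np.random.default_rng(0)
d, n_pairs = 16, 200
v_true = np.zeros(d)
v_true[0] = 1.0
base = rng.normal(size=(n_pairs, d))
h_neg = base + 0.1 * rng.normal(size=(n_pairs, d))
h_pos = base + 3.0 * v_true + 0.1 * rng.normal(size=(n_pairs, d))

v = lat_direction(h_pos, h_neg)
print(float(v @ v_true))
```

The recovered direction aligns closely with the planted one because the shared base state cancels in the difference, leaving the concept shift plus small noise.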

Read and write

Read (probing): project a new hidden state onto $\mathbf{v}$ to score it for the concept:

\[\text{score}(\mathbf{h}) \;=\; \mathbf{v}^\top \mathbf{h}.\]

Write (steering): at inference, intervene at the chosen layer with

\[\mathbf{h} \;\leftarrow\; \mathbf{h} + \alpha \cdot \mathbf{v},\]

where $\alpha$ is a signed scalar: positive values inject the concept, negative values suppress it.
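Read and write fit in two one-line functions. The sketch below uses a hypothetical unit-norm direction and a made-up hidden state; with a unit-norm $\mathbf{v}$, steering by $\alpha$ shifts the read-out score by exactly $\alpha$, since $\mathbf{v}^\top(\mathbf{h} + \alpha\mathbf{v}) = \mathbf{v}^\top\mathbf{h} + \alpha\|\mathbf{v}\|^2$.

```python
import numpy as np

def read_score(h, v):
    """Read: project the hidden state onto the concept direction."""
    return float(v @ h)

def steer(h, v, alpha):
    """Write: shift the hidden state along the concept direction."""
    return h + alpha * v

# Hypothetical unit-norm direction and hidden state.
v = np.array([0.6, 0.8, 0.0])
h = np.array([1.0, -2.0, 0.5])

before = read_score(h, v)
after_plus = read_score(steer(h, v, 4.0), v)     # inject the concept
after_minus = read_score(steer(h, v, -4.0), v)   # suppress the concept
print(before, after_plus, after_minus)
```

In a real model the write step would be applied to the residual stream at the chosen layer during the forward pass (for example through a framework hook), which this sketch does not attempt.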

RepE in one picture. The concept direction v is found offline by LAT; at inference, it's added to (or subtracted from) the residual stream at a chosen layer.

Headline finding

On LLaMA-2-13B, LAT-based honesty steering improves TruthfulQA MC1 from 35.7% → 65.6%, a near-doubling, with no fine-tuning. The same recipe controls morality, emotion, power-seeking, fairness, and bias, and detects lying with high AUC.

“a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more” can be addressed by reading and editing representations.

Why this matters

RepE is the engineering manifesto for the linear representation hypothesis. It demonstrates that you can get useful behavior changes with a one-line intervention, without gradient updates, RLHF, or a fine-tuning dataset. The cost: each $\mathbf{v}$ still requires labeled contrastive prompts, so the method scales linearly with the number of supervised concepts you want.

Bricken et al. 2023: Sparse autoencoders

Bricken, Templeton, Batson, Chen, Jermyn, Conerly, Turner, Anil, Denison, Askell, et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. transformer-circuits.pub, October 2023.

Thesis

Individual neurons in transformers are polysemantic: each neuron responds to mixtures of unrelated inputs. The reason is superposition: the model packs more features than it has dimensions by representing them as overlapping linear combinations of neurons. A sparse autoencoder (SAE) trained on activations can recover an over-complete, sparse, approximately monosemantic feature basis.

The SAE architecture

Given activations $\mathbf{x} \in \mathbb{R}^d$ from some layer (e.g., the 512-dim MLP output of a 1-layer transformer):

\[\begin{aligned} \mathbf{f}(\mathbf{x}) &= \text{ReLU}\big(\mathbf{W}_e (\mathbf{x} - \mathbf{b}_d) + \mathbf{b}_e\big) \in \mathbb{R}^F \\ \hat{\mathbf{x}} &= \mathbf{W}_d \mathbf{f}(\mathbf{x}) + \mathbf{b}_d \\ \mathcal{L} &= \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 \;+\; \lambda \|\mathbf{f}(\mathbf{x})\|_1 \end{aligned}\]

Key design choices:

$F \gg d$: the dictionary is overcomplete. Bricken et al. sweep from $1\times$ (512 features) up to $256\times$ (~131k features).
$\lambda \|\mathbf{f}(\mathbf{x})\|_1$ enforces sparsity, so only a few features fire per input. This is the dictionary-learning prior that breaks superposition.
Decoder columns of $\mathbf{W}_d$ are constrained to unit norm to prevent the L1 penalty from being gamed by shrinking $\mathbf{f}$ and scaling $\mathbf{W}_d$.
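The forward pass and loss above can be written out directly in NumPy. This is a sketch of the equations only: there is no training loop, the initialization (random unit-norm decoder columns, encoder set to the decoder transpose, zero biases) is an assumption chosen for illustration, and `init_sae` and `sae_forward` are hypothetical names.

```python
import numpy as np

def init_sae(d, F, seed=0):
    """Illustrative initialization; not the scheme used in the paper."""
    rng = np.random.default_rng(seed)
    W_d = rng.normal(size=(d, F))
    W_d /= np.linalg.norm(W_d, axis=0, keepdims=True)   # unit-norm decoder columns
    return {"W_e": W_d.T.copy(), "b_e": np.zeros(F), "W_d": W_d, "b_d": np.zeros(d)}

def sae_forward(params, x, lam):
    """Return (feature activations f, reconstruction x_hat, loss)."""
    # f(x) = ReLU(W_e (x - b_d) + b_e)
    f = np.maximum(0.0, params["W_e"] @ (x - params["b_d"]) + params["b_e"])
    # x_hat = W_d f(x) + b_d
    x_hat = params["W_d"] @ f + params["b_d"]
    # L = ||x - x_hat||_2^2 + lambda * ||f(x)||_1
    loss = float(np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f)))
    return f, x_hat, loss

d, F = 8, 32                                  # 4x overcomplete toy dictionary
params = init_sae(d, F)
x = np.random.default_rng(1).normal(size=d)   # stand-in for an activation vector
f, x_hat, loss = sae_forward(params, x, lam=1e-3)
print(f.shape, x_hat.shape, loss)
```

During training, the unit-norm constraint would need to be re-imposed on the decoder columns after each update (or enforced by construction); otherwise the optimizer can lower the L1 term by shrinking $\mathbf{f}$ while scaling up $\mathbf{W}_d$.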

What’s a feature?

Each column $\mathbf{W}_d[:, k]$ is a learned “feature direction” in the original activation space. The corresponding scalar $\mathbf{f}(\mathbf{x})_k$ is the feature’s activation. Bricken et al. find that for the right choice of $\lambda$ and $F$, individual features cleanly correspond to interpretable concepts:

“DNA codons” feature, fires on sequences of nucleotide letters.
“Arabic script” feature.
“Base64 strings” feature.
“References to specific named entities,” etc.

In their A/1 run (4,096 features on a 512-dim 1-layer transformer, an 8× overcomplete dictionary), a random sample of 162 features is evaluated, and the vast majority are judged interpretable.

Feature universality

Training SAEs on two different transformers (the same architecture trained from different random seeds) recovers largely the same features in both, which is evidence that the features reflect something the models learn rather than artifacts of a single training run.

“decompose the activations of a one-layer transformer into features that are individually interpretable”, superposition causes “neurons to be polysemantic, activating in response to mixtures of unrelated inputs.”

Why this closes the program

Everything from Mikolov to Zou et al. was about supervised direction-finding: you know what concept you’re looking for, and you craft data to surface it. SAEs do this unsupervised and at scale: train one model, get thousands of feature directions, each interpretable by inspection. The “linear representation hypothesis” finally has a constructive recipe that scales.

Tying it together

The single throughline

If you read these eight papers as one argument, it goes like this:

Mikolov (2013): “Concepts are directions.” (Observation.)
Arora (2017): “If concepts are directions, sentences are weighted bags of them, minus the common direction.” (First applied correction.)
Mu & Viswanath (2018): “The same correction generalizes: subtract the mean and the top-$D$ PCs from any embedding space.” (Canonical fix.)
Ethayarajh (2019): “Contextual embeddings are even more anisotropic than static ones. We need to know this before reporting any similarity.” (Diagnosis.)
Timkey & van Schijndel (2021): “The anisotropy isn’t spread out; it’s concentrated in 1–3 rogue dimensions, which dominate similarity but don’t matter for behavior.” (Sharper diagnosis.)
Park, Choe, Veitch (2023): “The ‘rogue dimensions matter for similarity but not behavior’ mismatch arises because the natural inner product isn’t Euclidean. Here’s the right one.” (Theoretical resolution.)
Zou et al. (2023): “Under the right geometry, you can read and write concepts with a one-line intervention $\mathbf{h} \leftarrow \mathbf{h} + \alpha \mathbf{v}$.” (Applied program.)
Bricken et al. (2023): “And you don’t need to know which concepts to look for in advance: sparse autoencoders find a whole monosemantic basis automatically.” (Scalable discovery.)

What’s the same across papers

Operation	Where it appears
Mean subtraction	SIF, ABTT, Ethayarajh’s baseline correction, Timkey’s standardization
PCA / SVD of activations	SIF, ABTT, RepE’s LAT, SAE initialization
Pairwise contrastive prompts	Mikolov’s analogy tests, Park’s counterfactual pairs, RepE’s $S^+ / S^-$ stimuli
“Direction = concept”	All eight

What’s open

SAE composition. A single-layer SAE finds features in one layer. How do features compose across layers into circuits? (Anthropic’s follow-up work on “circuits” extends this.)
Steering at scale. RepE’s per-concept supervision doesn’t scale to thousands of concepts. Combining SAE features (unsupervised) with steering (RepE) is the obvious synthesis.
The right inner product, in practice. Park et al. give a clean theory, but constructing the causal inner product for a frontier model is empirically hard.
Are concepts really linear? The hypothesis is empirically successful but not derived from first principles. Why should a network trained with cross-entropy and gradient descent produce linearly-decodable concepts? Open theoretical question.

Where to start if you want to do work in this area

Get hands-on with SAEs. The reference implementations (Anthropic’s, EleutherAI’s sae library, the sae_lens package) train in a few hours on a single GPU for small models. Train one on GPT-2 small. Inspect features.
Reproduce a RepE experiment. LAT for “honesty” on LLaMA-3-8B is a weekend project. The result will surprise you: it really does work.
Read circuits. Anthropic’s transformer-circuits.pub is the working notebook of this field.

A reading path. All page references and formulas verified against arXiv abstracts and OpenReview entries; the SAE details are standard and consistent with the public Anthropic write-up.