Encoders - Squash Reality into a Vector Space

From SIFT histograms to CLIP's billion-parameter shared space, this post traces how the vision and text communities independently discovered the same compression trick and then converged on a single shared approach. It covers handcrafted features, shallow embeddings, deep CNNs, BERT, ViT, self-supervised pre-training, and multimodal alignment, along with open problems and portfolio projects.

From a 128-dimensional SIFT histogram to a 1024-dimensional CLIP vector, the encoder has been machine learning’s most productive primitive: compress reality, discard noise, and keep only the geometry that makes predictions easy.


The throughline

An encoder is any learned (or engineered) function $f: \mathcal{X} \to \mathbb{R}^d$ that maps a high-dimensional, noisy input into a compact vector such that the structure useful for downstream tasks is preserved and irrelevant variation is discarded. The output vector is called an embedding; the space it inhabits is called the representation space or latent space. These two words - encoder and embedding - are the backbone of almost every modern machine learning system, from search engines to protein folding to autonomous driving.

The central insight that has driven three decades of progress is deceptively simple: if you can find a space where semantically similar things are geometrically close, every downstream problem becomes easier. Classification becomes nearest-neighbor lookup. Retrieval becomes cosine similarity. Transfer learning becomes fine-tuning a linear head. The hard problem is not the downstream task - it is learning the space.

Vision and text communities attacked this problem in parallel, mostly ignoring each other, and repeatedly invented the same solutions at intervals of a few years. The convergence that happened in 2021 with CLIP was not an accident; it was inevitable once you recognize that an image patch and a word token are both just sequences of features that need to be compressed. The field has traveled through seven distinct waves:

  1. Handcrafted features (pre-2012): human engineers decided what structure to preserve; no learning involved
  2. Shallow learned embeddings (2012-2014): data-driven feature extraction; one layer of learned geometry
  3. Depth and hierarchy (2014-2017): stack layers to capture multi-scale structure; skip connections for stability
  4. Contextualization (2018-2019): the same token means different things in different contexts; the encoder must be sensitive to position and neighbors
  5. Attention everywhere (2020-2022): model arbitrary long-range dependencies; vision adopts the transformer
  6. Self-supervised pre-training (2020-2023): use the input as its own supervision; contrastive and masked objectives remove the label bottleneck
  7. Multimodal convergence (2021-present): one shared space for images and text; the two encoder lineages merge
[Figure: timeline of the seven waves, listing vision / text examples and what broke each wave. Handcrafted (pre-2012): SIFT, HOG / TF-IDF, LSA; breaks: no generalization. Shallow embeddings (2012-2014): AlexNet / word2vec, GloVe; breaks: no long-range. Depth (2014-2017): VGG, ResNet / fastText, doc2vec; breaks: static context. Context (2018-2019): SE-Net, NL-Nets / ELMo, BERT; breaks: needs labels. Attention (2020-2022): ViT, Swin / SBERT, SimCSE; breaks: unimodal. Self-supervised (2020-2023): DINO, MAE / SimCSE, E5; breaks: unimodal. Multimodal (2021-present): CLIP, ALIGN, SigLIP, BLIP-2; ongoing. Common thread: map high-dimensional input to a compact vector, then transfer to a downstream task.]
Seven waves of encoder research. Vision and text ran in parallel; wave 7 is where they merged.

Act I - Before Learning (pre-2012)

In this era, the encoder is a hand-designed function $\phi: \mathcal{X} \to \mathbb{R}^d$ where every design choice - which invariances to build in, which statistics to compute, how large $d$ should be - is made by a human engineer. No parameters are learned from data.

Vision: Engineering the Perfect Descriptor

Before neural networks became practical, vision researchers had to decide manually what it means for two image patches to be “the same.” The answer they converged on was: two patches are the same if they produce the same local statistics, regardless of illumination change, rotation, and small geometric distortions.

The canonical result of this reasoning is David Lowe’s SIFT descriptor (Scale-Invariant Feature Transform, IJCV 2004). SIFT detects keypoints at multiple scales by finding extrema of a Difference-of-Gaussian pyramid, then describes each keypoint as a 128-dimensional histogram of local gradient orientations. Because orientation is measured relative to the dominant gradient direction at the keypoint, the descriptor is rotation-invariant. Because gradients are unaffected by additive brightness shifts and the descriptor is normalized to unit length, it is largely invariant to affine illumination changes. One image produces a set of these 128-dimensional vectors; matching two images becomes nearest-neighbor lookup in descriptor space.

Navneet Dalal and Bill Triggs pushed this histogram idea further in “Histograms of Oriented Gradients for Human Detection” (CVPR 2005). HOG divides an image window into a dense grid of cells, computes a gradient orientation histogram per cell, and concatenates them into a single vector - typically 3780-dimensional for a pedestrian detection window. Unlike SIFT, HOG is computed densely over a fixed-size window rather than at detected keypoints, making it better suited as a fixed-dimensional image representation for linear classifiers.
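The 3780 figure follows from the window geometry. Here is a small arithmetic sketch, assuming the commonly cited Dalal-Triggs configuration (64x128 window, 8x8-pixel cells, 2x2-cell blocks sliding by one cell, 9 orientation bins). The function name and its defaults are illustrative and not taken from any library.

```python
def hog_descriptor_length(win_w=64, win_h=128, cell=8,
                          block_cells=2, block_stride_cells=1, bins=9):
    """Length of a HOG vector for one detection window (assumed layout)."""
    cells_x, cells_y = win_w // cell, win_h // cell               # 8 x 16 cells
    blocks_x = (cells_x - block_cells) // block_stride_cells + 1  # 7
    blocks_y = (cells_y - block_cells) // block_stride_cells + 1  # 15
    # Each block concatenates the histograms of its block_cells^2 cells.
    return blocks_x * blocks_y * block_cells ** 2 * bins

print(hog_descriptor_length())  # 7 * 15 * 4 * 9 = 3780
```

Because blocks overlap, each cell contributes to several blocks, which is why the vector is longer than cells times bins.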

Sivic and Zisserman’s “Video Google” (ICCV 2003) completed this era’s architecture: extract many SIFT descriptors from an image, quantize each to its nearest cluster centroid (a “visual word”) from a pre-trained vocabulary, count the visual words, and weight by inverse document frequency. The result is a Bag of Visual Words (BoVW) vector that represents the whole image as a histogram over a discrete vocabulary. This is an explicit import of the NLP bag-of-words model into vision.
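The quantize-and-count step can be sketched as follows. This is a minimal NumPy illustration with a random stand-in vocabulary; a real system would learn the centroids with k-means on training descriptors and apply the IDF weighting afterwards. The function name `bovw_histogram` is hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each visual word.

    descriptors: (n, 128) array of SIFT-like vectors.
    vocabulary:  (k, 128) array of cluster centroids ("visual words").
    """
    # Squared Euclidean distance from every descriptor to every centroid.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest visual word per descriptor
    return np.bincount(words, minlength=len(vocabulary))

rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(5, 128))    # stand-in for k-means centroids
descriptors = rng.normal(size=(20, 128))  # stand-in for one image's descriptors
hist = bovw_histogram(descriptors, vocabulary)
print(hist, hist.sum())                   # k counts that sum to 20
```

Note that the histogram has the same length for every image regardless of how many keypoints were detected, which is what makes it usable as a fixed-size input.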

The shared failure of all these representations is that they encode presence, not context. A histogram of gradient orientations says nothing about where those orientations are relative to each other or how they interact.

VISION Raw Image H x W x 3 Detect Keypoints DoG extrema Grad Histograms 4x4 cells x 8 bins Quantize (BoVW) k visual words Vector 128-d or k-d TEXT Raw Text unstructured Tokenize split on whitespace Count term frequency Reweight TF-IDF / LSA Vector |V|-d sparse
The two parallel feature-engineering pipelines of the pre-learning era. Both convert raw inputs to vectors through human-designed statistics; neither learns from data.

Text: Counting Words

The NLP community’s answer to representation was counting. A document is a bag of words: ignore word order, ignore syntax, count occurrences. The resulting vector has one dimension per vocabulary token, most of which are zero, and its $i$-th entry is the count of the $i$-th word in the document.

Raw counts conflate the importance of words. Function words (“the”, “is”, “a”) appear everywhere and carry no meaning; rare domain-specific terms carry enormous meaning. TF-IDF (Term Frequency - Inverse Document Frequency) corrects this with a simple reweighting. For term $t$ in document $d$ from a corpus $D$:

\[\text{tfidf}(t, d, D) = \underbrace{\text{tf}(t,d)}_{\text{local freq}} \cdot \underbrace{\log\frac{|D|}{|\{d' \in D : t \in d'\}|}}_{\text{corpus rarity}}\]

Terms that appear everywhere get a near-zero IDF weight; terms that are rare in the corpus but frequent in the document get a large weight. The resulting sparse, high-dimensional vector was good enough to power text retrieval for two decades: TF-IDF-style weighting underpinned much of the search and document clustering of the 1990s and 2000s.
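A minimal sketch of the formula above, using only the standard library. It assumes raw counts for tf and the unsmoothed logarithmic IDF exactly as written; production libraries typically add smoothing and normalization, which are omitted here.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the protein folds into a helix".split(),
]
vectors = tfidf(corpus)
print(vectors[0]["the"], vectors[2]["protein"])  # 0.0 and log(3)
```

"the" occurs in every document, so its weight collapses to zero even though it is the most frequent token; "protein" occurs in one document out of three and gets the full log(3).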

Scott Deerwester et al.’s Latent Semantic Analysis (“Indexing by Latent Semantic Analysis”, JASIS 1990) took term-document vectors one step further: apply truncated SVD to the term-document matrix and project both documents and queries into a low-rank subspace. This captures co-occurrence structure (“car” and “automobile” land near each other) but is a global linear projection with no capacity to model polysemy or syntactic context.
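The "car"/"automobile" effect can be reproduced on a toy matrix. The sketch below assumes raw counts and a hand-made five-term, four-document corpus in which the two words never share a document but both co-occur with "engine"; the data and helper names are illustrative only.

```python
import numpy as np

def lsa(term_doc, k):
    """Rank-k LSA: returns k-dimensional term vectors and document vectors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]   # (n_terms, k)
    doc_vecs = Vt[:k].T * s[:k]    # (n_docs, k)
    return term_vecs, doc_vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

terms = ["car", "automobile", "engine", "flower", "petal"]
term_doc = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

term_vecs, doc_vecs = lsa(term_doc, k=2)
print(cosine(term_doc[0], term_doc[1]))    # raw rows: no shared document
print(cosine(term_vecs[0], term_vecs[1]))  # after the rank-2 projection
```

In the raw matrix the two rows are orthogonal; after truncation they point in the same direction because the discarded component was exactly the one that distinguished them.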


Act II - The Shallow Embedding Era (2012-2014)

Both vision and text made the same move in consecutive years: replace $\phi$ with a learned $f_\theta: \mathcal{X} \to \mathbb{R}^d$ where $\theta$ is optimized on data. The resulting embedding space encodes semantic similarity as geometric proximity: $\text{sim}(f_\theta(x), f_\theta(y)) \approx \text{semantic-similarity}(x, y)$.

The ImageNet Moment

The year 2012 is machine learning’s punctuation mark. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet Large Scale Visual Recognition Challenge and cut the top-5 error from 26% to 15% - a gap so large that every other team switched to deep learning within a year (“ImageNet Classification with Deep Convolutional Neural Networks”, NeurIPS 2012).

AlexNet’s key insight for representation is that deep, stacked convolutions learn a hierarchy of increasingly abstract features: roughly, edges in the first layer, textures and parts in the middle layers, and object-level patterns in the last convolutional layers. The penultimate fully connected layer produces a 4096-dimensional vector. That vector was never explicitly designed as an embedding, but it turned out to be a remarkably general image representation: training just a linear classifier on top of it for a new dataset often matched or beat prior hand-engineered pipelines.

The loss that trains this representation is softmax cross-entropy:

\[\mathcal{L}_\text{CE} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(\mathbf{w}_{y_i}^\top \mathbf{z}_i)}{\sum_{c=1}^C \exp(\mathbf{w}_c^\top \mathbf{z}_i)}\]

where $\mathbf{z}_i \in \mathbb{R}^{4096}$ is the penultimate-layer embedding of image $i$, $y_i$ is its class label, and $\mathbf{w}_c$ is the weight vector of class $c$. The encoder is trained to make the correct class’s inner product largest; in doing so, it is forced to produce embeddings that are linearly separable by class.
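A minimal NumPy sketch of this loss, assuming the embeddings are already computed and omitting the bias term, as the formula does. The 8-dimensional vectors are a stand-in for the 4096-dimensional ones.

```python
import numpy as np

def softmax_cross_entropy(Z, W, y):
    """Mean softmax cross-entropy.

    Z: (N, d) embeddings, W: (C, d) class weight vectors, y: (N,) integer labels.
    """
    logits = Z @ W.T                                     # w_c^T z_i for every class c
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))
y = np.array([0, 1, 2, 0])
W_untrained = np.zeros((3, 8))                           # every class ties
loss_at_chance = softmax_cross_entropy(Z, W_untrained, y)
print(loss_at_chance)                                    # log(3): chance level
```

Training lowers the loss only by moving embeddings so that each one has its largest inner product with its own class vector, which is the linear-separability pressure described above.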

The Word2Vec Moment

One year later, Tomas Mikolov et al. published word2vec (“Distributed Representations of Words and Phrases and their Compositionality”, NeurIPS 2013). The core idea is to train a shallow two-layer neural network to predict context words from a center word (skip-gram) or vice versa (CBOW). No labels are needed because the corpus itself provides supervision: words that appear in similar contexts should have similar embeddings.

The skip-gram objective for a corpus of $T$ words with context window $c$ is:

\[\mathcal{L}_\text{SG} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t), \quad P(w_O \mid w_I) = \frac{\exp\!\left({\mathbf{v}'_{w_O}}^{\!\top} \mathbf{v}_{w_I}\right)}{\sum_{w=1}^{W}\exp\!\left({\mathbf{v}'_{w}}^{\!\top} \mathbf{v}_{w_I}\right)}\]

where $\mathbf{v}_w \in \mathbb{R}^{300}$ is the input (center-word) embedding of word $w$, $\mathbf{v}'_w$ is its output (context-word) embedding, and $W$ is the vocabulary size. The geometric surprise of this training objective is that, when trained with negative sampling, it implicitly factorizes a shifted pointwise mutual information matrix of word co-occurrences, and the resulting space has approximately linear arithmetic structure: $\mathbf{v}_\text{king} - \mathbf{v}_\text{man} + \mathbf{v}_\text{woman} \approx \mathbf{v}_\text{queen}$.
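To make "the corpus itself provides supervision" concrete, the sketch below generates skip-gram training pairs and evaluates the softmax for one center word. It assumes separate input and output embedding tables (called `V_in` and `V_out` here, names chosen for illustration) and random untrained vectors; real implementations replace the full softmax with negative sampling or a hierarchical softmax for speed.

```python
import numpy as np

def skipgram_pairs(tokens, c):
    """All (center, context) training pairs within a window of c tokens."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def context_probs(center_id, V_in, V_out):
    """P(w_O | w_I) for every candidate context word via a full softmax."""
    scores = V_out @ V_in[center_id]
    scores = scores - scores.max()   # stabilize the exponentials
    p = np.exp(scores)
    return p / p.sum()

pairs = skipgram_pairs(["the", "cat", "sat"], c=1)
print(pairs)

rng = np.random.default_rng(0)
V_in, V_out = rng.normal(size=(5, 300)), rng.normal(size=(5, 300))
probs = context_probs(2, V_in, V_out)   # distribution over a 5-word vocabulary
```

Every pair is a free labeled example: the center word is the input and the neighbor is the target.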

Jeffrey Pennington, Richard Socher, and Christopher Manning independently arrived at a similar space via explicit co-occurrence matrix factorization in GloVe (“GloVe: Global Vectors for Word Representation”, EMNLP 2014). GloVe makes the co-occurrence statistics and the factorization objective explicit:

\[\mathcal{L}_\text{GloVe} = \sum_{i,j=1}^V f(X_{ij})\left(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2\]

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, and $f$ is a weighting function that caps the contribution of very frequent pairs. Both word2vec and GloVe produce static, context-free vectors: “bank” in “river bank” and “bank” in “bank account” share one vector.
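A sketch of the GloVe objective in NumPy. The weighting function uses the form and constants reported in the GloVe paper ($x_\text{max} = 100$, exponent $0.75$); the function names and the tiny co-occurrence matrix are illustrative assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): grows as (x / x_max)^alpha, then is capped at 1."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    i, j = np.nonzero(X)   # f(0) = 0, so zero counts contribute nothing
    pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return (glove_weight(X[i, j]) * err ** 2).sum()

X = np.array([[0.0, 4.0],
              [4.0, 0.0]])                # toy co-occurrence counts
W = np.zeros((2, 3))
W_tilde = np.zeros((2, 3))
b = np.full(2, np.log(4.0) / 2)           # biases alone explain log X_ij here
b_tilde = np.full(2, np.log(4.0) / 2)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```

Summing only over nonzero counts is what keeps training tractable on a sparse co-occurrence matrix.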

Note the structural parallel: AlexNet (NeurIPS 2012) and word2vec (NeurIPS 2013) were published in consecutive years and made the same move. Both replaced hand-designed statistics with data-driven optimization. Both produced vectors that could be transferred to downstream tasks. Neither community was paying attention to the other.

[Figure: AlexNet (vision): image (224x224x3) -> Conv1-5 + pool -> FC6, FC7 -> 4096-d embedding, trained with CE loss. word2vec (text): word token -> embedding lookup -> dot product + softmax -> 300-d embedding, trained with SG loss. Shared idea (2012-2013): data decides what features matter; geometry = semantics.]
AlexNet and word2vec made the same move in consecutive years: replace hand-designed statistics with a learned embedding trained via a self-supervised or supervised objective. The two communities would not formally collaborate for almost a decade.

Act III - Depth and Global Context (2014-2017)

Shallow learned embeddings exposed a new bottleneck: representational depth. A single layer of learned features cannot capture the hierarchy of structure in natural images or the multi-scale compositionality of language. This act is about stacking layers, and the engineering problems that had to be solved to make stacking work - most importantly, how to keep the gradient $\partial \mathcal{L} / \partial \mathbf{x}$ from vanishing across 50 or 100 layers.

Going Deeper for Vision

If one layer of learned features is good, more layers should be better - except that deeper networks were notoriously difficult to train. Gradients vanished across many layers; activations saturated; weight initialization mattered enormously. The 2014-2017 period was largely about solving these engineering problems so that depth could be exploited.

Karen Simonyan and Andrew Zisserman’s VGG networks (“Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015) showed that homogeneous stacks of 3x3 convolutions could scale to 16-19 layers and improve over AlexNet by a large margin. The key insight is that two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution but fewer parameters and an extra nonlinearity. VGG-16 and VGG-19 became standard feature extractors for a generation of transfer learning papers.
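The receptive-field and parameter claim is simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the helper names are illustrative.

```python
def conv_stack_params(kernel, channels, layers):
    """Weights in `layers` stacked kernel x kernel convolutions (biases ignored)."""
    return layers * kernel * kernel * channels * channels

def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

C = 64
print(receptive_field(3, 2), receptive_field(5, 1))            # 5 5
print(conv_stack_params(3, C, 2), conv_stack_params(5, C, 1))  # 73728 102400
```

Two 3x3 layers cost $18C^2$ weights against $25C^2$ for one 5x5 layer, a saving of 28% for the same field of view.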

The deeper breakthrough came from Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in “Deep Residual Learning for Image Recognition” (CVPR 2016). ResNet introduced the skip connection: instead of learning the desired mapping $H(\mathbf{x})$ directly, each block learns the residual $\mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$:

\[H(\mathbf{x}) = \mathcal{F}(\mathbf{x},\, \{W_i\}) + \mathbf{x}\]

If the residual branch outputs $\mathcal{F}(\mathbf{x}) \approx 0$, the block reduces to the identity, so adding layers cannot make the target mapping harder to represent. Gradients also pass through the additive skip connection unattenuated, so vanishing gradients stop being a barrier to depth. An ensemble of ResNets won the ILSVRC 2015 classification task with a top-5 error of 3.57%, lower than the roughly 5% error commonly cited for human annotators on that benchmark. The representations learned by deep ResNets became the backbone of most vision systems for the next five years.
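A minimal sketch of the residual idea, with dense layers standing in for the convolutions and batch normalization of the real block (a simplification made here for brevity). It shows that a residual branch with zeroed output weights leaves the input untouched.

```python
import numpy as np

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, with F a two-layer branch (dense stand-in for conv + BN)."""
    h = np.maximum(0.0, x @ W1)   # first weight layer + ReLU
    f = h @ W2                    # second weight layer: the residual F(x)
    return f + x                  # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))
W1 = rng.normal(size=(16, 16))
out = residual_block(x, W1, np.zeros((16, 16)))   # residual branch switched off
print(np.allclose(out, x))                        # True: the block is the identity
```

A plain (non-residual) block with the same zeroed weights would output all zeros, which is the difference that makes very deep stacks trainable.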

[Figure: residual block. Input x (C x H x W) passes through two weight layers (BN + Conv, with a ReLU between them) to produce the residual F(x); the skip connection carries x unchanged; the two are summed into H(x) = F(x) + x, with a ReLU applied after the addition.]
The ResNet residual block. The skip connection lets gradients flow directly from output to input, making very deep networks trainable. The block learns the residual F(x) = H(x) - x rather than the full mapping H(x).

Subwords and Documents for Text

The text side of this era had a more modest-sounding problem: word2vec broke on rare and out-of-vocabulary words. Piotr Bojanowski et al.’s fastText (“Enriching Word Vectors with Subword Information”, TACL 2017) extended word2vec by representing each word as the mean of its character $n$-gram embeddings:

\[\mathbf{v}_w = \frac{1}{|G_w|}\sum_{g \in G_w} \mathbf{z}_g\]

where $G_w$ is the set of character $n$-grams (e.g., for $n=3$: “app”, “ppl”, “ple” from “apple”) and $\mathbf{z}_g \in \mathbb{R}^{300}$ is a learned $n$-gram embedding. This allows fastText to handle morphologically complex languages, misspellings, and technical terms that never appeared in training data.
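The sketch below mirrors the "apple" example. It is deliberately simplified: released fastText wraps each word in boundary markers, uses a range of n-gram lengths, and hashes n-grams into a fixed table, whereas this version uses plain trigrams and simply skips n-grams it has never seen (an assumption made for illustration).

```python
import numpy as np

def char_ngrams(word, n=3):
    """Contiguous character n-grams of a word, without boundary markers."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def word_vector(word, ngram_vectors, n=3):
    """Mean of the known n-gram embeddings; assumes at least one n-gram is known."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n) if g in ngram_vectors]
    return np.mean(known, axis=0)

print(char_ngrams("apple"))   # ['app', 'ppl', 'ple']

rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=4) for g in char_ngrams("apple")}
v_apple = word_vector("apple", ngram_vectors)
v_apples = word_vector("apples", ngram_vectors)   # never seen, shares three trigrams
```

The unseen word "apples" still receives a vector because most of its trigrams were learned from "apple".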

Quoc Le and Tomas Mikolov’s doc2vec (“Distributed Representations of Sentences and Documents”, ICML 2014) extended word2vec to document-level: add a unique “paragraph token” to the context window of every prediction in a document and train it the same way. The paragraph token acts as a memory of document context across all word predictions. The resulting paragraph vector is a fixed-size representation of a whole document, independent of length.

Both fastText and doc2vec are still static: every instance of a word or paragraph maps to the same embedding at inference time. What the field needed was a representation that could shift based on context.


Act IV - Contextualization (2018-2019)

Static embeddings are functions of the token type alone: $f(w_t) = \mathbf{v}_{w_t}$. A contextual encoder is a function of the token and its sequence context: $f(w_t, \mathbf{x}) = \mathbf{h}_t^{(L)}$, where $\mathbf{h}_t^{(L)}$ is the $L$-th layer hidden state at position $t$ given the full input sequence $\mathbf{x}$. This act covers the architectures that made contextual encoding practical.

BERT and the Context Revolution

The most consequential insight of this era is embarrassingly simple: the same token should produce different embeddings in different contexts. “Bank” in “river bank” and “bank” in “bank account” are different things; they should not be the same vector. Any encoder that produces fixed embeddings per token - regardless of neighbors - cannot represent polysemy, and polysemy is everywhere in natural language.

Matthew Peters et al.’s ELMo (“Deep Contextualized Word Representations”, NAACL 2018) was the first practical solution. ELMo trains a bidirectional two-layer LSTM language model: predict the next token left-to-right, and predict the previous token right-to-left. The embedding of a token is a learned weighted sum of all the hidden states across both directions and both layers. For the first time, the same word type produces different vectors at different positions in a sentence.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova then scaled this idea into the transformer era with BERT (“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019). BERT masks 15% of input tokens at random and trains the encoder to predict the masked tokens from all other tokens (Masked Language Modeling):

\[\mathcal{L}_\text{MLM} = -\sum_{t \in \mathcal{M}} \log P\!\left(w_t \;\middle|\; \mathbf{x}_{\setminus \mathcal{M}}\right)\]

where $\mathcal{M}$ is the set of masked positions and $\mathbf{x}_{\setminus \mathcal{M}}$ is the sequence with masked tokens replaced by a [MASK] token. Unlike causal language modeling, MLM allows the encoder to attend to both left and right context simultaneously, producing fully bidirectional representations. BERT-base produces 768-dimensional contextual embeddings; BERT-large produces 1024-dimensional ones. Fine-tuning on GLUE benchmarks set new state of the art on every task.
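The corruption step can be sketched with the standard library. This version follows the description above and always substitutes [MASK]; the published BERT recipe replaces only most of the selected tokens with [MASK] and leaves or randomizes the rest, a detail omitted here. The function name and seed handling are illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Pick about 15% of positions, replace them with [MASK], return the targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for t in masked_positions:
        corrupted[t] = "[MASK]"
    # The MLM loss is summed only over these positions (the set M).
    targets = {t: tokens[t] for t in masked_positions}
    return corrupted, targets

tokens = ("the encoder must predict each hidden word "
          "from both its left and right context").split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

The encoder sees `corrupted` in full, so each prediction can draw on tokens to both the left and the right of the gap.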

[Figure: static vs. contextual embeddings for "bank" in "I went to the river bank" and "I deposited at the bank". Static (word2vec): both occurrences share one vector per word type. Contextual (BERT): a river-bank vector and a finance-bank vector; context shifts the vector.]
Static vs. contextual embeddings for the word "bank". Word2vec assigns one vector per word type. BERT assigns a different vector at each occurrence, conditioned on the full surrounding context.

Squeeze-and-Excitation for Vision

The vision community arrived at context sensitivity via a different path. Jie Hu, Li Shen, and Gang Sun introduced Squeeze-and-Excitation Networks (“Squeeze-and-Excitation Networks”, CVPR 2018), which learn to dynamically re-weight channels based on global context. The module first squeezes spatial information into a channel descriptor via global average pooling, then excites by learning a channel-wise scaling vector:

\[\mathbf{z} = \text{GlobalAvgPool}(\mathbf{X}), \quad \mathbf{s} = \sigma\!\left(W_2\,\delta(W_1\,\mathbf{z})\right), \quad \tilde{x}_c = s_c \cdot x_c\]

where $\mathbf{z} \in \mathbb{R}^C$ is the squeezed descriptor, $\delta$ is ReLU, $\sigma$ is sigmoid, and $s_c$ is the learned importance weight for channel $c$. SE-Nets won the ILSVRC 2017 image classification challenge and showed that global channel-level attention - asking “which feature maps matter for this specific image?” - is as important as local spatial convolutions.
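A NumPy sketch of the squeeze and excite steps for a single feature map. It assumes a bottleneck with reduction ratio `r` (so $W_1$ is $(C/r) \times C$ and $W_2$ is $C \times (C/r)$), omits biases, and uses random weights purely to show the shapes.

```python
import numpy as np

def squeeze_excite(X, W1, W2):
    """X: (C, H, W) feature map. Returns the re-weighted map and the channel gates."""
    z = X.mean(axis=(1, 2))                    # squeeze: global average pool -> (C,)
    hidden = np.maximum(0.0, W1 @ z)           # delta = ReLU, bottleneck of size C/r
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))   # sigma = sigmoid -> gates in (0, 1)
    return X * s[:, None, None], s             # excite: scale each channel by s_c

rng = np.random.default_rng(0)
C, r = 8, 4
X = rng.normal(size=(C, 5, 5))
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))
X_tilde, s = squeeze_excite(X, W1, W2)
print(X_tilde.shape, s.round(2))
```

The gates depend on the whole image through `z`, so the same filter can be amplified for one input and suppressed for another.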

Xiaolong Wang et al. extended this to spatial attention in “Non-local Neural Networks” (CVPR 2018), allowing any position in the feature map to attend to any other position via a dot-product attention mechanism - a direct precursor to the full self-attention of Vision Transformers.


Act V - Attention Takes Over (2020-2022)

The self-attention mechanism replaces local convolution with a global, input-dependent mixing operation. For a sequence of $N$ tokens, the attention-weighted output at position $i$ is a weighted sum of all value vectors:

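\[\mathbf{o}_i = \sum_{j=1}^N \alpha_{ij}\,V_j, \qquad \alpha_{ij} = \frac{\exp(Q_i \cdot K_j / \sqrt{d_k})}{\sum_{m=1}^N \exp(Q_i \cdot K_m / \sqrt{d_k})}\]

where $Q_i$, $K_j$, and $V_j$ are the query, key, and value vectors at positions $i$ and $j$, and $d_k$ is the key dimension.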

This act covers what happened when both vision and text communities committed fully to this mechanism.
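A single-head NumPy sketch of the attention formula above, assuming random projection matrices and no masking, multi-head splitting, or output projection.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over N tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])                # Q_i . K_j / sqrt(d_k)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the softmax
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)      # each row sums to 1
    return alpha @ V, alpha                               # o_i = sum_j alpha_ij V_j

rng = np.random.default_rng(0)
N, d, d_k = 4, 8, 8
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
out, alpha = self_attention(X, Wq, Wk, Wv)
print(out.shape, alpha.sum(axis=1))
```

The mixing weights `alpha` are recomputed from the input on every call, which is what "input-dependent" means here; a convolution would use the same weights for every input.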

Vision Meets the Transformer

Alexey Dosovitskiy et al.’s Vision Transformer (“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021) is the moment when the NLP transformer became a general-purpose visual encoder. The key operation is to split an image into non-overlapping $16 \times 16$ pixel patches, linearly project each patch to a $d$-dimensional vector, prepend a learnable [CLS] token, add sinusoidal or learned position embeddings, and feed the resulting sequence to a standard transformer encoder:

\[\mathbf{E} = [\mathbf{x}_\text{cls};\; P_1 \mathbf{E}_p;\; P_2 \mathbf{E}_p;\; \ldots;\; P_N \mathbf{E}_p] + \mathbf{E}_\text{pos}, \qquad \mathbf{z} = \text{Transformer}(\mathbf{E})[0]\]

where $\mathbf{E}_p \in \mathbb{R}^{(P^2 C) \times d}$ is the patch projection matrix, $P_i$ is the $i$-th flattened patch (a row vector of length $P^2 C$ for patch side $P$ and $C$ channels), $\mathbf{E}_\text{pos} \in \mathbb{R}^{(N+1) \times d}$ is the position embedding, and $\mathbf{z} = \text{Transformer}(\mathbf{E})[0]$ is the [CLS] token’s final-layer hidden state - the global image embedding. The self-attention kernel is:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

where $Q, K, V$ are linear projections of the input sequence. Unlike convolutions, self-attention has no inductive bias toward local structure: a patch in the top-left corner can directly attend to a patch in the bottom-right corner at every layer. This global receptive field from the first layer is both ViT’s strength (long-range dependencies) and its weakness (requires large-scale pretraining to compensate for the lost locality prior).
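The patch-splitting and projection step described above can be sketched in NumPy. The sketch assumes a 224x224 RGB image, 16x16 patches, and a small embedding width of 64 for illustration; the weights are random or zero placeholders, and the transformer itself is not included.

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image sides must be multiples of P"
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (rows, cols, P, P, C)
    return x.reshape(-1, P * P * C)       # (N, P^2 * C)

def embed(image, P, E_p, x_cls, E_pos):
    """Build the transformer input: [CLS] + projected patches + position embeddings."""
    tokens = patchify(image, P) @ E_p            # (N, d)
    seq = np.vstack([x_cls[None, :], tokens])    # (N + 1, d)
    return seq + E_pos

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
P, d = 16, 64
patches = patchify(image, P)
E = embed(image, P, rng.normal(size=(P * P * 3, d)), np.zeros(d), np.zeros((197, d)))
print(patches.shape, E.shape)   # (196, 768) (197, 64)
```

After this step the image is just a sequence of 197 vectors, indistinguishable in form from a sentence of 197 token embeddings.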

Ze Liu et al.’s Swin Transformer (“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021) reintroduced a hierarchical structure by computing self-attention within non-overlapping local windows and shifting the windows between layers to allow cross-window interaction. This recovers the multi-scale feature hierarchy of ResNets while retaining the transformer’s attention mechanism.

[Figure: ViT pipeline. Image -> 16x16 patches -> linear projection (P^2*C -> d) -> + [CLS] token + position embedding -> transformer encoder (L layers of self-attention + MLP blocks) -> [CLS] output -> image vector z in R^d. Each patch token attends to every other patch token at every layer, with no locality constraint.]
ViT's image-to-embedding pipeline. An image is split into non-overlapping patches, each projected to a d-dimensional vector. A [CLS] token is prepended; the whole sequence passes through a standard transformer. The [CLS] output is the global image embedding.

Dense Text Encoders

On the text side, the problem in 2019-2022 was different: BERT produces excellent token-level embeddings but poor sentence-level embeddings out of the box. Mean-pooled or [CLS] BERT vectors often perform worse than averaged GloVe vectors on semantic similarity tasks, because BERT was trained to predict masked tokens - not to produce metrically meaningful sentence vectors.

Nils Reimers and Iryna Gurevych’s Sentence-BERT (“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, EMNLP 2019) fixed this with a siamese fine-tuning setup: pass two sentences through the same BERT, mean-pool their token embeddings, and train the pooled vectors on labeled sentence-pair data (NLI, STS) so that cosine similarity tracks semantic similarity. The original paper used classification, regression, and triplet objectives; a contrastive formulation in the same spirit is:

\[\mathcal{L}_\text{SBERT} = -\log\frac{\exp\!\left(\cos(\mathbf{u}, \mathbf{v}^+)/\tau\right)}{\exp\!\left(\cos(\mathbf{u}, \mathbf{v}^+)/\tau\right) + \exp\!\left(\cos(\mathbf{u}, \mathbf{v}^-)/\tau\right)}\]

where $\mathbf{u}$ is the anchor sentence embedding, $\mathbf{v}^+$ is a semantically similar sentence, and $\mathbf{v}^-$ is a non-similar sentence. This seemingly small change - fine-tune for the metric you care about - produced sentence encoders that enabled real-time semantic similarity at scale for the first time.

Tianyu Gao, Xingcheng Yao, and Danqi Chen’s SimCSE (“SimCSE: Simple Contrastive Learning of Sentence Embeddings”, EMNLP 2021) removed even the need for labeled pairs by using dropout noise as the data augmentation: pass the same sentence through BERT twice with different dropout masks and treat the two resulting vectors as a positive pair.


Act VI - Self-Supervised Encoders (2020-2023)

A recurring bottleneck of the acts so far, especially on the vision side, was labels. Self-supervised learning removes it by defining a pretext task from the data itself: for an input $x$ and its augmented view $\tilde{x}$, train $f_\theta$ so that $f_\theta(x) \approx f_\theta(\tilde{x})$ while $f_\theta(x) \not\approx f_\theta(y)$ for unrelated $y$. No human annotation is required at pre-training time.

Contrastive Vision

The field converged on two families of label-free objectives: contrastive learning and masked prediction.

Kaiming He et al.’s MoCo (“Momentum Contrast for Unsupervised Visual Representation Learning”, CVPR 2020) and Ting Chen et al.’s SimCLR (“A Simple Framework for Contrastive Learning of Visual Representations”, ICML 2020) formalized contrastive self-supervision for vision. The core idea: apply two random augmentations to an image, encode both through a network (possibly with a projection head), and train the representations to be similar (positive pair) while being dissimilar from representations of other images (negatives). SimCLR’s NT-Xent loss:

\[\ell(i, j) = -\log\frac{\exp\!\left(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\right)}{\displaystyle\sum_{k=1}^{2N}\mathbf{1}_{[k \ne i]}\exp\!\left(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau\right)}\]

where $\mathbf{z}_i$ and $\mathbf{z}_j$ are the projected embeddings of the two augmented views of the same image, $\text{sim}$ is cosine similarity, $N$ is the batch size, and $\tau$ is a temperature. The denominator sums over the $2N - 1$ other embeddings in the batch: the positive $\mathbf{z}_j$ plus $2N - 2$ negatives.
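A NumPy sketch of NT-Xent averaged over all $2N$ anchors. It assumes the batch is laid out so that rows $2k$ and $2k+1$ are the two views of image $k$; that layout, the function name, and the toy vectors are choices made for this example.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over a (2N, d) batch where rows 2k and 2k+1 are views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot product
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # the indicator 1[k != i]
    log_denom = np.log(np.exp(sim).sum(axis=1))
    partner = np.arange(len(z)) ^ 1                    # 0<->1, 2<->3, ...
    log_num = sim[np.arange(len(z)), partner]
    return (log_denom - log_num).mean()

aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
scrambled = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(nt_xent(aligned, tau=1.0), nt_xent(scrambled, tau=1.0))
```

When the two views of each image coincide the loss is low; when each view sits next to a different image's view instead, the loss rises, which is the pull-together, push-apart behavior in one number.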

Mathilde Caron et al.’s DINO (“Emerging Properties in Self-Supervised Vision Transformers”, ICCV 2021) applied a teacher-student distillation approach without negative pairs: a student ViT is trained to predict the output of a momentum-updated teacher ViT on different views of the same image. The teacher’s weights are an exponential moving average of the student’s. DINO produced ViT representations that, when visualized via attention maps, showed explicit object segmentation structure that emerged without any segmentation supervision.

Kaiming He et al. took a different angle with MAE (“Masked Autoencoders Are Scalable Vision Learners”, CVPR 2022): mask a large fraction (75%) of the image patches and train the encoder-decoder to reconstruct the masked pixels. The reconstruction loss:

\[\mathcal{L}_\text{MAE} = \frac{1}{|\mathcal{M}|}\sum_{p \in \mathcal{M}}\left\|\mathbf{x}_p - \hat{\mathbf{x}}_p\right\|_2^2\]

where $\mathcal{M}$ is the set of masked patch indices and $\hat{\mathbf{x}}_p$ is the decoder’s prediction for patch $p$. The encoder only sees the visible patches; the decoder reconstructs all patches. At inference time the decoder is discarded and the encoder alone is used. Maxime Oquab et al. combined DINO-style self-distillation with large-scale curated data in “DINOv2: Learning Robust Visual Features without Supervision” (TMLR 2024), producing image encoders that match or exceed supervised ImageNet encoders on linear probing benchmarks.
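The MAE masking and loss described above can be sketched without any network. The code assumes 196 flattened patches, a 75% mask ratio, and a stand-in reconstruction array; the encoder and decoder are not implemented and the helper names are illustrative.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Split patches into a visible subset (for the encoder) and masked indices."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx, mask_idx = np.sort(perm[:n_keep]), np.sort(perm[n_keep:])
    return patches[keep_idx], keep_idx, mask_idx

def mae_loss(patches, reconstruction, mask_idx):
    """Mean squared L2 error over masked patches only."""
    diff = patches[mask_idx] - reconstruction[mask_idx]
    return (diff ** 2).sum(axis=1).mean()

rng = np.random.default_rng(1)
patches = rng.normal(size=(196, 768))
visible, keep_idx, mask_idx = random_masking(patches)
print(visible.shape, len(mask_idx))                    # (49, 768) 147
print(mae_loss(patches, np.zeros_like(patches), mask_idx))
```

Only 49 of 196 patches reach the encoder, which is where most of the pre-training speedup comes from.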

[Figure: SimCLR. Original image x -> two augmented views (crop, flip, color) -> shared-weight encoder f -> projection head g -> z_i, z_j in R^128 -> NT-Xent loss pulls z_i and z_j together; the 2N-2 embeddings of other images in the batch serve as negatives.]
Contrastive self-supervised learning (SimCLR). Two augmented views of the same image are encoded and projected, then pulled together in representation space. All other images in the batch serve as negatives, pushed apart.

Label-Free Text Representation

The text side of this era extended the contrastive idea to sentence encoders. SimCSE (EMNLP 2021) showed that passing the same sentence through BERT twice with different dropout masks produces a sufficient positive pair signal. Liang Wang et al.’s E5 (“Text Embeddings by Weakly-Supervised Contrastive Pre-training”, arXiv 2022, not peer-reviewed) scaled this further: curate a web-scale collection, on the order of a billion weakly labeled text pairs (e.g., title-body pairs from web pages, question-answer pairs from forums), and contrastively pre-train a text encoder on them. GTE (Zehan Li et al., “Towards General Text Embeddings with Multi-stage Contrastive Learning”, arXiv 2023, not peer-reviewed) proposed a multi-stage training curriculum: large-batch pre-training on weak pairs, followed by fine-tuning on curated high-quality pairs.

The alignment-uniformity framework (Wang and Isola, “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”, ICML 2020) provides a clean decomposition of what a good embedding space requires:

\[\mathcal{L}_\text{align} = \mathbb{E}_{(x,y)\sim p_\text{pos}}\!\left[\|\mathbf{f}(x) - \mathbf{f}(y)\|_2^2\right]\]

\[\mathcal{L}_\text{uniform} = \log\,\mathbb{E}_{x,y \sim p_\text{data}}\!\left[e^{-2\|\mathbf{f}(x) - \mathbf{f}(y)\|_2^2}\right]\]

Good encoders minimize the alignment loss (pull positive pairs close) while also minimizing the uniformity loss (spread embeddings evenly over the unit hypersphere, which preserves as much information as possible).
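Both quantities are easy to estimate on a batch. The sketch below assumes unit-norm embeddings and estimates the uniformity expectation over all distinct pairs in the batch; the toy point sets are chosen to show the two extremes.

```python
import numpy as np

def align_loss(fx, fy):
    """Mean squared distance between embeddings of positive pairs."""
    return ((fx - fy) ** 2).sum(axis=1).mean()

def uniform_loss(f, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs in the batch."""
    sq = ((f[:, None, :] - f[None, :, :]) ** 2).sum(axis=2)
    iu = np.triu_indices(len(f), k=1)           # each unordered pair once
    return np.log(np.exp(-t * sq[iu]).mean())

collapsed = np.tile([[1.0, 0.0]], (4, 1))       # every input maps to one point
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(align_loss(collapsed, collapsed), uniform_loss(collapsed))
print(uniform_loss(spread))
```

The collapsed encoder has perfect alignment but the worst possible uniformity (zero, the maximum of the loss), which is why neither term is sufficient alone.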


Act VII - The Multimodal Convergence (2021-present)

A vision encoder $f_I: \mathcal{X}_I \to \mathbb{R}^d$ and a text encoder $f_T: \mathcal{X}_T \to \mathbb{R}^d$ are architecturally identical - both map sequences of tokens to vectors via attention. The question is whether $\mathbb{R}^d$ can be the same space for both. This act shows that the answer is yes, and that aligning the two spaces with paired data yields capabilities that neither encoder possesses alone.

CLIP and the Shared Space

Alec Radford et al.’s CLIP (“Learning Transferable Visual Models From Natural Language Supervision”, ICML 2021) is the clearest possible statement that a vision encoder and a text encoder are the same kind of object. CLIP trains a vision encoder $f_I$ and a text encoder $f_T$ jointly to align matched image-text pairs in a shared $d$-dimensional space, using a contrastive objective over a batch of $N$ pairs:

\[\mathcal{L}_\text{CLIP} = -\frac{1}{2N}\sum_{i=1}^N\left[\log\frac{\exp\!\left(\mathbf{I}_i \cdot \mathbf{T}_i/\tau\right)}{\sum_{j=1}^N \exp\!\left(\mathbf{I}_i \cdot \mathbf{T}_j/\tau\right)} + \log\frac{\exp\!\left(\mathbf{T}_i \cdot \mathbf{I}_i/\tau\right)}{\sum_{j=1}^N \exp\!\left(\mathbf{T}_i \cdot \mathbf{I}_j/\tau\right)}\right]\]

where $\mathbf{I}_i = f_I(\text{image}_i) / \|f_I(\text{image}_i)\|_2$ and $\mathbf{T}_i = f_T(\text{text}_i) / \|f_T(\text{text}_i)\|_2$ are L2-normalized embeddings, and $\tau$ is a learned temperature. Together the two terms form a symmetric cross-entropy: the first scores image-to-text retrieval (each image must pick its caption out of the batch), the second scores text-to-image retrieval (each caption must pick its image). Trained on 400 million web-scraped image-caption pairs, CLIP’s shared space enables zero-shot image classification: embed each class name as text and assign the image to the class whose text embedding is nearest.
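
The symmetric objective can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not CLIP's training code: the function names are assumptions, the embeddings are random stand-ins for encoder outputs, and the temperature is held at a fixed value rather than learned.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss; row i of image_emb is paired with row i of text_emb.

    tau is a fixed constant here (assumption); in CLIP it is a learned parameter.
    """
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = I @ T.T / tau                     # logits[i, j] = I_i . T_j / tau
    diag = np.arange(len(logits))
    # Image-to-text: normalize each row over all captions in the batch.
    image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: normalize each column over all images in the batch.
    text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (image_to_text + text_to_image))

# Random stand-ins for encoder outputs (assumption).
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
captions = images + 0.1 * rng.normal(size=images.shape)   # well-aligned pairs

print("aligned pairs: ", clip_loss(images, captions))
print("shuffled pairs:", clip_loss(images, captions[::-1]))
```

Reversing the caption order breaks the pairing, so the second loss should come out much larger than the first.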

Chao Jia et al.’s ALIGN (“Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision”, ICML 2021) independently reached a similar result at even larger scale (1.8 billion noisy pairs), suggesting that scale can compensate for label noise.

Figure: CLIP's dual-encoder architecture. An image encoder (ViT or ResNet) and a text encoder (Transformer) project their inputs to normalized vectors $\mathbf{I}_1, \ldots, \mathbf{I}_N$ and $\mathbf{T}_1, \ldots, \mathbf{T}_N$ in a shared d-dimensional space, where image and text embeddings are comparable via dot product. The NxN similarity matrix $\mathbf{I}\mathbf{T}^\top$ is trained with InfoNCE so the diagonal (matched pairs) is large and the off-diagonal (mismatched pairs, used as negatives) is small.

Sigmoid Loss and Beyond

Xiaohua Zhai et al.’s SigLIP (“Sigmoid Loss for Language Image Pre-Training”, ICCV 2023) replaced CLIP’s softmax (which normalizes across the full batch) with a per-pair sigmoid binary cross-entropy, treating each image-text pair as an independent binary classification problem:

\[\mathcal{L}_\text{SigLIP} = -\frac{1}{N^2}\sum_{i,j}\left[y_{ij}\log\sigma\!\left(\mathbf{I}_i \cdot \mathbf{T}_j/\tau\right) + (1-y_{ij})\log\sigma\!\left(-\mathbf{I}_i \cdot \mathbf{T}_j/\tau\right)\right]\]

where $y_{ij} = 1$ if pair $(i,j)$ is a match (in a standard batch, $i = j$) and 0 otherwise. The published formulation also adds a learnable bias to each logit, which offsets the heavy imbalance between the $N$ positives and the $N^2 - N$ negatives at the start of training. The sigmoid loss has a key practical advantage: each pair is scored independently, so the loss needs no normalization across the batch and does not depend on a large batch to supply enough negatives. The SigLIP authors report that it outperforms the softmax loss at small batch sizes.

Junnan Li et al.’s BLIP-2 (“BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models”, ICML 2023) introduced a lightweight Querying Transformer (Q-Former) that bridges a frozen image encoder and a frozen LLM, enabling visual question answering and captioning without fine-tuning either large model.
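
To make the per-pair sigmoid loss concrete, here is a small NumPy sketch of the $\mathcal{L}_\text{SigLIP}$ equation as written in this section. The function names are illustrative, and the `bias` argument is an added assumption (with `bias=0.0` the function reduces to the formula in the text); it is not a reproduction of the official implementation.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def sigmoid_pair_loss(image_emb, text_emb, tau=0.1, bias=0.0):
    """Binary cross-entropy over all N*N image-text pairs, averaged over pairs.

    Row i of image_emb matches row i of text_emb; every other pair is a negative.
    tau and bias are fixed constants here (assumption), not learned parameters.
    """
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = I @ T.T / tau + bias
    labels = np.eye(len(logits))               # y_ij = 1 on the diagonal, 0 elsewhere
    per_pair = labels * log_sigmoid(logits) + (1.0 - labels) * log_sigmoid(-logits)
    return float(-per_pair.mean())

# Random stand-ins for encoder outputs (assumption).
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
captions = images + 0.1 * rng.normal(size=images.shape)

print("aligned pairs: ", sigmoid_pair_loss(images, captions))
print("shuffled pairs:", sigmoid_pair_loss(images, captions[::-1]))
```

Because no term depends on a sum over the batch, the value for one pair does not change when other pairs are added or removed, which is the independence property the paragraph above describes.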


The Frontier (2025)

The frontier in 2025 is characterized by three shifts that are reshaping what an encoder is.

First, the scale and data curation of both text and vision encoders have grown substantially. SigLIP-2 (Tschannen et al., “SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding”, arXiv 2025, not peer-reviewed) extends SigLIP with masked prediction, caption generation, and self-distillation in a single multi-task objective, producing stronger localization and segmentation from the same encoder. On the text side, large language models are increasingly used as text encoders directly: E5-Mistral-7B (Wang et al., arXiv 2024, not peer-reviewed) and NV-Embed-v2 (Lee et al., arXiv 2024, not peer-reviewed) fine-tune Mistral-7B as a dense retrieval encoder, and each reached the top of the MTEB leaderboard at the time of its release by leveraging the LLM’s broad linguistic coverage.

Second, the shared space is expanding beyond image and text. ImageBind (Girdhar et al., “ImageBind: One Embedding Space to Bind Them All”, CVPR 2023) extends the CLIP idea to six modalities simultaneously: images (including video), text, audio, depth, thermal, and IMU. The shared space is learned by pairing each modality with images; modalities that are never directly paired (e.g., audio and depth) still end up geometrically aligned through their common anchor in image space.

Third, long-context encoding is becoming a first-class problem. Standard transformer encoders are typically limited to sequence lengths of 512 to 8192 tokens because attention cost grows as $\mathcal{O}(L^2)$. Retrieval applications increasingly require encoding multi-page documents or high-resolution images. Late-interaction models (ColBERT, ColPali) sidestep the single-vector bottleneck by storing per-token embeddings and computing similarity at query time rather than compressing a whole document to one vector.

\[\text{ColBERT score}(q, d) = \sum_{i \in E_q} \max_{j \in E_d} \mathbf{e}_{q,i}^\top \mathbf{e}_{d,j}\]

where $E_q$ and $E_d$ are the per-token embedding sets of the query and document respectively. This MaxSim operation retains the expressive power of token-level representations while staying in a retrieval framework.
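
The MaxSim score above is short enough to sketch directly. In the snippet below the token embeddings are tiny hand-written arrays chosen so the result can be checked by hand; `maxsim_score` is an illustrative name, not an API from the ColBERT codebase.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token take its best-matching
    document token (max dot product), then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T          # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy per-token embeddings (assumption): 2 query tokens, documents of different lengths.
query = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])                 # covers both query tokens
doc_b = np.array([[1.0, 0.0]])                 # covers only the first query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Each document is stored as a matrix rather than a single vector, which is exactly the storage cost that late interaction trades for token-level matching.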


Open Problems

The ideal encoder satisfies two properties simultaneously: alignment (paired inputs are nearby) and uniformity (all embeddings are spread uniformly on $\mathbb{S}^{d-1}$, the unit hypersphere). Formally, for a pair distribution $p_\text{pos}$ and a data distribution $p_\text{data}$:

\[\mathcal{L}^* = \underbrace{\mathbb{E}_{(x,y)\sim p_\text{pos}}\!\left[\|f(x) - f(y)\|^2\right]}_{\text{alignment}} + \underbrace{\log\,\mathbb{E}_{x,y\sim p_\text{data}}\!\left[e^{-2\|f(x)-f(y)\|^2}\right]}_{\text{uniformity}}\]

None of the following problems are fully solved by minimizing this objective.

The Isotropy Collapse Problem

A well-trained encoder should distribute embeddings uniformly across the unit hypersphere, maximizing the information capacity of the representation space. In practice, most encoders - including fine-tuned BERT variants - suffer from anisotropy: embeddings cluster in a narrow cone, so many pairs have unexpectedly high cosine similarity. The mean pairwise cosine similarity:

\[\bar{c} = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{\mathbf{e}_i^\top \mathbf{e}_j}{\|\mathbf{e}_i\|\|\mathbf{e}_j\|}\]

approaches 1 as embeddings collapse. The current mitigations (whitening, SimCSE’s dropout contrastive objective) help but do not fully solve the problem, particularly after domain-specific fine-tuning.
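
A quick diagnostic along these lines can be sketched with NumPy. The version below averages over distinct pairs only ($i \ne j$), so the trivial self-similarities of 1 do not inflate the value, and it uses synthetic data in which a shared offset stands in for the narrow-cone effect. Mean-centering is shown as the simplest first step of whitening; all names and data here are illustrative assumptions.

```python
import numpy as np

def mean_pairwise_cosine(emb):
    """Average cosine similarity over all ordered pairs of distinct rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(emb)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

# Synthetic embeddings (assumption): a shared offset pushes every vector into a narrow cone.
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 32))
anisotropic = isotropic + 5.0
centered = anisotropic - anisotropic.mean(axis=0)   # mean-centering, the first step of whitening

print("isotropic:  ", mean_pairwise_cosine(isotropic))
print("anisotropic:", mean_pairwise_cosine(anisotropic))
print("centered:   ", mean_pairwise_cosine(centered))
```

On real encoders the offset is rarely this clean, which is one reason centering and whitening help without fully removing the problem.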

Compositionality

Current encoders do not reliably compose: representing “red circle” does not involve structurally combining the representation of “red” and the representation of “circle.” The result is that encoders often fail on negation, counting, and relational reasoning. For an encoder $f$ and a composition operator $\oplus$:

\[f(A \oplus B) \ne g(f(A),\, f(B)) \quad \text{for any fixed } g\]

No known learned $g$ generalizes across all compositions. This is particularly acute for CLIP, which often fails to separate a true caption from a hard negative that differs only by one swapped attribute.

Distribution Shift in Embedding Spaces

An encoder trained on domain $\mathcal{D}_\text{train}$ produces embeddings whose geometry reflects that domain’s co-occurrence statistics. When applied to $\mathcal{D}_\text{test}$ with different statistics, the embedding space contracts or distorts, degrading retrieval precision. As a rough heuristic rather than a proven law, the gap grows with the KL divergence between the two domains:

\[\Delta\text{Precision@}k \propto D_\text{KL}(\mathcal{D}_\text{train} \,\|\, \mathcal{D}_\text{test})\]

The current solution is domain-specific fine-tuning, but this requires labeled in-domain pairs and the resulting encoder degrades on the original domain - a catastrophic forgetting problem.

Long-Context Encoding

Compressing an arbitrarily long document into a single fixed-size vector while preserving the information relevant to an arbitrary query is an open problem. Formally, define the ideal encoder as:

\[\mathbf{z}^* = \arg\min_{\mathbf{z} \in \mathbb{R}^d} \max_{q \in \mathcal{Q}} \left|\text{relevance}(q, \text{doc}) - \text{sim}(\mathbf{z}_q, \mathbf{z})\right|\]

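where $\mathbf{z}_q$ is the embedding of query $q$ and $\mathbf{z}$ is the single vector stored for the document. In general, no fixed-size vector $\mathbf{z}$ can serve every query in $\mathcal{Q}$ equally well once the document carries more information than $d$ dimensions can hold. Multi-vector encoders (ColBERT-style) trade storage efficiency for expressiveness; a principled solution that is both storage-efficient and query-optimal does not yet exist.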

Grounded Meaning

Distributional encoders learn the statistical regularities of token co-occurrences; they have no grounding in perception or action. The word “red” has a vector near “color” and “scarlet”, but the encoder has no structural relationship to the electromagnetic wavelength $\lambda \approx 700\,\text{nm}$ or the physical sensation of redness. The grounding gap can be sketched informally as:

\[\text{enc}(\text{"red"}) \not\mapsto \text{any perceptual feature}, \quad \mathbb{E}\!\left[\cos(f(\text{"red"}), f(\text{"wavelength 700nm"}))\right] \approx 0\]

Multimodal encoders (CLIP) partially address this by aligning text with images, but the result is still statistical regularities in co-occurrence - not causal grounding in physics.

Efficient Long-Sequence Attention

Standard self-attention takes $\mathcal{O}(L^2 d)$ time, and a naive implementation stores an $L \times L$ attention matrix per head, where $L$ is the sequence length and $d$ is the model width. For high-resolution images or long documents, this is prohibitive. Sparse attention, linear attention, and state-space models attempt to reduce this:

\[\text{Complexity: } \mathcal{O}(L^2 d) \xrightarrow{\text{goal}} \mathcal{O}(L d \log L) \text{ or } \mathcal{O}(L d)\]

But no sub-quadratic attention mechanism has yet consistently matched the quality of full self-attention on general encoding tasks. The trade-off between approximation quality and computational cost remains the central bottleneck for encoding long-context inputs.


Fun Projects for Your Portfolio

Each project below is scoped to be doable on a single GPU in days to weeks. For each one the key metric $m$ to optimize and report is given as an equation; the Portfolio signal line states what the project demonstrates to a hiring engineer. A good target for any retrieval project is $\text{R@1} \ge 0.5$ on a held-out test set.

Build word2vec from Scratch

Implement the skip-gram model with negative sampling from scratch in PyTorch, train it on a medium-sized corpus (e.g., Simple English Wikipedia), and evaluate on the standard word analogy benchmark. Visualize the embedding space with t-SNE and verify that arithmetic structure (king - man + woman ≈ queen) holds in your trained space.

\[\text{analogy accuracy} = \frac{\left|\left\{(a,b,c,d): \arg\max_{w \notin \{a,b,c\}} \cos\!\left(\mathbf{v}_w,\, \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c\right) = d\right\}\right|}{|\text{test quadruples}|}\]
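
The evaluation loop for this metric might look like the sketch below. It scores each candidate word by cosine similarity to $\mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c$ and excludes the three query words. The four-dimensional vectors are hand-built toy values (an assumption for illustration), so the analogies hold by construction; with trained embeddings the same functions would run unchanged.

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return the word d that maximizes cos(v_d, v_b - v_a + v_c), excluding a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy_accuracy(vectors, quadruples):
    """Fraction of (a, b, c, d) quadruples for which the predicted word equals d."""
    hits = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in quadruples)
    return hits / len(quadruples)

# Toy embeddings (assumption); axes are roughly: male, female, royal, fruit.
toy_vectors = {
    "man":   np.array([1.0, 0.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0, 0.0]),
    "queen": np.array([0.0, 1.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}
test_quadruples = [("man", "king", "woman", "queen"),
                   ("king", "queen", "man", "woman")]

print(solve_analogy(toy_vectors, "man", "king", "woman"))   # queen
print(analogy_accuracy(toy_vectors, test_quadruples))        # 1.0
```

For a full vocabulary, the Python loop would normally be replaced by one matrix product against a row-normalized embedding matrix.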

Portfolio signal: Demonstrates that you understand self-supervised training objectives, distributed word representations, and the relationship between co-occurrence statistics and geometric structure.

CLIP Image-Text Retrieval System

Fine-tune or zero-shot evaluate OpenCLIP on a domain-specific image-text dataset (e.g., fashion, medical imaging, or satellite imagery). Build a retrieval interface that supports both image-to-text and text-to-image queries. Report Recall@k for $k \in \{1, 5, 10\}$, where $\text{rank}(q)$ is the position of the correct match in the ranked results for query $q$:

\[\text{R@}k = \frac{1}{|Q|}\sum_{q \in Q} \mathbf{1}\!\left[\text{rank}(q) \le k\right]\]
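
Given a query-by-item similarity matrix, Recall@k takes only a few lines. The sketch below assumes the simplest setup, where the correct item for query $q$ sits at index $q$ (one caption per image); the function name and the toy matrix are illustrative.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[q, i] is the similarity between query q and item i.

    Assumes the correct item for query q is item q. Ties are resolved
    optimistically: only strictly higher scores push the correct item down.
    """
    correct = np.diag(sim)[:, None]
    ranks = 1 + (sim > correct).sum(axis=1)    # 1-based rank of the correct item
    return float(np.mean(ranks <= k))

# Toy similarity matrix (assumption): queries 0 and 2 rank their match first, query 1 ranks it third.
toy_sim = np.array([[0.9, 0.1, 0.2],
                    [0.8, 0.3, 0.5],
                    [0.1, 0.2, 0.7]])

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_sim, k):.3f}")
```

For text-to-image retrieval the same function can be applied to the transposed matrix.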

Portfolio signal: End-to-end understanding of dual-encoder training, contrastive objectives, and embedding-space retrieval at scale.

Domain-Specific Sentence Encoder Fine-Tuning

Take a general-purpose sentence encoder (e.g., all-MiniLM-L6-v2) and fine-tune it on a niche domain with limited labeled data (e.g., legal contract clauses, biomedical abstracts). Measure the improvement on domain-specific semantic similarity benchmarks:

\[\Delta\text{MAP} = \text{MAP}_\text{fine-tuned} - \text{MAP}_\text{base}\]

Portfolio signal: Shows that you can adapt a pre-trained encoder to a new domain without destroying its general-purpose structure, which is the central skill in applied retrieval engineering.

ViT Attention Rollout Visualization

Implement attention rollout (Abnar and Zuidema, ACL 2020) for a pre-trained ViT: average the attention heads in each layer, add the identity to account for residual connections, renormalize each row, and multiply the resulting matrices across layers, with later layers on the left:

\[A_\text{rollout}^{(L)} = \hat{A}^{(L)}\,\hat{A}^{(L-1)} \cdots \hat{A}^{(1)}, \quad \hat{A}^{(l)}_{ij} = \frac{\left(A^{(l)} + I\right)_{ij}}{\sum_{j'} \left(A^{(l)} + I\right)_{ij'}}\]

Visualize the attention from the [CLS] token to all patch tokens overlaid on the original image. Compare a supervised ViT-B/16, a DINO ViT-B/16, and a DINOv2 ViT-B/14 on the same images.
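
A minimal NumPy sketch of the rollout computation follows, using the formulation from Abnar and Zuidema: head-averaged attention plus identity, rows renormalized, and later layers multiplied on the left. The attention tensors are random stand-ins (an assumption); in practice they would be extracted from a real ViT, and token 0 is assumed to be [CLS].

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of arrays shaped (heads, tokens, tokens), ordered first to last layer.

    Returns a (tokens, tokens) matrix whose row i says how much each input token
    contributes to token i at the final layer under the rollout approximation.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                    # average over heads
        a = a + np.eye(num_tokens)                     # residual connection
        a = a / a.sum(axis=1, keepdims=True)           # renormalize rows to sum to 1
        rollout = a @ rollout                          # later layers multiply on the left
    return rollout

# Random stand-in attention maps (assumption): 3 layers, 2 heads, 1 [CLS] token + a 2x2 patch grid.
rng = np.random.default_rng(0)
num_layers, num_heads, num_tokens = 3, 2, 5
logits = rng.normal(size=(num_layers, num_heads, num_tokens, num_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # softmax over keys

rollout = attention_rollout(list(attn))
cls_heatmap = rollout[0, 1:].reshape(2, 2)     # [CLS] row, patch columns, back on the grid
print(cls_heatmap)
```

To overlay the result on an image, the patch grid is usually upsampled to the input resolution and rescaled to [0, 1].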

Portfolio signal: Demonstrates interpretability skills and understanding of how self-supervised vs. supervised training shapes the attention geometry inside transformers.

Frozen vs. Fine-Tuned Encoder Comparison

Take a DINOv2 or CLIP vision encoder and evaluate it across three settings on five different image classification datasets: (a) frozen encoder with linear probe, (b) frozen encoder with k-NN, (c) full fine-tune. Report accuracy under each regime:

\[\text{Linear probing accuracy} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\!\left[\mathbf{w}_{y_i}^\top \mathbf{z}_i > \mathbf{w}_c^\top \mathbf{z}_i\;\forall c \ne y_i\right]\]

Portfolio signal: Shows that you understand the difference between representation quality (linear probing) and task-specific adaptation (fine-tuning), and when each is the right choice.

MAE Pre-Training on a Small Dataset

Implement MAE from scratch and pre-train a ViT-Small on a domain-specific image dataset with fewer than 100k images (e.g., plant disease images, satellite patches). Evaluate by fine-tuning with varying fractions of labeled data and compare to training from scratch:

\[\mathcal{L}_\text{MAE} = \frac{1}{|\mathcal{M}|}\sum_{p \in \mathcal{M}}\left\|\mathbf{x}_p - \hat{\mathbf{x}}_p\right\|_2^2\]
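
The detail that is easiest to get wrong is that the loss covers masked patches only. The sketch below shows that selection step in NumPy; the patch arrays are random stand-ins, the helper names are illustrative, and the 75% mask ratio follows the default reported in the MAE paper.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True for patches hidden from the encoder."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def mae_loss(patches, reconstruction, mask):
    """Squared L2 error per patch, averaged over masked patches only."""
    per_patch = np.sum((patches - reconstruction) ** 2, axis=1)
    return float(per_patch[mask].mean())

# Random stand-ins (assumption): 16 patches, each flattened to 48 values (4x4 pixels, 3 channels).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 48))
mask = random_mask(num_patches=16, mask_ratio=0.75, rng=rng)

# A "reconstruction" that is perfect on masked patches and wrong on visible ones.
perfect_on_masked = np.where(mask[:, None], patches, 0.0)
print(mae_loss(patches, perfect_on_masked, mask))       # 0.0: visible patches are ignored
print(mae_loss(patches, np.zeros_like(patches), mask))  # positive
```

Errors on visible patches do not enter the loss, which is why the first call returns zero even though a quarter of the reconstruction is wrong.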

Portfolio signal: Demonstrates that you can apply self-supervised pre-training in the low-data regime, which is the realistic setting for most real-world vision applications outside ImageNet-scale domains.

Cross-Modal Search Engine

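Build a search engine that accepts either an image or a text query and retrieves results from a multi-modal corpus (images with captions). Index a dataset of 50,000+ image-caption pairs with FAISS using CLIP embeddings. Evaluate retrieval quality with mean Average Precision at 10, where $\text{prec}_q(k)$ is the precision of the top $k$ results for query $q$, $\text{rel}_q(k)$ indicates whether the $k$-th result is relevant, and $R_q$ is the number of relevant items for $q$:

\[\text{mAP@10} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\min(10, R_q)}\sum_{k=1}^{10} \text{prec}_q(k) \cdot \text{rel}_q(k)\]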
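
One way to compute this metric from ranked relevance flags is sketched below in plain Python. It normalizes each query by the smaller of $k$ and the number of relevant items, so a query with a single relevant item can still reach 1.0. The function names and toy rankings are illustrative assumptions, and other normalization conventions exist.

```python
def average_precision_at_k(relevance, k=10, num_relevant=None):
    """relevance: 0/1 flags for one query's ranked results, best-ranked first.

    num_relevant is the total number of relevant items for the query; if omitted,
    it is taken to be the number of relevant items present in the list.
    """
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank               # precision at this rank
    if num_relevant is None:
        num_relevant = sum(relevance)
    denom = min(k, num_relevant)
    return score / denom if denom else 0.0

def mean_average_precision(ranked_relevance, k=10):
    """Mean of AP@k over a list of per-query relevance lists."""
    return sum(average_precision_at_k(r, k) for r in ranked_relevance) / len(ranked_relevance)

# Toy rankings (assumption): query 1 has hits at ranks 1 and 3, query 2 has one hit at rank 2.
toy_rankings = [[1, 0, 1, 0], [0, 1]]
print(average_precision_at_k(toy_rankings[0]))   # (1/1 + 2/3) / 2
print(average_precision_at_k(toy_rankings[1]))   # 0.5
print(mean_average_precision(toy_rankings))
```

With FAISS, the relevance flags would come from comparing the returned item ids against the ground-truth ids for each query.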

Portfolio signal: End-to-end applied ML engineering: embedding extraction, approximate nearest neighbor indexing, and evaluation methodology for information retrieval systems.


This survey traced the encoder lineage from Lowe’s 2004 SIFT histogram through word2vec (NeurIPS 2013), ResNet (CVPR 2016), BERT (NAACL 2019), ViT (ICLR 2021), CLIP (ICML 2021), DINO (ICCV 2021), MAE (CVPR 2022), DINOv2 (TMLR 2024), and SigLIP-2 (arXiv 2025). The frontier moves at NeurIPS, ICLR, ICML, CVPR, and ACL; for live signals, watch the MTEB leaderboard for text encoders and the ImageNet linear-probing leaderboard for vision backbones.