Layer by Layer: How Words Collapse Into a Sentence Vector

Imagine each word in a sentence as an arrow pointing in some direction in a high-dimensional space. At layer 0 of a transformer, the arrows are scattered — "the", "stock", "market", "surged" all point different ways. By the final layer of mxbai-embed-large-v1, all four arrows nearly overlap. Mean cosine similarity between tokens in the same sentence goes from 0.055 to 0.880 across 24 layers.

That convergence is one of the clearest things you can observe inside a sentence embedding model. But here's what surprised me: almost none of it matters for retrieval until the very last five layers. And when you run the same analysis on a ColBERT model, the geometry looks completely opposite — because the training objective demanded it.

I probed mxbai-embed-large-v1 (contrastive, mean-pool) and jina-colbert-v2 (ColBERT / MaxSim) across all 25 depths: the embedding layer plus 24 transformer layers. Two findings stand out. First, the training objective leaves a geometric fingerprint at every layer, not just the output. Second, almost all useful structure is built in the final five layers.

Tokens Collapse — But Slowly, Then All at Once

The blue line below is how similar tokens become to each other within a sentence as you go deeper. The orange line is how similar random, unrelated words are — a baseline for "these shouldn't be similar at all."

A few things to read off this chart:

The gap between the two lines (0.880 − 0.425 = 0.455) is what matters, not the absolute values. Both lines rise — the model is making all tokens more similar globally, which is a known problem with deep transformers. What the model actually encodes is the extra similarity that tokens within the same sentence have over random pairs. That gap is the sentence-level signal.
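To make the two curves concrete, here is a minimal sketch of how the intra-sentence similarity and the gap could be computed for one layer. It assumes the token states for a sentence are already available as an (n_tokens, dim) NumPy array (for example, taken from a Hugging Face model called with `output_hidden_states=True`). The function names are illustrative, not the code behind the chart.

```python
import numpy as np

def mean_pairwise_cosine(tokens: np.ndarray) -> float:
    """Mean cosine similarity over all distinct token pairs; tokens is (n, d)."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(tokens), k=1)  # each pair once, diagonal excluded
    return float(sims[upper].mean())

def sentence_signal(intra: float, baseline: float) -> float:
    """Extra similarity of same-sentence tokens over the random-pair baseline."""
    return intra - baseline

# Toy check: parallel vectors give 1.0, mutually orthogonal vectors give 0.0.
same = np.array([[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]])
ortho = np.eye(3)
print(mean_pairwise_cosine(same), mean_pairwise_cosine(ortho))

# The L24 values quoted in the text.
print(round(sentence_signal(0.880, 0.425), 3))
```

The baseline curve would use the same cosine computation, applied to pairs of tokens drawn from different sentences.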

The trajectory is not smooth. Intra-sentence similarity rises through L3, dips back to 0.209 at L6, recovers, plateaus in L10–L14, dips again around L15, then rockets upward in the final five layers. I'll come back to the dip at the end.

The animation is smooth because transformers use residual connections — each layer adds a delta to the previous layer's representation rather than replacing it. This means token positions shift incrementally in the fixed PCA space. If you re-fit PCA fresh at each layer instead, you'd see spurious ~10× larger apparent jumps from rotational ambiguity in the new axes. The chart below makes this concrete:

The red line is not telling you the tokens moved 12 units — it's telling you the PCA axes rotated. Fixed L0 axes reveal the true geometry: token positions shift smoothly in ~0.3–1.5 unit steps, with the biggest single jump at the final layer (L23→L24 = 0.90 units). Apparent chaos hiding calm underlying movement.
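The difference between fixed and re-fit axes can be sketched with plain NumPy. This is an illustration on synthetic data, not the analysis code: `layer_a` and `layer_b` are random stand-ins for token states at two consecutive layers, and the PCA is a bare SVD rather than any particular library's implementation.

```python
import numpy as np

def fit_pca_axes(x: np.ndarray, k: int = 2):
    """Return (mean, axes); axes is (k, d) with orthonormal rows."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]

def project(x: np.ndarray, mean: np.ndarray, axes: np.ndarray) -> np.ndarray:
    return (x - mean) @ axes.T

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(12, 32))                   # synthetic token states
layer_b = layer_a + 0.1 * rng.normal(size=(12, 32))   # small residual-style update

# Fixed axes: fit once on the earlier layer, reuse for the later one.
mean0, axes0 = fit_pca_axes(layer_a)
fixed_step = np.linalg.norm(
    project(layer_b, mean0, axes0) - project(layer_a, mean0, axes0), axis=1)

# Re-fit axes: each layer gets its own basis, so the apparent step also
# contains whatever sign flip or rotation the new basis introduces.
mean1, axes1 = fit_pca_axes(layer_b)
refit_step = np.linalg.norm(
    project(layer_b, mean1, axes1) - project(layer_a, mean0, axes0), axis=1)

true_step = np.linalg.norm(layer_b - layer_a, axis=1)
print(fixed_step.mean(), refit_step.mean(), true_step.mean())
```

With fixed axes, the projected step can never exceed the true step in the full space, because projecting onto orthonormal axes only shrinks distances. The re-fit step carries no such guarantee.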

What the Model Was Building Toward

Fixed L0 PCA shows where tokens started. But there's a third perspective: project every layer into the L24 PCA space — the axes that capture variance in the final output. This is the teleological view: at each layer, how much of the final semantic structure has already been assembled?

The faint markers are the L24 target positions. The bright markers are where tokens actually are at the current layer. At L0, tokens are far from their targets in the final semantic space — the model has to construct the entire structure through 24 layers of residual additions.

How far do representations travel? The cosine between a sentence's mean-pool at L0 and at L24 is approximately 0.00–0.05 for every sentence, hard or easy alike. In 1024 dimensions, the L0 representation and L24 representation are nearly orthogonal. The model doesn't move along the original direction — it builds in a completely new subspace.

Inside the 1024 Dimensions

The layer-level story is now clear. But we can go one level deeper: instead of asking "which layers matter", ask "which of the 1024 dimensions are doing what, and when?"

When does each dimension receive its biggest contribution? For each of the 1024 dims, find the layer transition that writes the most to it (measured as mean |Δ| across 20 sentences):

749 dimensions receive their single largest write at L24. Another 164 at L23. Combined, that is 913 of 1024 dimensions (89.2%) written most in the final two layers. The late-layer explosion isn't a few semantic dimensions firing up; the model is writing across the entire 1024D space at once.
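A sketch of the "largest write" measurement, assuming the per-layer sentence vectors are stacked into an array of shape (depths, sentences, dims). The array layout and the function name are my assumptions for illustration.

```python
import numpy as np

def dominant_write_layer(states: np.ndarray) -> np.ndarray:
    """For each dim, the layer whose transition changes it most on average.

    states: (depths, sentences, dims). Returns one layer index per dim, where
    index t means the transition from layer t-1 to layer t.
    """
    deltas = np.abs(np.diff(states, axis=0)).mean(axis=1)  # (depths-1, dims)
    return deltas.argmax(axis=0) + 1

# Toy example: 4 depths, 2 sentences, 3 dims, each dim written at a known layer.
states = np.zeros((4, 2, 3))
states[1:, :, 0] = 5.0    # dim 0 is written at layer 1 and then kept
states[3, :, 1] = 2.0     # dim 1 is written at layer 3
states[2:, :, 2] = -1.0   # dim 2 is written at layer 2
layers = dominant_write_layer(states)
counts = np.bincount(layers, minlength=states.shape[0])
print(layers, counts)
```

The histogram in `counts` is the toy analogue of the per-layer tallies above: how many dimensions get their largest write at each layer.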

How much of the L0 signal survives to L24? For each dimension, compute the correlation of its value at L0 vs L24 across 20 sentences:

Mean correlation = 0.026, centered almost exactly at zero. Only 3 dims out of 1024 have r > 0.7. Zero dims have r < −0.7. The model doesn't invert anything — it overwrites almost everything. The final embedding is built in directions that are orthogonal to the input, not a transformation of it.
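The per-dimension survival check is a column-wise Pearson correlation. A minimal sketch, assuming two (sentences, dims) arrays holding the same sentences at the first and last depth; the toy data below is synthetic and only shows what "preserved" and "inverted" dimensions would look like.

```python
import numpy as np

def per_dim_correlation(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson r for each column; a and b are (sentences, dims).

    Assumes no column is constant, otherwise the denominator is zero.
    """
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    denom = np.sqrt((a_c ** 2).sum(axis=0) * (b_c ** 2).sum(axis=0))
    return (a_c * b_c).sum(axis=0) / denom

rng = np.random.default_rng(0)
first = rng.normal(size=(20, 4))
last = rng.normal(size=(20, 4))
last[:, 0] = 2.0 * first[:, 0] + 1.0   # dim 0 preserved up to scale and shift
last[:, 1] = -first[:, 1]              # dim 1 inverted
r = per_dim_correlation(first, last)
print(r)
```

A dimension that was merely rescaled shows r near +1, an inverted one r near -1, and an overwritten one r near 0, which is the case the text reports for almost all of the 1024 dims.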

Which dimensions actually separate "surged" from "plunged"? The top discriminative dims at L24 all follow the same pattern — flat through layer 20, then steep climb at L22–L23:

All five discriminative dims reach half their L24 value at L22 or L23. Through L20 they're flat — the model hasn't yet written "surged vs plunged" into any specific dimension. Then two layers fire and the discrimination is fully established. This is not gradual learning spread across the network; it's a narrow window of 2–3 layers doing all the work.
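The "reaches half its L24 value" criterion can be written down in a few lines. This sketch assumes a per-layer trajectory for one dimension with a positive final value. The trajectory below is made up to show the shape, not measured data.

```python
import numpy as np

def half_rise_layer(trajectory) -> int:
    """First layer index at which the value reaches half of its final value.

    Assumes the final value is positive.
    """
    traj = np.asarray(trajectory, dtype=float)
    reached = traj >= 0.5 * traj[-1]
    return int(np.argmax(reached))  # argmax returns the first True

# Hypothetical flat-then-steep trajectory over six depths.
toy = [0.0, 0.0, 0.01, 0.02, 0.6, 1.0]
print(half_rise_layer(toy))
```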

mxbai uses Matryoshka Representation Learning, so the first 64 dims must be meaningful alone, first 128 dims, etc. This shows up in the discrimination profile:

The 0–64 dim slice is the least discriminative at L24 (cosine 0.895) — by design. It encodes "financial market / rate change" which both sentences share. Adding dimensions makes the pair more separable, with dims 256–512 being the most discriminative (0.804). The Matryoshka compression hierarchy intentionally groups same-domain sentences in the smallest slice.

Note that at L16, the 0–64 slice reaches cosine 0.999: the two sentences are almost indistinguishable in the compressed Matryoshka space. Any system using only 64-dim embeddings would rank "surged" and "plunged" as essentially the same sentence at mid-network. The separation only appears in the final layers and only in the larger slices.
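Slice-level discrimination is just cosine similarity restricted to a range of dimensions. A minimal sketch with toy 8-dim vectors, where the slice boundaries are chosen for illustration and are not the real Matryoshka cut points:

```python
import numpy as np

def slice_cosine(u: np.ndarray, v: np.ndarray, lo: int, hi: int) -> float:
    """Cosine similarity using only dims lo..hi-1 of both vectors."""
    a, b = u[lo:hi], v[lo:hi]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy pair: identical in the first four dims, opposite in the last four.
u = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0])
v = np.array([1.0, 2.0, 3.0, 4.0, -1.0, -1.0, -1.0, -1.0])

print(slice_cosine(u, v, 0, 4))   # shared "topic" slice
print(slice_cosine(u, v, 4, 8))   # discriminative slice
print(slice_cosine(u, v, 0, 8))   # full vector
```

The full-vector cosine sits between the two slices, weighted by how much norm each slice carries, which is why a slice that encodes only shared topic can mask a difference stored elsewhere.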

The Real Work Happens in the Last Five Layers

Here's the most striking chart. It shows how well-separated sentence meanings are from each other at each layer — measured as the ratio of between-sentence spread to within-sentence token scatter. A value of 1.0 means sentences are as spread out as their individual tokens. A value of 3.6 means sentence-level neighborhoods are 3.6× more defined than token-level noise.

Twenty layers of computation — and sentences are barely more separable than at initialization. Then L20–L24 do almost everything: 0.22 → 0.32 → 0.41 → 0.83 → 3.66.
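One plausible way to compute a separability ratio of this kind is sketched below. The exact definition behind the chart isn't spelled out, so this version is an assumption: "between" is the mean distance of sentence centroids from the global centroid, and "within" is the mean distance of tokens from their own sentence centroid.

```python
import numpy as np

def separability_ratio(sentences: list[np.ndarray]) -> float:
    """Between-sentence spread divided by within-sentence token scatter.

    sentences: list of (n_tokens, dim) arrays, one per sentence.
    """
    centroids = np.stack([s.mean(axis=0) for s in sentences])
    between = np.linalg.norm(centroids - centroids.mean(axis=0), axis=1).mean()
    within = np.mean([
        np.linalg.norm(s - c, axis=1).mean()
        for s, c in zip(sentences, centroids)
    ])
    return float(between / within)

# Toy example: two sentences 10 units apart, tokens 1 unit from each centroid.
toy = [
    np.array([[-5.0, 1.0], [-5.0, -1.0]]),
    np.array([[5.0, 1.0], [5.0, -1.0]]),
]
print(separability_ratio(toy))
```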

The same pattern shows up directly in retrieval quality:

Easy sentence pairs (different topics) saturate Recall@3 by layer 3 and stay there. Hard pairs — negations ("stock market surged" vs "plunged"), role reversals ("tech giant acquired rival" vs "rival acquired tech giant") — sit flat at 0.20–0.30 through most of the network, then jump to 0.80 at L24.
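For reference, a minimal Recall@k sketch under the usual definition: the fraction of queries whose correct match appears among the k nearest candidates by cosine similarity. The `gold` indices and the toy vectors are illustrative. Running it per layer would use that layer's pooled sentence vectors.

```python
import numpy as np

def recall_at_k(queries: np.ndarray, docs: np.ndarray, gold, k: int) -> float:
    """Fraction of queries whose gold doc index is in the top-k by cosine."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = q @ d.T
    topk = np.argsort(-sims, axis=1)[:, :k]           # best candidates first
    hits = (topk == np.asarray(gold)[:, None]).any(axis=1)
    return float(hits.mean())

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[1.0, 0.1], [0.2, 1.0]])
gold = [0, 2]   # query 1's true match is ranked second, behind doc 1

print(recall_at_k(queries, docs, gold, k=1))
print(recall_at_k(queries, docs, gold, k=2))
```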

Two failure modes, two different causes:

Negation ("surged" vs "plunged"): individual word tokens are clearly different at L0 (cosine 0.535 between "surged" and "plunged" alone). The problem is mean pooling — 9 shared tokens dominate the sentence average, bringing full-sentence cosine to 0.899 at L0. Middle layers make it worse (peaking at 0.991 at L13), then the final layers pull it down to 0.843. Still very high, but discrimination is at least possible.

Role reversal ("tech giant acquired rival" vs "rival acquired tech giant"): starts at 0.988 and barely moves — ending at 0.980 after 24 layers. The model gains almost nothing. Mean pooling is order-insensitive, so swapping subject and object produces nearly the same average. The final layers apply the same contrastive gradient to both sentences and can barely separate them because they share all the same content words pointing in the same directions.
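Both failure modes follow from properties of the pooling operation itself, which a toy example can isolate. In this sketch the token vectors are synthetic orthogonal unit vectors, so it shows only what averaging does. In the real model, positional information changes the token vectors, which is why the measured role-reversal cosine is 0.98 rather than exactly 1.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

basis = np.eye(11)
shared = basis[:9]                      # 9 tokens common to both sentences
sent_a = np.vstack([shared, basis[9]])  # tenth token, e.g. "surged"
sent_b = np.vstack([shared, basis[10]]) # tenth token, e.g. "plunged"

# Negation: the differing tokens are orthogonal, but the averages are close.
token_cos = cosine(sent_a[-1], sent_b[-1])
sentence_cos = cosine(sent_a.mean(axis=0), sent_b.mean(axis=0))
print(token_cos, sentence_cos)

# Role reversal: reordering the same token vectors leaves the mean unchanged.
reordered = sent_a[::-1]
print(np.allclose(sent_a.mean(axis=0), reordered.mean(axis=0)))
```

With nine shared tokens out of ten, the sentence-level cosine is 0.9 even though the differing tokens have cosine 0, which is the same dilution effect described for the negation pair.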

Why does discrimination concentrate so late? Not because the final layers apply larger updates — token norms are stable throughout. What changes is coherence: in the final layers, all tokens in a sentence get pushed in the same direction simultaneously. It's not that each token learns more; it's that the whole sentence's token cloud moves as a unit, driven by the contrastive gradient that's strongest at the output layer.

Why Mean Pooling Beats CLS

The SBERT paper showed empirically that mean pooling outperforms the CLS token for sentence similarity. The mechanism wasn't explained — it was attributed loosely to CLS being task-specific during pretraining. The layer trajectory gives the actual answer.

CLS starts at L0 partially misaligned with the average of all tokens (cosine = 0.242). It gets worse through the middle layers — dropping to 0.465 at L6 — before recovering. Only at L24 does it converge to 0.958.

Mean pooling, by definition, is always the average of all token positions. CLS is a poor approximation of that average for most of the network's depth and only catches up at the very end. Any system that uses CLS at an intermediate layer (probing studies, early-exit models, multi-exit architectures) pays this alignment penalty through most of the depth.

Note that the first principal component of the within-sentence token variation is almost orthogonal to the sentence average (cosine < 0.18 at all layers). The biggest axis of variation among a sentence's tokens is not the semantic direction — it's syntactic roles, positions, function vs content words. Mean pooling extracts the centroid and discards all of that noise. That's why it works.
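Both measurements in this section reduce to a cosine against the sentence mean. A minimal sketch, assuming the CLS token sits at index 0 of an (n_tokens, dim) array and that the mean is taken over all tokens including CLS. Both are assumptions about the setup, and the toy tokens are hand-picked.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cls_alignment(tokens: np.ndarray) -> float:
    """Cosine between the CLS vector (assumed at index 0) and the token mean."""
    return cosine(tokens[0], tokens.mean(axis=0))

def pc1_alignment(tokens: np.ndarray) -> float:
    """|cosine| between the first principal axis of token variation and the mean."""
    mean = tokens.mean(axis=0)
    _, _, vt = np.linalg.svd(tokens - mean, full_matrices=False)
    return abs(cosine(vt[0], mean))  # abs because the sign of a PC is arbitrary

# Toy sentence: tokens vary along x, while the centroid points along y.
toy = np.array([[-1.0, 1.0], [1.0, 1.0], [0.0, 1.0]])
print(cls_alignment(toy), pc1_alignment(toy))
```

In the toy case the main axis of token variation is exactly orthogonal to the centroid, the idealized version of the "cosine < 0.18" observation.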

Two Training Styles, Two Completely Different Geometries

Everything so far is for mxbai, trained with a contrastive loss that compares sentence averages. Now look at jina-colbert-v2, which uses ColBERT's MaxSim loss instead.

In ColBERT, there is no sentence average. Retrieval works by matching each query token to its best-matching document token individually. "Cardiac arrest" retrieves "heart attack" because the word "cardiac" in the query finds the word "cardiac" in an expanded document representation, not because the sentence averages are close.

That difference in how retrieval works produces a completely different geometry inside the model:

mxbai ends at 0.880 — all tokens nearly identical. ColBERT ends at 0.286 — tokens still spread out, each word pointing somewhere different. At the same time, ColBERT's sentence-level separability ratio at L24 is just 0.254 vs mxbai's 3.655. Sentence averages in ColBERT are not discriminative — because they never need to be.

The reason is direct. A mean-pool contrastive loss trains the model to make sentence averages of matching pairs similar and non-matching pairs different. Every token gets pushed in the same direction, because the average is just the sum of all tokens divided by n. Token collapse is not a side effect — it is the geometry the loss function is building.

MaxSim trains each query word to find its best match in the document independently. Different words get pulled toward different directions. The average of those directions is never used, so it's never made discriminative, and tokens stay spread out.
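The two scoring rules can be put side by side. This is a sketch of the scoring functions only, not of either model's training code: MaxSim is written in its standard form (sum over query tokens of the best dot product with any document token), and the token vectors are assumed to be L2-normalized.

```python
import numpy as np

def maxsim_score(q: np.ndarray, d: np.ndarray) -> float:
    """Late-interaction score: each query token takes its best document match.

    q: (n_query_tokens, dim), d: (n_doc_tokens, dim), rows assumed unit-norm.
    """
    return float((q @ d.T).max(axis=1).sum())

def mean_pool_score(q: np.ndarray, d: np.ndarray) -> float:
    """Dense score: cosine between the two token averages."""
    qm, dm = q.mean(axis=0), d.mean(axis=0)
    return float(qm @ dm / (np.linalg.norm(qm) * np.linalg.norm(dm)))

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

print(maxsim_score(q, d))      # every query token finds an exact match
print(mean_pool_score(q, d))   # depends only on the two averages
```

Only `mean_pool_score` ever touches the average, so only a loss built on it has a reason to make that average discriminative. `maxsim_score` rewards tokens for staying individually matchable.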

Note that ColBERT's token similarity actually peaks at 0.692 around L21 before the final layers drop it back to 0.286. The XLM-RoBERTa backbone produces natural token convergence in the middle layers; the ColBERT fine-tuning specifically reverses this at the end of the network, preserving the diversity that MaxSim needs.

The training objective is not just present in the final embedding. It's readable in the geometry at every single layer.

An Open Question: The Non-Monotonic Dip

The within-sentence similarity curve for mxbai is not smooth. It rises to 0.361 by L3, drops to 0.209 at L6, recovers, then explodes in L20–L24.

The most plausible explanation: middle layers first make tokens more different from each other before late layers collapse them. To build syntactic structure — subject, object, modifier — the model needs tokens to carry distinct role signals. Differentiating them first sharpens that encoding. Then the contrastive objective in the final layers collapses the now-structured cloud toward a shared semantic direction.

This matches what probing studies show: syntactic features peak in L6–L10, semantic features in L16–L24. But I haven't run the experiment to confirm it directly — separating content words from function words, or ablating attention vs feed-forward contributions at the dip layers. For now it's a hypothesis, not a result.

Key Takeaways

The training objective leaves a geometric fingerprint at every layer, not just the output. Dense contrastive and ColBERT MaxSim produce completely opposite internal geometries, readable from the embedding layer through L24.

Almost all useful sentence structure is built in the final 4–5 layers. Sentence separability is near-flat through L19, then jumps from 0.18 to 3.66 in the final five. Hard-set Recall@1 matches: flat at 0.30 for 11 layers, 0.80 at L24.

Mean pooling beats CLS because CLS only tracks the sentence average at the final layer. At L6 it's only 0.465 aligned with mean_pool. Mean pooling is always exactly the average. The SBERT paper showed the result; the layer geometry explains the mechanism.

Dense training collapses tokens; ColBERT training keeps them spread. mxbai intra-sentence cosine: 0.055 → 0.880. ColBERT: 0.068 → 0.286. Because dense retrieval matches sentence averages, the loss pushes every token in the same direction. Because ColBERT matches individual words, tokens stay diverse.