The Similarity Attractor: How Transformers Process Hard Negatives

This is a companion to "Layer by Layer: How Words Collapse Into a Sentence Vector." That post covers token collapse geometry. This one covers what happens when you compare two sentences across layers.

"The stock market surged after the central bank cut interest rates" and "The stock market plunged after the central bank raised interest rates" start at cosine similarity 0.899 at layer 0. By layer 13, they've reached 0.991 — more similar than they started. Only in the final layers does mxbai-embed-large-v1 pull them apart, landing at 0.843.

This isn't a quirk of these sentences. The same pattern appears in every pair I measured, from easy to hard: each converges toward high similarity around layer 12, regardless of how semantically different the two sentences are, and only the final layers do the actual discrimination. I'm calling this the similarity attractor.

Watching It Happen

The animation below shows the tokens of four sentences, the hard pair and the easy pair defined in the next section, moving through PCA space as layers progress (PCA fitted once on L0, axes fixed; same technique as the companion post). Circles are sentence 1 of each pair, squares are sentence 2, and diamonds are the mean-pool centroids.

A few things to notice: at L0, the hard pair (red) tokens are tightly clustered — they share 9 of 11 tokens, so the two sentences start in nearly the same region. The easy pair (indigo) starts far away in a completely different part of the space. By L12, both pairs have converged internally — the two red centroids nearly overlap, and the two indigo centroids nearly overlap. In the final layers, the hard pair barely separates while the easy pair's centroids drift apart as the model builds sentence-level meaning.
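The fixed-axes projection is what makes frames comparable across layers, so here is a minimal, dependency-free sketch of it. It assumes each layer's token vectors are available as plain lists of floats. The names `fit_pca_2d` and `project` and the toy `layer0`/`layer12` values are illustrative, not the code or data behind the animation, and in practice a library PCA would replace the hand-rolled power iteration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pca_2d(points, iters=200, seed=0):
    """Return (mean, [axis1, axis2]): the top two principal axes of `points`."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    rng = random.Random(seed)
    axes = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Deflate: remove any component along axes already found.
            for a in axes:
                d = dot(v, a)
                v = [x - d * y for x, y in zip(v, a)]
            # Multiply by the unnormalized covariance: X^T (X v).
            scores = [dot(row, v) for row in centered]
            v = [sum(s * row[i] for s, row in zip(scores, centered))
                 for i in range(dim)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        axes.append(v)
    return mean, axes


def project(points, mean, axes):
    """Project points onto fixed axes; the same mean/axes serve every layer."""
    return [[dot([p[i] - mean[i] for i in range(len(mean))], a) for a in axes]
            for p in points]


# Toy data (hypothetical values, not model output).
layer0 = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
layer12 = [[1.0, 0.5, 3.0], [1.0, -0.5, 3.0]]

mean, axes = fit_pca_2d(layer0)          # fitted once, on L0 only
coords0 = project(layer0, mean, axes)
coords12 = project(layer12, mean, axes)  # later layer, same axes
```

Because the axes come from L0 alone, any movement between frames reflects the representations changing rather than the projection being refitted. The trade-off is that variance appearing in later layers along directions absent at L0 (the third coordinate in the toy `layer12`) is invisible in the plot.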

Every Pair Peaks Around Layer 12

Below are three sentence pairs, measured at every layer using mean-pooled cosine similarity:

Hard: "surged/cut" vs "plunged/raised" (same domain, opposite meaning, 9 shared tokens out of 11)
Medium: a fox-jumps sentence vs the stock market sentence (completely different topics, but both English prose)
Easy: the fox sentence vs a RAG/LLM sentence (different domain, different vocabulary)
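For readers who want to reproduce the measurement, here is a minimal sketch of layer-wise mean-pooled cosine similarity. It assumes you already have per-layer token vectors for each sentence as plain lists; with Hugging Face `transformers`, passing `output_hidden_states=True` returns the embedding output plus one entry per layer, which for a 24-layer model gives the 25 entries L0 through L24. Whether special tokens such as [CLS] and [SEP] are included in the pool is left to the caller here. The function names and toy values are illustrative.

```python
import math


def mean_pool(token_vectors):
    """Average a list of token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layerwise_similarity(layers_a, layers_b):
    """layers_x[l] holds the token vectors of sentence x at layer l."""
    return [cosine(mean_pool(ta), mean_pool(tb))
            for ta, tb in zip(layers_a, layers_b)]


# Toy example: 2 layers, 2 tokens, 2 dimensions (hypothetical values).
layers_a = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
layers_b = [[[1.0, 0.0], [1.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
sims = layerwise_similarity(layers_a, layers_b)
for layer, sim in enumerate(sims):
    print(f"L{layer}: {sim:.3f}")
```

Plotting `sims` against the layer index for each pair produces the kind of curve discussed in this section.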

All three pairs peak in the L12–L13 window. The gap between hard and easy pairs narrows dramatically at the peak: 0.77 at L0, but only 0.23 at L12. The middle of the network briefly makes hard and easy negatives nearly indistinguishable.

The final layers recover separation. But they don't recover it fully — the hard pair lands at 0.843, still much closer to 1.0 than the easy pair's 0.371.
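To put a number on "don't recover it fully", the gaps quoted above can be compared directly. This is plain arithmetic on the figures in the text; the variable names are illustrative.

```python
# Hard-minus-easy similarity gaps quoted in the text.
gap_l0 = 0.77
gap_l12 = 0.23

# Final-layer similarities for the hard and easy pairs.
hard_l24, easy_l24 = 0.843, 0.371
gap_l24 = hard_l24 - easy_l24

# Share of the original L0 gap that is present again at L24.
recovered = gap_l24 / gap_l0

print(f"gap at L24: {gap_l24:.3f}")         # 0.472
print(f"share of L0 gap: {recovered:.0%}")  # 61%
```

So the final layers roughly double the gap from its L12 low, but end with about three fifths of the separation the pairs had at L0.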

Why the Attractor Exists

Middle transformer layers encode surface structure — shared vocabulary, syntactic patterns, domain signals. The hard pair shares 9 of 11 tokens ("the", "stock", "market", "after", "the", "central", "bank", "interest", "rates"). The middle network correctly identifies these sentences as being from the same domain and registers that similarity strongly.

The final layers (L20–L24) are where contrastive fine-tuning has the most leverage. The retrieval loss is applied at the output, so the gradient that separates hard negatives propagates most strongly into the last few layers. This matches what we saw in the companion post: sentence separability is near-flat through L19, then jumps 20× in the final five layers.

The attractor is not a failure — it's the network correctly encoding what these sentences have in common before the final layers decode what makes them different.

Which Tokens Are Actually Discriminative

Not all tokens contribute equally. The hard pair has 9 identical tokens ("the", "stock", "market"...) and two that differ: "surged" vs "plunged", "cut" vs "raised". The heatmap below shows per-token MaxSim (each token's best match in the other sentence) across all 25 layers.

At L0, the shared tokens ("the", "stock", "market", etc.) are literally identical between the two sentences — MaxSim = 1.000. The discriminative tokens start very different: "cut" vs "raised" = 0.153, "surged" vs "plunged" = 0.312.

By L12 (the attractor peak), "cut" has risen to 0.693 and "surged" to 0.814: the model has pulled them toward the shared semantic neighborhood. By L24 the two have nearly equalized at 0.726 and 0.739, with "cut" edging up further and "surged" falling back from its peak. Meanwhile, the shared tokens drop from 1.000 to ~0.85–0.89 as the model builds sentence-level context that differentiates even the shared words by their role in the sentence.

The result: at L24 the token-level MaxSim span is 0.726–0.890. The discriminative tokens are still the lowest, but the gap has compressed from 0.85 (at L0) to 0.16. They no longer stand far enough apart from the shared tokens to make the sentences easily separable.
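Per-token MaxSim as used in this section can be sketched in a few lines. The sketch assumes cosine similarity between token vectors and takes, for each token in one sentence, its best match in the other. The function names and toy vectors are illustrative, not the code behind the heatmap.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def per_token_maxsim(tokens_a, tokens_b):
    """For each token vector in A, its highest cosine to any token in B."""
    return [max(cosine(a, b) for b in tokens_b) for a in tokens_a]


def average_maxsim(tokens_a, tokens_b):
    """Mean of the per-token MaxSim scores."""
    scores = per_token_maxsim(tokens_a, tokens_b)
    return sum(scores) / len(scores)


# Toy vectors (hypothetical): two tokens with exact matches, one without.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

scores = per_token_maxsim(tokens_a, tokens_b)
avg = average_maxsim(tokens_a, tokens_b)
```

Running `per_token_maxsim` on each layer's token vectors gives one row of the heatmap per layer. Note that ColBERT's own scoring sums the per-token maxima over query tokens; the average is used here to match the figures in this post.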

The grouped bar chart below makes this stark:

The Failure Mode at L24

The hard pair lands at mean-pool cosine 0.843 at L24. A retrieval system using these embeddings would score "plunged/raised" as highly relevant to a query about "surged/cut" — exactly the failure mode that hard-negative training is designed to fix.

CLS pooling doesn't rescue the pair. At L0 the [CLS] token has the same embedding in both sentences (no context has been mixed in yet), so its similarity starts at 1.000, and it only falls to 0.801 by L24. That is marginally lower than mean-pool's 0.843, but still high enough that CLS-pooled embeddings will score this hard negative as a near match.

MaxSim (ColBERT-style) doesn't save you here either. The average MaxSim at L24 is 0.832 — almost identical to mean-pool's 0.843. The 9 shared tokens each find a near-identical match in the other sentence, outvoting the 2 discriminative tokens regardless of how far apart "cut" and "raised" have moved.
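The outvoting effect is easy to check with arithmetic. The sketch below assumes the 9 shared tokens all sit somewhere in the ~0.85–0.89 range quoted earlier and uses 0.726 and 0.739 for the two discriminative tokens. The exact per-token values are not given in the text, so this brackets the average rather than reproducing it.

```python
n_shared, n_disc = 9, 2
disc_scores = [0.726, 0.739]          # "cut" and "surged" at L24
shared_low, shared_high = 0.85, 0.89  # approximate range for shared tokens


def average(shared_score, disc):
    """Average MaxSim if every shared token scored `shared_score`."""
    return (n_shared * shared_score + sum(disc)) / (n_shared + n_disc)


low = average(shared_low, disc_scores)    # about 0.829
high = average(shared_high, disc_scores)  # about 0.861

# Counterfactual: the two discriminative tokens find no match at all.
floor = average(shared_low, [0.0, 0.0])   # about 0.695

print(f"bracket: {low:.3f} to {high:.3f}, floor: {floor:.3f}")
```

The reported 0.832 falls inside the bracket, and even in the counterfactual where both discriminative tokens score zero, the shared tokens alone hold the average near 0.70.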

What This Means for Training

The similarity attractor explains several practical observations about embedding model training:

Hard negatives are necessary, not optional. A model trained with only random negatives (easy pairs) never learns to push apart sentences that share domain and vocabulary but differ in meaning. The attractor ensures these pairs are always rated highly similar by the base model — you can't fix this without explicitly showing the model that "surged/cut" and "plunged/raised" should be different.

Layer-wise probing understates the model. If you probe intermediate layers to measure how well a model separates sentences, you'll find the model looks worse at L12 than at L0 for hard negatives. This isn't the model getting confused; it's the attractor doing its job before the final layers decode meaning.

The final 4–5 layers are doing the hard-negative work. The hard pair only separates from 0.991 to 0.843 in those final layers. If you early-exit, distill from an intermediate layer, or prune final layers, you're specifically degrading hard-negative discrimination.

The attractor is a feature of the backbone, not the fine-tuning. The pretrained backbone (BERT-large-style for mxbai-embed-large-v1, XLM-RoBERTa for jina-colbert-v2) builds these shared surface representations in middle layers before any retrieval training. The fine-tuning just has to work with, and against, that structure in the final few layers.
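The first observation above, that random negatives give the model little to learn from, can be illustrated with a toy loss calculation. The post only says "contrastive fine-tuning" and "retrieval loss", so the InfoNCE form, the temperature of 0.05, and the positive similarity of 0.90 are all assumptions made for illustration. The two negative similarities reuse the easy and hard L24 values from earlier as examples of a low and a high score.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denom - logits[0]


pos_sim = 0.90  # hypothetical similarity to the true positive

loss_easy = info_nce(pos_sim, [0.371])  # negative scored like the easy pair
loss_hard = info_nce(pos_sim, [0.843])  # negative scored like the hard pair

print(f"easy negative loss: {loss_easy:.6f}")
print(f"hard negative loss: {loss_hard:.6f}")
```

Under these assumptions the easy negative contributes a loss several orders of magnitude smaller than the hard one, so a batch made only of easy negatives produces almost no gradient pushing "surged/cut" away from "plunged/raised".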