Published on

Understanding Random Embeddings

Looking for TL;DR? Skip to key takeaways. And feel free to DM me on X or LinkedIn if I got something wrong :)

I've always heard that random vectors are useless for search, but I wanted to actually see it. So I generated 8,000 random 1024-dimensional unit vectors and asked the simplest question I could: for any given vector, which one is its nearest neighbor, and how near is it?

The closest of all 8,000 sits at cosine 0.12. A random pair sits at 0.00. So the "nearest neighbor" is basically the least-far of 8,000 roughly-equidistant points — there's nothing special about it.

Now the same test with real embeddings (mxbai-embed-large-v1 over 8,000 short text strings, same count, same 1024 dims): the nearest neighbor sits at 0.73, the median pair at 0.39. Look at that for a second — a real embedding's typical pair (0.39) is already closer than random's best neighbor of 8,000 (0.12). The whole distribution is shifted. That 0.12-vs-0.73 gap is what this whole post is about — why random vectors don't have it, what "structure" is as the cure, and what that structure actually buys you (mostly compute, it turns out, not recall).

🎲 Random vectors don't have neighbors

In high dimensions, random points concentrate: nearly all pairwise distances pile up around the same value. This is the concentration of distances, and it's been known to break nearest-neighbor search since Beyer et al. (1999) asked "when is nearest neighbor meaningful?". For random unit vectors the answer is: essentially never. Two random directions in 1024-d are almost always orthogonal (cosine ≈ 0), and even the closest of 8,000 only gets nudged to 0.12.

Here's the contrast directly — the cosine to a vector's nearest neighbor, its 10th neighbor, and a typical (median) pair:

For real embeddings the bars fall off a cliff: the nearest is much closer than typical, and there's a clear ordering among the top neighbors. For random vectors the bars are flat and low — the nearest neighbor (0.12) is barely distinguishable from a coin-flip pair (0.00). There is no neighborhood to find.

Why is the random "nearest neighbor" so far away? Because in high dimensions, almost every pair of vectors is perpendicular.

📐 In 1024 dimensions, everything is perpendicular

Here's the part I found genuinely unintuitive. Pick two random directions in d dimensions and measure the cosine between them. In 2D or 3D they're often fairly aligned. But as d grows, the cosine concentrates around 0, so two random unit vectors are almost always nearly orthogonal. Drag the dimension below and hit resample to watch the distribution of pairwise cosines collapse onto 0:

dimension d =
-1-0.500.51cosine similarity between two random unit vectors →
At d = 3, two random unit vectors have cosine 0 ± 0.577 (theory: 1/√d = 0.577). 10% of pairs land within ±0.1 of orthogonal.

The width of that distribution shrinks as exactly 1/√d. The intuition: a unit vector spreads its length across d coordinates, so each is about 1/√d in size; the dot product of two independent unit vectors sums d such terms with random signs, landing near 0 with standard deviation 1/√d. Measured against theory, it's a near-perfect match:

This is what wrecks search. In 2D, only 6.5% of random pairs are near-perpendicular (within ±0.1 cosine of orthogonal) — you can picture that. By 1024-d it's 99.9%:

dimensionrandom pairs within ±0.1 of orthogonal
26.5%
820%
6457%
25689%
102499.9%

When everything is perpendicular to everything else, "nearest" stops meaning much: no direction is meaningfully closer than any other. That's the geometry real embeddings escape — the rest of the post is how.

🔍 What that costs a search index

So what does this cost an actual index? HNSW finds nearest neighbors by greedily walking a graph: from some entry node, hop to whichever neighbor is closer to the query, repeat. That only works if there's a distance gradient to follow, i.e. each step has to get you meaningfully closer. Random vectors have no gradient (every direction is ~equidistant), so the walk wanders and the index has to explore far more candidates to compensate.

I built HNSW indexes (M=16, efConstruction=200) over both datasets and swept the search-effort knob efSearch, measuring recall@10 against exact brute force plus query latency:

Real embeddings hit 0.90 recall by efSearch=32, at 0.016ms per query, and stay pinned there. Random never gets there: even at efSearch=256 (8× the effort, 5× the latency) it tops out at 0.855. Note that this is the same index, same 1024 dims, same 8,000 vectors. The only thing that changed is whether the vectors have structure, and you can't efSearch your way out of geometry.

🧱 So what is structure, then?

"Structure" is a vague word, so let me pin it to something I can measure. I started from pure random vectors and injected structure three ways, then watched what each did to search. Each is a knob from "random" toward "real":

  1. Low intrinsic dimension — generate the vectors inside an r-dimensional subspace, then embed that into 1024-d. Sweep r from 2 to 1024.
  2. Spectral decay — keep all 1024 dimensions but scale them by a decaying spectrum (i^(-α/2)), so a few directions dominate. Sweep α. This is the anisotropy real embeddings are famous for.
  3. Clusters — draw the vectors from a mixture of 64 centers and sweep how tight the clusters are.

To put all three knobs on a common axis I use intrinsic dimension — roughly, how many numbers you'd actually need to pin down a point, regardless of how many it's stored with. A sheet of paper crumpled inside a 3D room is still intrinsically 2D; random vectors are the opposite, using up nearly every dimension they're given. I estimate it with TwoNN, a standard intrinsic-dimension estimator — and the contrast is the key number here: random comes out ~230, real mxbai ~14. Plotting query latency against intrinsic dimension (each point is one knob setting; hover to see which; lower-left is cheaper) separates two distinct mechanisms:

Two mechanisms, and they're not the same:

  • Lowering effective dimensionality (subspace rank or spectral decay) helps, and the gain tracks intrinsic dimension — both curves slide down-and-left together. This is the direct cure for distance concentration: collapse onto a low-dimensional surface and neighbors become meaningful again.
  • Clustering helps through a different door. As clusters tighten, latency drops 4–5× (0.056 → 0.013ms) while the intrinsic dimension stays high (~145). The cluster points fall almost straight down, off the dimensionality trend. Think of the clusters as islands: the greedy walk ferries to the right island cheaply, then does the fine search locally. That's a different win from low dimensionality, and the intrinsic-dimension estimate can't even see it.

Real mxbai (the red star) has both: a low intrinsic dimension (~14) and cluster structure — which is why it sits where it does, far cheaper than random.

There's a second, quieter result hiding in this sweep. Structure mostly buys compute, not recall. Once any of these knobs is turned even slightly past "random," recall snaps to the ~0.90 HNSW ceiling and stays there; the dramatic 4–5× swings are all in latency. The recall gap is really just the one-time jump from random's 0.81 floor to the structured 0.90 ceiling — after that, structure is a compute story, not an accuracy one.

Full numbers (intrinsic dim · recall@10 · latency) for all three sweeps
knobintrinsic dimrecall@10ms/query
random (anchor)229.90.8070.0578
real mxbai (anchor)14.20.9000.0372
intrinsic r=20.30.9000.0055
intrinsic r=86.40.9000.0242
intrinsic r=3225.10.9000.0487
intrinsic r=12864.20.8980.0537
intrinsic r=512118.20.8830.0554
intrinsic r=1024158.50.8710.0556
spectral α=0.5163.70.8770.0546
spectral α=1.065.50.8980.0341
spectral α=1.533.10.9000.0275
spectral α=2.020.40.9000.0218
spectral α=3.012.20.9000.0172
clusters c=0.5154.60.9000.0339
clusters c=2.0156.00.9000.0220
clusters c=8.0145.90.9000.0128

8,000 vectors, 1024-d, 1,500 queries, HNSW M=16/efC=200/efSearch=64.

For search, isotropy is the enemy

Here's the part that cuts against standard advice. Random vectors are the most isotropic data you can have — they fill every dimension equally — and they're the worst to search. Every structured dataset above is less isotropic than random. So the usual NLP framing of "isotropy good, anisotropy bad" is about a different problem (cosine saturation in similarity scores); for nearest-neighbor search you actually want anisotropy, because variance piled into fewer directions is what gives the greedy walk a gradient to follow. (Which single number best measures "structure" — and why the average-cosine anisotropy the field leaned on for years barely detects it — is a rabbit hole I save for the follow-up post.)

🚫 You can't bolt structure on at index time

If structure is what makes search cheap, can you just bolt it on after the embeddings are fixed, at index-build time, without touching the vectors (so recall stays put)? I tried the three obvious levers. All three came up empty, and the empty results are the interesting part.

Insertion order. HNSW is built incrementally, so the order you insert vectors changes the graph. I tried inserting cluster-by-cluster, cluster-medoids-first, and densest-points-first against a random baseline. The graph changes; the cost doesn't.

insert order (real embeddings)recall@10distance comps / query
random0.9001,022
cluster-blocked0.9001,027
medoid-first0.9001,024
hub-first0.9001,024

Smart entry points. HNSW spends hops descending its upper layers to find a good place to start the base-layer search. So I tried skipping that: route the query to its cluster first (compare to 64 centroids) and start the search right inside that cluster. It didn't help — and the reason is the useful part. A single fixed entry point with no hierarchy at all costs the same as the full hierarchy:

entry strategy (real embeddings)recall@10distance comps / query
full HNSW hierarchy0.9001,016
single global fixed entry0.9001,001
cluster-routed entry0.9001,016

At a practical efSearch, the cost is dominated by the base-layer beam — the work of examining the ~efSearch candidates around the query to polish the answer — not by getting to the right neighborhood. A better starting point only saves the cheap descent. (This flips at very large N, where the descent grows ~log N and routing starts to matter — worth revisiting at 100M+ scale, but at 8k it's noise.)

The third lever, M and efConstruction, does move the numbers — but only by trading compute for recall along the usual curve, not by manufacturing a free speedup.

The throughline: none of the index-side tricks make search cheaper for free. The ~3× spread we keep seeing — random at ~1,875 distance comps vs real at ~1,020 vs tight clusters at ~570 — comes from the data's own geometry, which is baked in when the embeddings are produced. An index can trade compute for recall; it can't put structure into vectors that don't have it.

Aside: HNSW indexes are nondeterministic — and it doesn't matter

Rebuild the same HNSW index over the same vectors and you get a different graph every time — not because of a random seed or insertion order, but because the default multi-threaded build has a race in how concurrent neighbor updates land. With the same seed and same order, the saved index differs byte-for-byte on every run; only single-threaded construction (num_threads=1) reproduces it exactly.

The reassuring part: across all those different graphs, recall and latency are stable to within ~0.1%. Different graph, same quality. So the nondeterminism is real but benign — worth knowing if you ever try to diff or cache indexes by hash.

📌 Key takeaways

  • Random vectors have no neighbors. The nearest of 8,000 random 1024-d vectors sits at cosine 0.12 vs 0.00 for a random pair. Real embeddings show a real gap (0.73 vs 0.39), and that gap is what nearest-neighbor search runs on.
  • In 1024 dimensions, everything is perpendicular. The cosine between two random vectors concentrates at 0 with spread 1/√d; by 1024-d, 99.9% of random pairs are within ±0.1 of orthogonal, so "nearest" barely means anything.
  • Random embeddings are expensive to search and cap out lower. Same index, same dims: real hits 0.90 recall at efSearch=32; random never beats 0.855 even at 8× the effort.
  • "Structure" is two separate things. Low effective dimensionality (cures distance concentration; the gain tracks intrinsic dim) and clustering (separable basins; 4–5× cheaper search, independent of dimensionality). Real embeddings have both: intrinsic dim ~14 plus clusters.
  • Structure mostly buys compute, not recall. Past the random→structured jump, recall pins at the HNSW ceiling; the big wins are all in latency.
  • For search, isotropy is the enemy. Random is the most isotropic data and the worst to search. You want variance piled into fewer directions, not spread evenly.

🔭 Closing thoughts

The thing I keep coming back to: structure isn't something the index provides, it's the compute the encoder already did, frozen into the geometry. A trained embedding model spends a forward pass folding a sentence onto a low-dimensional, clustered surface, and the index just cashes that check at query time. Random vectors never had a forward pass, so there's nothing to cash, and no amount of efSearch or clever insertion order can invent it after the fact.

Which raises the obvious next question: if structure is just cached compute, can we manufacture it directly, without training a model? Can you write down a generator that produces vectors as searchable as real ones? I tried, and the short version is: you can match every aggregate statistic of real embeddings and still fail a shape test, until you copy the geometry locally. That's the follow-up post — plus a look at what mxbai's own 24 layers are each doing to the geometry, from the raw word-embedding table to the final sentence vector.

Thanks for reading! Feel free to DM me on X or LinkedIn if you spot a mistake or want to nerd out about this.

  • You can't add structure in the index. Insertion order and smart entry points give nothing at this scale (the base-layer beam dominates, not the descent); M/efConstruction only trade compute for recall. The geometry is set when the embeddings are made — the index can't fake it.