Tokenization is a Compression Codec Nobody Uses That Way

Looking for the TL;DR? Jump to the Key Takeaways at the end.

Every vector database has two layers: the vectors (heavily optimized — quantized, compressed, carefully sharded) and the raw text payload (stored verbatim as UTF-8). Nobody talks about the second one.

I ran some benchmarks and found that the payload layer is a 5x compression opportunity sitting in plain sight. The tool to do it — BPE tokenization — is already in every ML stack. It just needs to be pointed at the right problem.

The Forgotten Layer

When you insert a document into Qdrant, Weaviate, or Elasticsearch, you typically store something like:

{
  "id": "doc_123",
  "vector": [0.12, -0.34, ...],
  "payload": {
    "text": "Retrieval-Augmented Generation is a technique..."
  }
}

The vector gets quantized, compressed, and indexed carefully. The text field gets written as-is.

Qdrant stores payloads as JSON on RocksDB — no compression by default. Elasticsearch applies LZ4 on stored fields, which is better but still generic byte compression. Neither system knows anything about the structure of natural language text.

That's the gap.

Tokenization is Already Compression

BPE (Byte Pair Encoding) — the tokenizer behind GPT-2, GPT-4, and most modern LLMs — merges frequently co-occurring byte sequences into single tokens. "retrieval" becomes one token. "embeddings" becomes one token. Common phrases get compressed into fewer symbols.

The compression angle is underappreciated. A back-of-envelope estimate for English text:

  • Average word length: ~5 characters = 5 bytes (ASCII)
  • Add a space: 6 bytes per word
  • Average English word: ~1.3 BPE tokens
  • Token ID stored as uint16: 2 bytes

6 bytes/word ÷ 1.3 tokens/word ≈ 4.6 bytes/token (raw)
2 bytes/token (uint16)
ratio: ~2.3x

That's before applying any further compression on the token IDs themselves — and it lines up with the 2.4x measured below.

The catch: this only works if the tokenizer's vocabulary fits in uint16 (max 65,535). GPT-4's cl100k tokenizer has 100,277 tokens — that needs uint32 (4 bytes), which kills the advantage. GPT-2's r50k tokenizer has 50,257 tokens — fits cleanly in uint16.

Tokenizer        Vocab size   ID type   Bytes/token
────────────────────────────────────────────────────
cl100k (GPT-4)   100,277      uint32    4 bytes  ← no win
r50k (GPT-2)      50,257      uint16    2 bytes  ← ~2.4x win
mxbai WP          30,522      uint16    2 bytes  ← ~2.4x win
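
Checking whether a tokenizer clears the uint16 bar takes two lines with tiktoken (n_vocab counts every assignable ID, including specials like <|endoftext|>):

import tiktoken

# IDs fit in uint16 iff the vocab size is at most 65,536 (max ID 65,535).
for name in ("cl100k_base", "r50k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, enc.n_vocab, "uint16" if enc.n_vocab <= 65_536 else "uint32")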

The Benchmark

I took a ~270-word paragraph about RAG and vector databases (representative of real payload content) and measured every combination I could think of.

import struct
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
token_ids = enc.encode(text)                             # text = the benchmark paragraph
packed = struct.pack(f">{len(token_ids)}H", *token_ids)  # uint16, big-endian
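
The byte-level baselines are one call each — something like the following (gzip from the stdlib, the zstandard bindings; lz4 is analogous via lz4.frame):

import gzip
import zstandard

raw = text.encode("utf-8")
sizes = {
    "raw":         len(raw),
    "gzip -9":     len(gzip.compress(raw, compresslevel=9)),
    "zstd -22":    len(zstandard.ZstdCompressor(level=22).compress(raw)),
    "r50k uint16": len(packed),   # from the snippet above
}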

Here's what I found:

Method                  Size            Ratio   CER
─────────────────────────────────────────────────────
Raw UTF-8               1,853 bytes     1.0x    0%
gzip -9                 984 bytes       1.9x    0%
zstd -22                981 bytes       1.9x    0%
r50k BPE uint16 (raw)   784 bytes       2.4x    0%
r50k BPE + zstd         709 bytes       2.6x    0%
r50k BPE + ANS          364 bytes       5.1x    0%
mxbai WordPiece + ANS   328 bytes       5.7x    20.1%*
Entropy lower bound     323–358 bytes   ~5.5x   —

*mxbai CER drops to 0% after simple post-processing — more on this below.

The headline: raw BPE uint16 already beats gzip on raw text, with zero further compression. That's just tokenization and packing.

Latency (2,000 runs, 1,853-byte doc)

Compression ratio is only half the story. Here's every component timed separately:

Operation                Median   p99
──────────────────────────────────────
gzip -9 compress         29µs     36µs
gzip decompress          18µs     25µs
zstd -22 compress        182µs    267µs
zstd decompress          4µs      5µs
lz4 compress             24µs     29µs
lz4 decompress           1µs      1µs

Operation                Median   p99
──────────────────────────────────────
r50k tokenize            132µs    208µs
r50k detokenize          6µs      10µs
r50k pack uint16         4µs      5µs
r50k + zstd compress     48µs     59µs
r50k + zstd decompress   3µs      4µs
r50k ANS encode          30µs     42µs
r50k ANS decode          32µs     41µs

Full pipelines:

Pipeline                            Median   p99
──────────────────────────────────────────────────
r50k encode (tokenize + ANS)        156µs    287µs
r50k decode (ANS + detokenize)      41µs     53µs
mxbai encode (tokenize + ANS)       750µs    1,270µs
mxbai decode (ANS + detok + fix)    420µs    603µs

A few things stand out:

  • The tokenizer dominates encode latency, not ANS. r50k tokenization is 132µs; ANS on top adds only 30µs. If you already tokenize documents at ingest time (which most embedding pipelines do), the extra encode cost is just those 30µs.
  • Reads are fast. r50k decode — ANS + detokenize — takes 41µs total. Fast enough for any real-time serving path.
  • zstd -22 is slow to compress (182µs) because --ultra levels trade CPU heavily for ratio. At a lower level (say zstd -3), it would be ~15µs but give up some compression.
  • ANS is symmetric — encode and decode are nearly identical (~31µs each), unlike zstd where decompress (4µs) is 45× faster than compress.
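
Here's a minimal timing harness along these lines (a sketch, not the exact benchmark script) that reproduces the median and p99 columns:

import time
import statistics

def bench(fn, runs=2_000):
    # Time fn() in microseconds; return (median, p99).
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1_000)
    samples.sort()
    return statistics.median(samples), samples[int(0.99 * runs)]

# e.g. bench(lambda: enc.encode(text)) for the "r50k tokenize" row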

What is ANS and Why Does It Matter

zstd operates on bytes. It doesn't know that [0x01, 0x7F] is token ID 383 — it just sees two opaque bytes.

ANS (Asymmetric Numeral Systems) — the entropy coder inside zstd's compression stage (as FSE), also used in Apple's LZFSE and JPEG XL — can be applied directly to token IDs. It knows that a common token like " the" appears tens of times more often than a rare one like "embeddings", so it assigns shorter codes to frequent tokens.

The result: ANS on token IDs gets within 6 bytes of the Shannon entropy lower bound — the theoretical minimum for lossless compression.
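
The bound is cheap to compute from the token statistics themselves — a minimal sketch of the marginal-frequency version (the exact modeling choices behind the 323–358 byte range in the table may differ):

import numpy as np
from collections import Counter

def entropy_bound_bytes(ids):
    # Shannon lower bound for coding the IDs against their own
    # empirical (marginal) distribution: n_tokens * H(p) bits.
    counts = np.array(list(Counter(ids).values()), dtype=np.float64)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum() * len(ids) / 8

And the encoder that lands within a few bytes of that bound: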

import constriction
import numpy as np
from collections import Counter

def ans_encode(ids):
    # Build an empirical frequency model over this document's token IDs.
    counts = Counter(ids)
    vocab  = sorted(counts.keys())
    v2i    = {v: i for i, v in enumerate(vocab)}
    freqs  = np.array([counts[v] for v in vocab], dtype=np.float64)
    freqs /= freqs.sum()

    # Entropy-code the IDs against that model; frequent tokens get
    # shorter codes. The decoder must rebuild the same model.
    model   = constriction.stream.model.Categorical(freqs, perfect=False)
    encoder = constriction.stream.stack.AnsCoder()
    encoder.encode_reverse(np.array([v2i[t] for t in ids], dtype=np.int32), model)
    return encoder.get_compressed().tobytes()
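
Here's the matching decode sketch. One caveat it makes explicit: the decoder needs the same frequency model (freqs and vocab) plus the token count, so a real deployment would persist those alongside the compressed bytes — or use a shared corpus-level table instead of a per-document one:

def ans_decode(blob, freqs, vocab, n_tokens):
    # Rebuild the identical categorical model used at encode time.
    model   = constriction.stream.model.Categorical(freqs, perfect=False)
    words   = np.frombuffer(blob, dtype=np.uint32).copy()  # ANS state words
    decoder = constriction.stream.stack.AnsCoder(words)
    idx = decoder.decode(model, n_tokens)  # indices into the sorted vocab
    return [vocab[i] for i in idx]

Round trip, recomputing the model from the IDs purely for illustration:

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
ids = enc.encode(text)

counts = Counter(ids)
vocab  = sorted(counts.keys())
freqs  = np.array([counts[v] for v in vocab], dtype=np.float64)
freqs /= freqs.sum()

assert enc.decode(ans_decode(ans_encode(ids), freqs, vocab, len(ids))) == text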

The full pipeline:

text (UTF-8)
    │
    ▼
BPE tokenizer (r50k)              ← 50,257-token vocab, fits uint16
    │
    ▼
[token_id_1, token_id_2, ...]     ← 2 bytes each
    │
    ▼
ANS encoder                       ← frequency model over token IDs,
    │                               short codes for frequent tokens
    ▼
compressed bytes                  ← 5x smaller than the original text

Decode path: ANS decoder → token IDs → tokenizer.decode → original text

The mxbai Story: Better Compression, Surprising Catch

The mixedbread embedding model (mxbai-embed-large-v1) uses BERT's WordPiece tokenizer with 30,522 tokens. Smaller vocabulary → more common tokens → lower entropy → better compression.

mxbai + ANS reaches 328 bytes vs r50k + ANS at 364 bytes — a 10% compression win. But when I measured latency and recovery, the story got more complicated.

Tokenizer         CER     CER (ignore case+ws)   CER (+ fix punct)
────────────────────────────────────────────────────────────────────
r50k BPE          0.0%    0.0%                   0.0%
mxbai WordPiece   20.1%   6.4%                   0.0%

The 20% CER is entirely from two deterministic artifacts of BERT tokenization:

  1. Lowercasing: "Qdrant" → "qdrant", "RAG" → "rag"
  2. Punctuation spacing: "(content)" → "( content )", "up-to-date" → "up - to - date"

Neither is information loss at the token level. The information is there — it's just encoded with different whitespace conventions. Three regex rules fix it completely:

import re

def fix_punct(s):
    # Undo WordPiece detokenization artifacts around punctuation.
    s = re.sub(r'\( ', '(', s)               # "( content" → "(content"
    s = re.sub(r' \)', ')', s)               # "content )" → "content)"
    s = re.sub(r'(\w) - (\w)', r'\1-\2', s)  # "up - to - date" → "up-to-date"
    return s
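
A quick check on the two spacing artifacts (hypothetical input):

print(fix_punct("stores ( content ) that is up - to - date"))
# -> "stores (content) that is up-to-date"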

So mxbai is effectively lossless too, with a small detokenizer. Whether that's acceptable depends on your use case — if you need byte-perfect round trips without post-processing, r50k is the clean choice.

The latency picture makes mxbai harder to justify:

Tokenizer     Encode   Decode   Compressed size
─────────────────────────────────────────────────
r50k + ANS    156µs    41µs     364 bytes
mxbai + ANS   750µs    420µs    328 bytes

mxbai is 5× slower on encode and 10× slower on decode — entirely due to the HuggingFace tokenizer's Python overhead. The 10% smaller output (328 vs 364 bytes) doesn't come close to justifying that. For a storage codec, r50k is the practical choice.

On UNK tokens: WordPiece has a [UNK] token that causes actual information loss. But for standard English text — including code, URLs, numbers, gibberish ASCII — it never fires. It only fires on emojis (🚀 → [UNK]) and some rare Unicode math symbols. So for typical RAG payloads over web or document text, UNK is a non-issue.
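
If you want to check your own corpus, the tokenizer makes it easy to see where [UNK] fires — a quick probe, assuming the Hugging Face model id mixedbread-ai/mxbai-embed-large-v1:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
print(tok.tokenize("deploy to prod 🚀"))    # emoji has no WordPiece entry → '[UNK]'
print(tok.tokenize("https://qdrant.tech"))  # splits into known subwords, no [UNK]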

Real-World Scale: English Wikipedia

English Wikipedia is ~7M articles, 708 words/article on average — a useful benchmark for RAG-scale corpora.

Extrapolating from the benchmark:

Method        Per doc       7M docs   vs raw
──────────────────────────────────────────────
Raw UTF-8     4,497 bytes   31.5 GB   1x
zstd -22      2,382 bytes   16.7 GB   1.9x
r50k + zstd   1,721 bytes   12.1 GB   2.6x
r50k + ANS    884 bytes     6.2 GB    5.1x
mxbai + ANS   797 bytes     5.6 GB    5.6x

At 1B docs the storage gap is more dramatic:

Method       1B docs   S3 ($0.02/GB/mo)   Disk ($0.08/GB/mo)
──────────────────────────────────────────────────────────────
Raw UTF-8    4.5 TB    ~$92/mo            ~$368/mo
r50k + ANS   0.88 TB   ~$18/mo            ~$72/mo
Saving       3.6 TB    ~$74/mo            ~$296/mo

Disk savings dominate. Vector databases keep hot payloads on ephemeral SSD for fast reads — that's the $0.08/GB column. At 1B docs you're saving ~$296/month on disk alone, and the I/O win is the same magnitude: 0.88 TB reads 5× faster than 4.5 TB, which directly reduces query latency when payloads are fetched during retrieval.

S3 savings are smaller in dollar terms but still meaningful at scale. S3 is cheap per GB, so the ~$74/month looks modest. Where it compounds: S3 charges for both storage and data transfer. Cross-region replication, bulk re-indexing, and backup/restore all move the full payload corpus over the wire — at $0.09/GB egress, shipping 3.6 TB less data per operation saves ~$324 per transfer. For a system that re-indexes monthly or replicates across regions, that adds up fast.

How This Compares to State of the Art

For context, here's where this sits on the broader compression landscape (benchmarked on enwik9, 1GB Wikipedia):

Method                                     bits/byte   Ratio vs raw   Practical?
──────────────────────────────────────────────────────────────────────────────────
Raw UTF-8                                  8.0         1x             —
gzip -9                                    ~2.6        3.1x           Yes
zstd -22                                   ~2.0        4x             Yes
r50k + ANS (this post)                     ~1.56       ~5x            Yes
zpaq                                       ~1.4        5.7x           Slow
fx2-cmix (Hutter Prize record, Oct 2024)   0.88        9x             Extremely slow
Chinchilla 70B (DeepMind, 2024)            0.664       12x            Not practical

The gap between r50k + ANS (~5x) and the Hutter Prize record (~9x) is exactly the signal that lives in conditional token probabilities. ANS here only models marginal frequency — how often each token appears globally. The top compressors model P(token | all previous tokens), which is what LLMs do. That's the remaining 2x on the table.

Getting from 5x to 9x requires a language model. Getting from 1x to 5x just requires a tokenizer and 30 lines of Python.

What Needs to Change

Nothing in the current stack prevents this. The pieces exist:

  • tiktoken is already a dependency in any LLM-adjacent stack
  • constriction is a Rust-backed library with a Python API, used in ML compression research
  • Decoding is fast — ANS is symmetric, O(n) in both directions

What would need to change in a vector DB like Qdrant:

  1. Accept encoding: "r50k" as a payload storage option
  2. Pack token IDs as uint16, run ANS on the stream
  3. Decompress transparently on read

The index, vectors, and query pipeline stay identical. This is purely a payload storage optimization.
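
As a sketch of how small that surface is, here's a hypothetical wrapper (names invented), with the ANS stage elided since uint16 packing alone already gives the 2.4x tier:

import struct
import tiktoken

class R50kPayloadCodec:
    # Hypothetical transparent payload codec: text <-> packed uint16 IDs.
    # A production version would run the ANS stage from earlier on top.
    def __init__(self):
        self.enc = tiktoken.get_encoding("r50k_base")

    def encode(self, text: str) -> bytes:
        ids = self.enc.encode(text)
        return struct.pack(f">{len(ids)}H", *ids)

    def decode(self, blob: bytes) -> str:
        ids = struct.unpack(f">{len(blob) // 2}H", blob)
        return self.enc.decode(list(ids))

Writes run encode before the RocksDB put; reads run decode after the get. Nothing else in the query path changes.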

The latency overhead is acceptable for both paths:

  • Write (encode): 156µs per doc. Ingest pipelines already spend far more than this on embedding generation (~25ms for a 780-token doc on H100). It's a rounding error.
  • Read (decode): 41µs per doc. A typical RAG query returns 10–20 docs — that's 400–800µs of decode overhead on a path that already includes vector search and reranking. Negligible.
Current                              Proposed
───────────────────────────────────────────────────────────────
INSERT                               INSERT
  payload: {"text": "..."}             payload: {"text": "..."}
       │                                    │
       ▼                                    ▼
  JSON → RocksDB (raw)                 tokenize → uint16 IDs → ANS
                                       → RocksDB (compressed)

READ                                 READ
  fetch raw JSON bytes                 fetch compressed bytes
       │                                    │
       ▼                                    ▼
  return as-is                         ANS decode → token IDs
                                       → detokenize → original text

The entire round-trip is transparent to the user. Same API, 5x smaller payloads.

Key Takeaways

  • Vector databases optimize vectors aggressively but store raw text payloads verbatim — a 5x compression opportunity.
  • BPE tokenization (r50k, vocab=50,257) produces uint16 token IDs — 2 bytes each — that already beat gzip on raw text with zero further compression.
  • Adding ANS entropy coding on token IDs brings you to 5x compression using 30 lines of Python and infrastructure already in every ML stack.
  • r50k is perfectly lossless (0% CER). mxbai WordPiece gives 5.7x but lowercases text — fixable with simple post-processing, but also 5–10× slower to tokenize.
  • Latency is not a concern: encode adds 156µs/doc (vs ~25ms for embedding generation); decode adds 41µs/doc (negligible on a query path that already does vector search + reranking).
  • The tokenizer dominates encode latency (132µs), not ANS (30µs). If you tokenize at ingest anyway, the marginal encode cost is just those 30µs.
  • The gap between this (~5x) and the Hutter Prize record (~9x) is conditional token probability — i.e., you need a language model. Getting to 5x doesn't.
  • At English Wikipedia scale (7M docs), this takes payload storage from 31.5 GB to 6.2 GB. The bigger win is I/O: 5x faster bulk reads, indexing, and replication.

Acknowledgements

The compression benchmarks were run using tiktoken, zstandard, lz4, and constriction. The Hutter Prize numbers are from Matt Mahoney's Large Text Compression Benchmark.