I Reverse-Engineered Exa.ai Infrastructure Cost with Napkin Math

Looking for a TL;DR (Too Long; Didn't Read)? Check the key takeaways at the end.

Exa is a cool web search engine company. They recently raised $85M in Series B funding. Exa has described multiple parts of their architecture in their blog posts; I've compiled them together and added my own notes and lots of assumptions for infra-cost napkin math, as a fun exercise for myself. Please take my estimates with a grain of salt since my goal is just to get in the right ballpark. Also, feel free to DM me on X or LinkedIn if you see any wrong or suboptimal assumptions :)

Update: Exa recently made improvements across their pipeline to make search faster and more relevant. This means some of my assumptions might be slightly outdated now, but the fundamentals are still applicable. So enjoy!

📦 Content Storage

Let’s estimate the storage requirements for large-scale text datasets:

  • Fineweb
    • 26B documents occupy 44 TB (compressed), meaning 1B docs ≈ 1.69 TB (compressed)
    • It has 18.5T tokens, counted with the GPT-2 tokenizer. Assuming the standard rule of 1 word = 1.3 tokens, this means 26B docs ~= 14T words.
      • 1 doc ~= 711 tokens ~= 538 words.
      • English words on avg have 5 chars and 1 char = 1 byte, so 538 * 5 bytes ~= 2.7KB / doc
      • 1B docs ~= 2.7TB (estimated uncompressed)
  • English Wikipedia
    • 7M docs occupy 24 GB (compressed), 1B docs ~= 3.42 TB (compressed)
    • It has 708 words / doc on avg. Assuming the standard rule of 1 word = 1.3 tokens, this means 920 tokens / doc on avg.
      • 1 doc ~= 920 tokens ~= 708 words
      • 708 * 6 bytes (reason) = 4248 bytes
      • 1B docs => 4.2TB (estimated uncompressed)

Given these, let's assume Exa's crawlers curate, sanitize, and create summary pages (with LLMs) efficiently enough such that:

  • They are similar to Fineweb but a little more dense.
  • 1B docs require 3TB (uncompressed) and 1.9TB (compressed)
  • On avg, 1 page ~= 600 words ~= 780 tokens ~= 3KB (uncompressed) ~= 1.9KB (compressed)
| Resource | Rate | Volume | Cost / Month |
|---|---|---|---|
| S3 storage | $0.02 / GB | 1.9 * 1e3 GB | $38 |
| Ephemeral SSD | $0.08 / GB | 1.9 * 1e3 GB | $152 |
| Total | | | $190 |
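To keep myself honest, here's the same arithmetic as a tiny Python sketch. The words/doc, bytes/word, and compression ratio are my assumptions from above, not Exa's numbers:

```python
# Napkin math for content storage, based on my assumptions above.
WORDS_PER_DOC = 600            # assumed average page length
BYTES_PER_WORD = 5             # ~5 ASCII chars per English word
COMPRESSION_RATIO = 3.0 / 1.9  # uncompressed / compressed, assumed

doc_bytes = WORDS_PER_DOC * BYTES_PER_WORD            # ~3 KB uncompressed
docs = 1e9
uncompressed_tb = docs * doc_bytes / 1e12             # ~3 TB
compressed_tb = uncompressed_tb / COMPRESSION_RATIO   # ~1.9 TB

s3_cost = compressed_tb * 1e3 * 0.02    # $0.02 / GB-month on S3
ssd_cost = compressed_tb * 1e3 * 0.08   # $0.08 / GB-month on ephemeral SSD
print(f"{uncompressed_tb:.1f} TB raw, {compressed_tb:.1f} TB compressed")
print(f"S3 ${s3_cost:.0f}/mo + SSD ${ssd_cost:.0f}/mo = ${s3_cost + ssd_cost:.0f}/mo")
```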

🔍 Lexical/Keyword search (BM25):

  • They use the standard BM25 algorithm for lexical search
  • Initially, this required about 1.8 TB RAM for 1B docs
  • After optimizations in this post it dropped to 900GB (-50%) RAM for 1B docs

Napkin math

  • Most of the storage overhead in BM25 comes from the postings lists; other data structures have relatively negligible impact.
  • As per their blog, each posting list entry takes:
    • doc_id → 4 bytes (uint32)
    • term_freq → 1 byte (fp8)
    • 5 bytes total
  • This means 1.8 TB / 5 bytes = 360B postings, i.e. ~360 unique terms per doc on avg.
    • That's ~46% of the 780 tokens per doc that we assumed.
    • This is roughly in line with the typical type-token ratio of high-quality pages.
  • There would be a tiny overhead of vocab size
    • Even if we assume 10M unique vocab terms across a 1B corpus with each word being of 5 letters (source), it would be just 5 * 10M * 1 byte / char = 50MB
    • And note that most of this vocabulary comes from misspellings, other languages, and domain-specific terms.
  • Note that Exa mentioned they used uint32 for doc IDs, which means they must have had fewer than ~4.3B docs (2^32) when they wrote this blog in May 2025. They also started using delta encoding, so the index will scale smoothly when they switch to uint64 to hold 10B+ docs (which probably happened with the Exa 2.0 launch).
  • I'm assuming that 100% of the index is kept in RAM for faster iteration over posting lists (~2-5x faster than sequential reads from SSD - source)
| Resource | Rate | Volume | Cost / Month for 1B docs |
|---|---|---|---|
| RAM | $2 / GB | 900 GB | $1800 |
| SSD + S3 | $0.1 / GB | 900 GB | $90 |
| Total | | | $1890 |
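A minimal sketch of that postings-list arithmetic. The 5 bytes/posting comes from Exa's blog; the unique-terms-per-doc figure is back-derived from their reported 1.8 TB:

```python
# Back-of-envelope BM25 index size from postings lists alone.
BYTES_PER_POSTING = 4 + 1       # uint32 doc_id + fp8 term_freq (per Exa's blog)
docs = 1e9

# Back out unique terms per doc from the reported 1.8 TB pre-optimization figure.
unique_terms_per_doc = 1.8e12 / (BYTES_PER_POSTING * docs)    # ~360

index_bytes = docs * unique_terms_per_doc * BYTES_PER_POSTING  # ~1.8 TB
ram_cost = index_bytes / 2 / 1e9 * 2.0   # post-optimization (-50%), $2 / GB-month RAM
print(f"unique terms/doc ~ {unique_terms_per_doc:.0f}, "
      f"index ~ {index_bytes/1e12:.1f} TB, optimized RAM ~ ${ram_cost:.0f}/mo")
```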
Vector/Neural search:

Exa vector search pipeline (figure)
  • As per this, they use binary quantization (BQ) and some clever CPU tricks for vector distance calculations (SIMD + precomputing possible outputs and keeping them in CPU registers for skipping calculation).
  • They train Matryoshka embeddings with 4096 dims but only keep the first 256 dims (binary-quantized) for the index
  • They use an IVF index (also used in FAISS & pgvector), which is an Approximate Nearest Neighbours algorithm. It relies on k-means clustering: at search time, it finds the relevant clusters by comparing the query against the cluster centroids and only scans a few of those clusters.
  • Then they do a 1st re-ranking with 1024-dim BQ vectors and select top candidates to pass to the next stage.
  • Lastly, they do a 2nd re-ranking with 2048-dim f16 vectors to compensate for all this approximation/compression.
    • Note that the model originally generates 4096-dim vectors. They probably don't use the full 4096, at least in Exa Fast search.
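Here's a minimal NumPy sketch of this compress-then-re-rank idea on a toy corpus of random vectors. It is not Exa's code, and it skips the IVF clustering step entirely; it just shows Matryoshka truncation, binary quantization, Hamming-distance scoring, and a final fp16 re-rank:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 4096), dtype=np.float32).astype(np.float16)  # toy corpus
query = rng.standard_normal(4096, dtype=np.float32).astype(np.float16)

def binarize(x: np.ndarray, dims: int) -> np.ndarray:
    """Truncate to the first `dims` Matryoshka dims and binary-quantize into packed bits."""
    return np.packbits(x[..., :dims] > 0, axis=-1)   # dims/8 bytes per vector

# Stage 0: coarse candidate generation with 256-dim binary vectors (32 bytes each).
# (In the real system this would be an IVF scan over a few clusters, not a full sweep.)
doc_bits, q_bits = binarize(docs, 256), binarize(query, 256)
hamming = np.unpackbits(doc_bits ^ q_bits, axis=-1).sum(axis=-1)
stage1 = np.argsort(hamming)[:1000]

# Stage 1: re-rank survivors with 1024-dim binary vectors.
h2 = np.unpackbits(binarize(docs[stage1], 1024) ^ binarize(query, 1024), axis=-1).sum(axis=-1)
stage2 = stage1[np.argsort(h2)[:100]]

# Stage 2: final re-rank with 2048-dim fp16 vectors (dot product).
scores = docs[stage2, :2048].astype(np.float32) @ query[:2048].astype(np.float32)
top10 = stage2[np.argsort(-scores)[:10]]
print(top10)
```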

Napkin math

  • I think their model is a fine-tuned version of Qwen3-Embedding-8B since it fits the description well and is very popular.
    • Qwen3-Embedding-8B has a massive context length of 32K tokens. As per this quick research, most internet pages and even research papers would fit in this context length. So chunking isn't required. 1 document = 1 vector.
    • Update: With Exa 2.0 launch, they have pre-trained and finetuned the embedding model and claim to have found new embedding techniques. However, I don't expect the architecture to be wildly different from state-of-the-art OSS models like Qwen.
  • Let's estimate the storage requirements of vectors
    • Each original vector would take 4096 dims * 2 bytes (assuming fp16) = 8KB by default. This means 1B vectors would take 8 * 10^3 bytes * 10^9 = 8 * 10^12 = 8 TB for 1B docs.
    • By truncating the Matryoshka vectors to 256 dims, this becomes 8 / (4096 / 256) TB = 8 / 16 TB = 512 GB for 1B vectors.
    • With binary quantization, 16 bits (since fp16) becomes 1 bit. So 512 GB / 16 = 32 GB per 1B vectors or 32 bytes (256 bits) per vector
    • Note how MRL truncation + BQ led to 16*16=256x reduction
  • For IVF index with MRL truncated 256d BQ vectors for their "neural search"
    • The IVF index needs centroids generated by a k-means clustering algorithm. If they have 20B vectors, they should have ~200k clusters with ~100K vectors each, since it becomes very hard to brute-force compare vectors in segments larger than 100k within their latency budget.
    • Each cluster will have a centroid and I'm assuming they compare with uncompressed centroid vectors for maximizing recall.
    • Centroids will require 200k * 4096 * 2 bytes (f16) ~= 1.56 GB
    • Each vector ID needs to be stored in the IVF Index, 20B * 8 bytes = 160 GB. Note that I used 8 bytes because it must be uint64 to store 4.2B+ docs.
    • Each 256d truncated + BQ compressed vector will take 32bytes (256 bits), so 20B * 32 bytes = 640 GB
    • So total requirements are like 1.56 + 160 + 640 GB ~= 800 GB for 20B vectors
      • This is 40GB for 1B vectors
      • Index size is just 40/8000*100 = 0.5% of the full vectors (nice!)
  • For 1st stage re-ranking with 1024 dim MRL truncated BQ vectors:
    • Storage for 1B vectors takes 128GB (1024 / 8 * 1B / 1e9)
  • For 2nd stage re-ranking with 2048 f16 vectors:
    • Storage for 1B docs takes 4000GB (8TB/2). This is okay for disk but storing all of them in RAM would be too costly.
    • However, the Pareto principle says 20% of docs will account for 80% of queries, so let's cache only 25% (1000 GB) in RAM
  • Some of these components must be cached in RAM for fast queries, and all the components must be kept on local (ephemeral) disk for paging in/out of memory quickly and to handle node crashes. The ultimate source of truth can be S3, which is regularly updated to keep the index, vectors, and content fresh.
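Before the cost table, here's a small Python sketch reproducing this storage arithmetic per 1B docs. The cluster count and the 25% cache fraction are my assumptions:

```python
# Vector-pipeline storage napkin math per 1B docs (all numbers are my assumptions).
DOCS = 1e9
CLUSTERS = 10_000                                # ~100k vectors per cluster

centroids = CLUSTERS * 4096 * 2                  # fp16 centroids            ~0.08 GB
doc_ids   = DOCS * 8                             # uint64 IDs in the index   ~8 GB
bq256     = DOCS * 256 / 8                       # 256-dim BQ vectors        ~32 GB
bq1024    = DOCS * 1024 / 8                      # 1st re-rank BQ vectors    ~128 GB
f16_2048  = DOCS * 2048 * 2                      # 2nd re-rank fp16 vectors  ~4096 GB

index_gb = (centroids + doc_ids + bq256) / 1e9                 # ~40 GB per 1B docs
ram_gb   = index_gb + bq1024 / 1e9 + 0.25 * f16_2048 / 1e9     # cache 25% of f16 vectors
ssd_gb   = index_gb + bq1024 / 1e9 + f16_2048 / 1e9

monthly = ram_gb * 2.0 + ssd_gb * 0.1            # $2 / GB RAM, $0.1 / GB SSD + S3
print(f"index {index_gb:.0f} GB, RAM {ram_gb:.0f} GB, ~${monthly:.0f}/month")
# -> ~$2.8k/month, close to the ~$2.75k in the table below
#    (the table rounds 4096 GB down to 4000 GB and 1024 GB down to 1000 GB)
```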
| Component | Rate | Volume | Cost / Month for 1B docs |
|---|---|---|---|
| Index RAM | $2 / GB | 40 GB | $80 |
| Index Ephemeral SSD + S3 | $0.1 / GB | 40 GB | $4 |
| 1024d BQ Vectors in RAM | $2 / GB | 128 GB | $256 |
| 1024d BQ Vectors in SSD + S3 | $0.1 / GB | 128 GB | $12.8 |
| 2048d Vectors RAM | $2 / GB | 1000 GB | $2000 |
| 2048d Vectors Ephemeral SSD + S3 | $0.1 / GB | 4000 GB | $400 |
| Total | | | $2752 |

Note how this is 2.8k / 1.9k = ~1.5x the cost of lexical search, and ~87% of it comes from storing the 2048-dim re-ranking vectors (RAM + disk).

🗂️ Metadata storage:

  • For each web page, they also need to store metadata like title, url, publishedDate, author, id, image, favicon, text, summary, etc.
  • Their search API example returns page attributes that can be stored in < 300 bytes. It's a tiny overhead, so let's assume 512 bytes (0.5KB).
  • So total metadata storage on disk + S3 is 1B docs * 0.5KB = 0.5TB for 1B doc metadata
  • Assuming that metadata for the same 25% points are cached, we need 125GB of RAM
| Component | Rate | Volume | Cost / Month for 1B docs |
|---|---|---|---|
| Metadata RAM | $2 / GB | 125 GB | $250 |
| Metadata Ephemeral SSD + S3 | $0.1 / GB | 500 GB | $50 |
| Total | | | $300 |
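As a rough sanity check on the < 300-byte claim, here's a made-up metadata record of roughly that shape (field values are invented, not from Exa's API):

```python
import json

# A made-up metadata record roughly shaped like Exa's search result attributes.
record = {
    "id": "https://example.com/some/long/article-path",
    "title": "An example article title of typical length",
    "url": "https://example.com/some/long/article-path",
    "publishedDate": "2025-01-01T00:00:00.000Z",
    "author": "Jane Doe",
    "image": "https://example.com/cover.jpg",
    "favicon": "https://example.com/favicon.ico",
}
size = len(json.dumps(record).encode())
print(size, "bytes")   # ~300 bytes -> a 512-byte budget per doc is comfortable
```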

🌐📤 Network egress costs:

The API responses sent back to the user also cost money. Napkin math says it's roughly $0.1 per GB for "internet egress".

I made a basic search API call to Exa and the response size was around 3KB. If they serve 300M requests / month (~115 RPS on average, with say 300 RPS provisioned for peaks), it would cost them

3 KB * 300M => 900 GB of egress => 900 * $0.1 = $90 / month for 300M requests, which is very cheap!
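The same arithmetic in a few lines (the 3 KB response size is from my single sample call, and 300M requests/month is an assumption):

```python
# Egress napkin math: 300M requests/month at ~3 KB per response.
requests_per_month = 300e6
rps = requests_per_month / (30 * 24 * 3600)     # ~115 RPS average
egress_gb = requests_per_month * 3e3 / 1e9      # ~900 GB
print(f"{rps:.0f} RPS avg, {egress_gb:.0f} GB egress, ${egress_gb * 0.1:.0f}/month")
```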

🔢 Embedding generation:

  • As we previously derived/assumed, let's say each Exa doc has 780 tokens.
  • If they were to embed 1B docs with OpenAI text-embedding-3-large embedding in batches (see pricing)
    • 780 * 1B * 0.065 $ / M token = 50.7k $ for 1B docs (insane!)
    • Note that OpenAI models aren't great for retrieval and clearly are very costly
  • As per this blog, assuming 100% utilization with H100 GPUs rented at $ 2 / hr, the cost is $ 0.0166 / 1 M tokens.
    • 780B tokens * $0.0166 / 1M tokens = 780,000 * $0.0166 ~= $13k for 1B docs.
    • This is much better! However, notice that this still dwarfs all the other costs.
    • Exa realised this early and that's why they bought/built their Exacluster to save massively on costs:
      • One of the cheapest H100 offerings is $1.87 / hr, but general pricing is $2-4 / hr.
      • Assuming a worst-case provider margin of 20% on the above rate, we get $1.5 / hr. This means the embedding price is 0.0166 / 2 * 1.5 = $0.0125 / 1M tokens when owning H100s.
      • The Exacluster has H200s & A100s instead of H100s, but let's assume the same rates.
      • Note that the Exacluster is also used for training embedding model and re-ranker so there's more value to be derived from the Exacluster.
| Embedding model | Rate | Tokens | Cost for 1B docs |
|---|---|---|---|
| OpenAI text-embedding-3-large | $0.0650 / 1M tok | 780 * 1B | $50.7k |
| Qwen on rented H100 GPUs | $0.0166 / 1M tok | 780 * 1B | $13k |
| Qwen on Exacluster | $0.0125 / 1M tok | 780 * 1B | $9.75k |

They are saving at least $3.25k per 1B docs by building the Exacluster, which is great! With 20B docs, that's $65k saved on the initial embedding run alone, and the savings compound further with crawling updates.
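A tiny sketch of the embedding-cost comparison using the rates above (the GPU rates and the Qwen3-Embedding-8B choice are my assumptions):

```python
# Embedding-cost napkin math: 780 tokens/doc * 1B docs at different $/1M-token rates.
TOKENS = 780 * 1e9

def cost(rate_per_1m_tokens: float) -> float:
    return TOKENS / 1e6 * rate_per_1m_tokens

print(f"OpenAI text-embedding-3-large (batch): ${cost(0.065):,.0f}")
print(f"Qwen3-Embedding-8B on rented H100s:    ${cost(0.0166):,.0f}")
print(f"Same model on owned GPUs (~$1.5/hr):   ${cost(0.0125):,.0f}")
```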

Re-ranker:

  • They might be using something like the Qwen Rerankers
  • Public providers charge $ 0.010 / 1M tokens for the 0.6B model. Let's assume the same rates.
  • It compares 1 query (same 10 tokens) against 100 documents (each with 780 tokens avg). Total tokens: 100 * (780 + 10) = 79k
  • Note: This is purely a search time cost unlike document embeddings which have to be stored/maintained beforehand.
| Re-ranker model | Rate | Tokens / query | Cost of 1k queries |
|---|---|---|---|
| Qwen3-Reranker-0.6B | $0.0100 / 1M tok | 79k | $0.790 |
| Qwen3-Reranker-4B | $0.0250 / 1M tok | 79k | $1.975 |
| Qwen3-Reranker-8B | $0.0500 / 1M tok | 79k | $3.950 |

Note that Exa charges $5 / 1k queries, so it makes more sense for them to go with something like Qwen3-Reranker-0.6B, at least in Exa fast mode.
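And the per-1k-query math for the assumed Qwen re-ranker rates:

```python
# Re-ranker cost per 1k queries: 100 docs * (780 doc tokens + 10 query tokens).
tokens_per_query = 100 * (780 + 10)                 # 79k tokens

for model, rate in [("0.6B", 0.010), ("4B", 0.025), ("8B", 0.050)]:
    cost_per_1k = tokens_per_query / 1e6 * rate * 1000
    print(f"Qwen3-Reranker-{model}: ${cost_per_1k:.3f} / 1k queries")
```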

🕷️ Crawling updates

  • As per this research paper, web content changes meaningfully over time: ~25% of pages show significant textual changes every month.
  • There might be single-digit monthly growth of the dataset - say 2% (assuming aggressive crawling)
  • Content disappearance: ~0.5% monthly
  • Overall let's assume 30% monthly overhead if Exa optimizes aggressively for freshness.

💰📊 Overall costs:

I'm assuming that Exa now has 20B docs with their Exa 2.0 launch

| Component | Cost for 1B docs (monthly) | Cost for 20B docs (one-time setup) | 30% overhead for 20B docs (monthly) |
|---|---|---|---|
| Content storage | 0.19k | 3.8k | 1.14k |
| Lexical storage | 1.89k | 37.8k | 11.34k |
| Vector storage | 2.75k | 55k | 16.51k |
| Metadata storage | 0.30k | 6k | 1.80k |
| Embedding cost | 9.75k | 195k | 58.50k |
| Egress | 0.09k | 1.8k | 0.54k |
| Total | ~14.97k | ~300k | ~90k |
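The same roll-up in code (per-1B figures are carried over from the sections above; the 20x scale-up and 30% monthly churn are my assumptions):

```python
# Roll-up of the per-1B figures into 20B one-time and 30%-churn monthly costs.
per_1b = {            # $k per 1B docs, from the sections above
    "content": 0.19, "lexical": 1.89, "vector": 2.75,
    "metadata": 0.30, "embedding": 9.75, "egress": 0.09,
}
one_time_20b = {k: v * 20 for k, v in per_1b.items()}
monthly_20b  = {k: v * 0.30 for k, v in one_time_20b.items()}

print(f"per 1B docs:  ${sum(per_1b.values()):.2f}k")        # ~ $15k
print(f"20B setup:    ${sum(one_time_20b.values()):.0f}k")  # ~ $300k
print(f"20B @30%/mo:  ${sum(monthly_20b.values()):.0f}k")   # ~ $90k -> ~$1.1M/yr
```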

Notes:

  • 65% of the cost comes from just embedding generation.
  • One-time bare minimum setup cost for 20B docs is 300K$
  • Yearly cost of maintaining Exa 2.0 index with 20B+ docs ~= 90k * 12 ~= 1.1M $
  • And I haven't even considered the cost of clustering/indexing, crawling, and LLM-based parsing. But it should all be doable within $1.5-2M / year for 20B docs.
  • Furthermore, there will be additional costs for running the Exacluster, runtime CPU + GPU (query embedding/re-ranking), LLMs for Websets, observability, and the rest of the infra.
  • Perplexity operates at 200B-doc scale, and Exa would probably reach this scale soon with their latest funding, so their cost would shoot up to $15-20M / year just for the search part. This is consistent with their latest $85M raise: they need lots of capital for a few years of runway.

⏱️ Search latency analysis:

Network latency:

I queried Exa's DNS records and it looks like they are only present in the AWS us-west-2 region. This has downsides because it increases latency a lot for far-away users: the round trip time from India is 600-800ms. You can find ping times from all over the world here. There are two solutions to this:

  • Expand into new regions. But it's very costly to replicate the whole infra in a different region.
  • Use something like the Cloudflare edge network. It basically proxies requests through Cloudflare's edge, which is faster and more stable than the default internet route. It's not zero overhead, but it is definitely a big improvement.

Update: Looks like Exa has switched to the Cloudflare edge network (like Perplexity) with Exa 2.0. This means latency across the globe has decreased. I see that the round trip latency for an empty request from India is now 300-400ms. Good job, Exa team!

You can also use these commands to get a rough idea of latency:

time curl --head https://api.exa.ai/search # from your machine
globalping http api.exa.ai from World --host api.exa.ai --path '/search' --limit 20 --latency # globally

Lexical search latency

The Exa blog clearly says they get sub-500ms latency (p50/median) for retrieving the top 1000 docs with lexical search. Even if they retrieved only the top 10 or 100, it would be just slightly faster, because you still need to score the same documents. This is why they seem to only use neural search in their fast mode.

To add, it's a known fact that BM25 latency degrades faster than vector search with:

  • Longer queries (more posting lists to scan; this downside can be mitigated using WAND, but you still see more variance in latency)
  • More data (less pronounced, but it shows up especially with common but non-stopword terms)

For example, the effect of longer queries can be seen in this benchmark:

| Retriever | Average latency (ms) | p999 latency (ms) |
|---|---|---|
| Lexical (WAND) | 16 | 209 |
| Vector (HNSW) | 18 | 28 |

You can see how avg latency is similar for vector search and lexical search, but p999 is ~7.5x higher for lexical search. Exa uses an IVF index, not HNSW, so its vector latency would also vary more than in this benchmark, but I still expect BM25 latency to be affected more.

Also, since Exa operates with 20B+ docs (with Exa 2.0), I can imagine that they split the 20B docs into multiple "posting list shards", query them in parallel, and merge the results afterwards.

Neural search latency (Exa fast)

Previously Exa fast search had p50 = 425ms and p95 = 604ms.

Now they improved it to p50 = 346ms and p95 = 462ms.

Out of this 425ms, they probably had these components:

(Figure: Exa search pipeline)
  • Network latency: They say it's 50ms. So we have 425ms-50ms ~= 375ms remaining.
  • Embedding generation latency, as per this:
    • As per the blog, there's very little benefit from batching in embedding models.
    • 260 tokens with batch size = 1 => 37RPS => 27ms
    • 260 tokens with batch size = 8 => 64RPS => 15ms
    • Let's assume with 10 tokens (7 word query), we get 70RPS and hence 14ms latency. Assume 15ms for cleaner numbers. Remaining: 375ms-15ms = 360ms
  • Re-ranking query-doc pairs with a cross-encoder model: let's assume another 150-200ms, since that's typical for ranking ~100 items against a query. Remaining: 360 - 200 = 160ms.
  • Vector search + 2 stage re-ranking with uncompressed vectors should be done in around 160ms.
  • With Exa 2.0 fast mode, they claim a p50 improvement of 79ms and a p95 improvement of 142ms. This must come from improvements in the re-ranking model or the vector search part.
    • They might have been able to afford a distilled or quantized version of the cross-encoder re-ranker (thanks to fine-tuning) without hurting precision.
    • Or they might be doing fewer re-ranking steps with less-compressed vectors in the vector search step.
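Putting the assumed budget together: every split here is my guess; only the 425ms p50 and the ~50ms network figure come from Exa.

```python
# Rough latency budget for the older Exa fast search p50 of 425ms (splits are my guesses).
budget_ms = {
    "network round trip": 50,
    "query embedding (batch size 1)": 15,
    "vector search + 2-stage vector re-rank": 160,
    "cross-encoder re-ranking of ~100 docs": 200,
}
print(sum(budget_ms.values()), "ms total")   # 425 ms
```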

What if Exa used HNSW instead of IVF?

  • In this billion-scale KNN benchmark, a 72-vCPU machine handles 1B 100-dim BQ vectors:

    • Search latency of HNSW: 2ms with recall 50% and 4ms with 90% recall with k=10
    • Realtime indexing throughput: 15.4k WPS
    • So avg write latency: 72 vCPUs / 15,400 WPS ≈ 4.68ms. This is just a little higher than search latency because an HNSW insert mainly needs to run a search, after which it knows where to add the new links.
  • The avg empirical (not guaranteed) search time complexity of HNSW is log(N) while IVF is sqrt(N). In theory, this means that to go from 1M to 10B docs:

    • IVF latency balloons by math.sqrt(1e10) / math.sqrt(1e6) = 100x
    • HNSW search time only increases by log(1e10) / log(1e6) = 10/6 = 1.7x
    • If we assume they perform similarly at 1M scale - say 2ms with a single thread - HNSW should remain around 3-5ms while an IVF index search would take 100-200ms (see the scaling sketch after this list).
    • Note that these numbers become worse when the system has more RPS than the number of threads available because of context switching.
    • This is why Exa would have to split 10B docs into smaller ~100k clusters and query only ~10 of them in parallel (limited by the number of threads available), which trades away recall to minimize latency.
    • Also, as previously highlighted, IVF cluster sizes vary, so like BM25, the query latencies (avg vs p999) would vary more depending on the query.
  • Similar flaws of scaling IVF were also demonstrated in my previous benchmark blog with Nirant, which compared Qdrant's HNSW and pgvector's IVF index; that blog helped accelerate the implementation of HNSW in pgvector.

  • If Exa switched to HNSW, the cost of building and storing the index would rise because of higher RAM and CPU requirements, but the recall and latency of HNSW would be much better. Perplexity uses HNSW, and this is part of the reason they were able to claim lower latency and higher recall than Exa 1.0 in their recent benchmarks.

  • Update: Exa has scaled to tens of billions of records (I assumed 20B) with the Exa 2.0 launch and seems to be doing similar to or better than Perplexity without HNSW, probably because of the re-ranking optimizations highlighted above.
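A tiny sketch of the scaling argument above (the log(N) vs sqrt(N) behavior is an empirical rule of thumb, not a guarantee, and the 2ms baseline at 1M vectors is my assumption):

```python
import math

# Empirical scaling assumption from above: HNSW ~ log(N), IVF ~ sqrt(N).
def scale_factor(n_from: float, n_to: float) -> dict:
    return {
        "HNSW (log N)":  math.log(n_to) / math.log(n_from),
        "IVF  (sqrt N)": math.sqrt(n_to) / math.sqrt(n_from),
    }

for name, factor in scale_factor(1e6, 1e10).items():
    # If both take ~2 ms single-threaded at 1M vectors...
    print(f"{name}: {factor:.1f}x -> ~{2 * factor:.0f} ms at 10B")
```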

Key takeaways:

  • Exa applies heavy compression through matryoshka embeddings and binary quantization, then offsets the loss with uncompressed vectors and re-ranking.
  • In the query pipeline, vector search requires careful planning but contributes relatively little time and cost, while re-ranking drives most of the cost and latency.
  • Documents on the web change over time, and the cost of re-indexing (document embedding generation + k-means clustering) is high. To mitigate this, Exa built their Exacluster.
  • Creating a good web-scale index (20B docs) is relatively cheap: ~$300K one-time setup cost + ~$1.3M / year for maintaining it.

Closing remarks:

Web search started in 1994 with Yahoo directories, then search engines like AltaVista introduced keyword search using TF-IDF in the late 1990s, and then Google arrived in 1998 and dominated the game with PageRank. Over the next two decades, Google added lots of heuristic and statistics-based approaches. Then around 2015 they started integrating ML models such as RankBrain and BERT, enabling semantic understanding and context-aware ranking.

And very recently, we see that even those are being surpassed by more generalizable LLMs. In fact, you can now ask the web far more complex queries that were impossible with previous techniques. It's the Bitter Lesson 101: approaches that scale with compute beat human-designed heuristics.

We're entering an era where startups are recreating Google-scale search, and the number of searches will rise exponentially as AI agents drive the majority of search requests. That's why products like Exa, Parallel, and Perplexity are effectively "caching" the web and serving "reads" in an optimized way, while others like Perplexity Comet, OpenAI Operator, etc. want to "write" to the web (i.e. take actions). Web and search are changing fast and I'm very excited about their futures!

Acknowledgements:

Thanks to my friend Nirant for reviewing the blog.