Inference-Free SPLADE: Full Quality, 13× Faster Queries

Looking for TL;DR? Skip to key takeaways

BM25 is the workhorse of lexical search. It's fast, requires no GPU, and is surprisingly competitive on most retrieval benchmarks. Its weakness is well-known: it can only match exact terms. A query for "cardiac arrest" won't retrieve a document that says "heart attack", even if that document is the best answer.

SPLADE fixes this by using a neural network to expand documents into a richer sparse representation. Here's what naver/splade-v3-doc actually activates for "heart attack":

The original tokens score highest, but semantically related terms like cardiac, stroke, disease, chest, death are all activated with meaningful weights. A query for "cardiac arrest" hits cardiac (weight 0.77) even though the document only mentions "heart attack." BM25 would miss it entirely.
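
To make the expansion step concrete, here is a minimal sketch of how a SPLADE-style encoder could turn per-token vocabulary logits into one sparse vector. It assumes the pooling described in the SPLADE papers (log-saturated ReLU, then a max over token positions). The five-word vocabulary and the logits are made up for illustration and are not outputs of naver/splade-v3-doc.

```python
import math

def splade_pool(token_logits, vocab):
    """Collapse per-token vocabulary logits into one sparse {term: weight} dict.

    token_logits: one list of logits per input token, each of length len(vocab).
    Assumed pooling: weight_j = max over tokens of log(1 + relu(logit_ij)).
    """
    weights = {}
    for j, term in enumerate(vocab):
        best = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if best > 0.0:  # keep only activated vocabulary entries
            weights[term] = best
    return weights

# Hypothetical logits for the two input tokens "heart" and "attack".
vocab = ["heart", "attack", "cardiac", "stroke", "banana"]
token_logits = [
    [6.0, 1.0, 1.2, 0.4, -3.0],  # logits at the position of "heart"
    [0.8, 5.5, 0.9, 0.7, -2.5],  # logits at the position of "attack"
]

doc_vector = splade_pool(token_logits, vocab)
for term, weight in sorted(doc_vector.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.2f}")
```

Terms whose logits never go positive (here "banana") drop out, which is what keeps the vector sparse even though every vocabulary entry is scored.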

The catch is query latency: running a transformer at search time adds 50–100ms per query.

Inference-free SPLADE removes that catch entirely. The model runs only at index time on documents. Queries use raw BERT token IDs: no model, no GPU, no neural inference. You pay once upfront. Queries stay fast.

Here's how it compares across six retrieval setups, with one result that surprised me.

How It Works

Standard SPLADE: query → SPLADE encoder (~50ms) → score against pre-computed doc vectors
SPLADE-IF: query → tokenizer (~0.3ms) → score against pre-computed doc vectors

* Doc vectors are pre-computed at index time for both variants; only query processing differs.

The document vectors still carry neural expansions: "heart" activates "cardiac", "coronary", "myocardial". The query side is just tokenization: "cardiac arrest" becomes token IDs [3684, 6295] each with weight 1.0. No model call, no GPU.

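from qdrant_client.models import SparseVector  # assumed: Qdrant's sparse vector type

def encode_query_inference_free(tokenizer, query: str) -> SparseVector:
    # Tokenize only: no model forward pass, no [CLS]/[SEP] special tokens.
    enc = tokenizer(query, add_special_tokens=False, truncation=True, max_length=512)
    # Deduplicate token IDs (sorted for a stable order); each gets weight 1.0.
    unique_ids = sorted(set(enc["input_ids"]))
    return SparseVector(indices=unique_ids, values=[1.0] * len(unique_ids))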

At query time: one tokenizer call + one sparse dot product. That's it.
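
The scoring step can be sketched in a few lines. This is an illustration of a sparse dot product with uniform query weights, not Qdrant's implementation. The IDs 3684 and 6295 and the 0.77 weight come from the example above; the IDs for "heart" and "attack" and the other weights are hypothetical placeholders.

```python
def sparse_dot(query_ids, doc_weights):
    """Sum the doc weights of each distinct query token (query weights are all 1.0)."""
    return sum(doc_weights.get(token_id, 0.0) for token_id in set(query_ids))

CARDIAC, ARREST = 3684, 6295   # token IDs quoted in the article
HEART, ATTACK = 900001, 900002  # placeholder IDs, not real BERT vocabulary entries

query = [CARDIAC, ARREST]  # "cardiac arrest"

# The same "heart attack" document, indexed two ways.
lexical_doc = {HEART: 1.0, ATTACK: 1.0}                  # exact terms only
expanded_doc = {HEART: 2.1, ATTACK: 1.9, CARDIAC: 0.77}  # with doc-side expansion

print("lexical score: ", sparse_dot(query, lexical_doc))
print("expanded score:", sparse_dot(query, expanded_doc))
```

The lexical document scores zero because no query token appears in it, while the expanded document is retrieved through the "cardiac" entry that was added at index time.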

Results

I benchmarked a full matrix of configurations on BEIR scifact (5,183 docs, 300 queries, scientific claim verification): three SPLADE models × two backends × full and inference-free mode, plus BM25 on both backends. SPLADE doc encoding ran on a Modal A10G GPU. naver/splade-v3 and naver/splade-v3-doc are the same model family (symmetric vs asymmetric training), giving the cleanest apples-to-apples measure of dropping query-side inference. The third model, opensearch-neural-sparse-encoding-doc-v3-gte, is an asymmetric doc-only encoder from OpenSearch.
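
For reference, the three quality metrics reported in the tables can be sketched per query as follows. This is a simplified version assuming binary relevance. It is not the evaluation code used for the benchmark, and standard BEIR tooling may differ in details such as graded relevance. The reported numbers are averages over all 300 queries.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant doc in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """DCG with binary gains and log2 discount, normalized by the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical query: one relevant doc, retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(recall_at_k(ranked, relevant), mrr_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```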

Qdrant (native float sparse scoring)

| Method | NDCG@10 | Recall@10 | MRR@10 | Total (median/p90) | Query embed | Search |
|---|---|---|---|---|---|---|
| SPLADE-Full (naver) | 0.7161 | 0.8289 | 0.6886 | 57.2ms / 79.0ms | 50.0ms | 6.6ms |
| SPLADE-Full (PP) | 0.7093 | 0.8240 | 0.6811 | 60.4ms / 88.8ms | 53.2ms | 7.1ms |
| SPLADE-IF (naver) | 0.7068 | 0.8254 | 0.6748 | 4.3ms / 5.7ms | 0.3ms | 4.0ms |
| SPLADE-IF (OpenSearch) | 0.7021 | 0.8220 | 0.6717 | 4.3ms / 5.7ms | 0.3ms | 4.0ms |
| SPLADE-IF (PP, sym) | 0.6859 | 0.8121 | 0.6507 | 3.9ms / 5.7ms | 0.3ms | 3.6ms |
| BM25 | 0.6830 | 0.8088 | 0.6470 | 4.0ms / 5.0ms | 0.1ms | 3.8ms |

The 13× latency gap comes almost entirely from eliminating the query encoder call (50.0ms → 0.3ms). Search time stays in the same range (6.6ms vs 4.0ms).

I also ran everything on Lucene/pyserini. NDCG@10 comes out up to about 2% lower (the full naver model is nearly unchanged). Lucene stores doc weights as integers, so you lose some float precision. OpenSearch and Elasticsearch have the same constraint. None of them support exact float sparse scoring. The tradeoff is fine: integer arithmetic speeds up search.

Lucene/pyserini results (integer-quantized impact scoring, JsonVectorCollection --impact)

| Method | NDCG@10 | Recall@10 | MRR@10 | Total (median/p90) | Query embed | Search |
|---|---|---|---|---|---|---|
| SPLADE-Full (naver) | 0.7156 | 0.8272 | 0.6889 | 57.9ms / 82.8ms | 47.8ms | 9.8ms |
| SPLADE-Full (PP) | 0.6969 | 0.8056 | 0.6704 | 59.3ms / 87.1ms | 51.9ms | 7.4ms |
| SPLADE-IF (naver) | 0.6953 | 0.8039 | 0.6706 | 3.2ms / 4.7ms | 0.2ms | 3.0ms |
| SPLADE-IF (OpenSearch) | 0.6886 | 0.8087 | 0.6593 | 2.5ms / 3.5ms | 0.2ms | 2.3ms |
| SPLADE-IF (PP, sym) | 0.6704 | 0.7954 | 0.6367 | 2.4ms / 3.0ms | 0.2ms | 2.3ms |
| BM25 | 0.6789 | 0.8038 | 0.6457 | 5.5ms / 8.9ms | 0.0ms | 5.5ms |
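
The effect of integer-quantized doc weights on ranking can be illustrated with a toy example. The scale factor of 100 and the weights below are assumptions chosen for illustration; they are not taken from the benchmark pipeline or from pyserini's defaults.

```python
def quantize(weights, scale=100):
    """Round float term weights to integer impacts (scale is an assumed factor)."""
    return {term: round(w * scale) for term, w in weights.items()}

def score(query_terms, doc_weights):
    """Sparse dot product with uniform query weights of 1."""
    return sum(doc_weights.get(term, 0) for term in query_terms)

# Two hypothetical documents with nearly identical float weights.
doc_a = {"cardiac": 0.774, "arrest": 0.312}
doc_b = {"cardiac": 0.771, "arrest": 0.314}
query = ["cardiac", "arrest"]

float_a, float_b = score(query, doc_a), score(query, doc_b)
int_a, int_b = score(query, quantize(doc_a)), score(query, quantize(doc_b))

print("float scores:", float_a, float_b)  # doc_a ranks strictly above doc_b
print("int scores:  ", int_a, int_b)      # the two docs now tie
```

With float scoring doc_a wins by a small margin; after rounding, both documents get the same integer score and the order depends on tie-breaking. Small reorderings like this are one plausible source of the metric differences between the two backends.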

SPLADE-IF Nearly Matches Full SPLADE at 13× Lower Latency

Comparing naver/splade-v3 (full) vs naver/splade-v3-doc (inference-free) in the same model family: dropping query-side inference costs 0.0093 NDCG@10 (0.7161 → 0.7068, under 1.3% relative). The latency difference is stark: full SPLADE costs 57ms median per query vs 4.3ms for SPLADE-IF, a 13× gap that comes almost entirely from the 50ms query encoder call.

Why does inference-free work so well? naver/splade-v3-doc is an asymmetric model trained knowing that queries will be raw tokens with no weighting. So the doc encoder compensates: it expands aggressively at index time, and the raw query terms land on vocabulary that was prepared for them.

The sparsity numbers bear this out. The asymmetric IF model is the densest: all expansion work is front-loaded at index time, and it shows.

| Model | Training | Avg doc non-zeros | Median |
|---|---|---|---|
| naver/splade-v3-doc | asymmetric (IF) | 325 | 321 |
| naver/splade-v3 | symmetric (full) | 286 | 286 |
| opensearch GTE | asymmetric (IF) | 244 | 244 |
| Splade_PP_en_v1 | symmetric (full/IF) | 205 | 205 |

Despite having the densest doc vectors, naver/splade-v3-doc achieves similar quality to full SPLADE at 13× lower query latency.

Both methods beat BM25-Qdrant meaningfully: +3.5% NDCG@10, +2.1% Recall@10, +4.3% MRR@10 for SPLADE-IF, at essentially the same 4ms query latency.
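
The headline figures follow directly from the Qdrant table. A short check, using only values copied from that table (the variable and function names are mine):

```python
# NDCG@10 and median total latency (ms) from the Qdrant results table.
ndcg = {"full": 0.7161, "if": 0.7068, "bm25": 0.6830}
median_ms = {"full": 57.2, "if": 4.3, "bm25": 4.0}

def relative_change_pct(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0

abs_drop = ndcg["full"] - ndcg["if"]                          # cost of dropping query inference
rel_drop = -relative_change_pct(ndcg["if"], ndcg["full"])     # same cost, as a percentage
speedup = median_ms["full"] / median_ms["if"]                 # full vs inference-free latency
gain_vs_bm25 = relative_change_pct(ndcg["if"], ndcg["bm25"])  # SPLADE-IF over BM25

print(f"NDCG@10 drop: {abs_drop:.4f} ({rel_drop:.2f}% relative)")
print(f"Latency speedup: {speedup:.1f}x")
print(f"SPLADE-IF vs BM25 NDCG@10: +{gain_vs_bm25:.1f}%")
```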

The Tradeoff: Free Queries, Expensive Index

At query time, SPLADE-IF and BM25 are the same operation: a sparse dot product in Qdrant. Both take ~4ms. The "inference-free" name can mislead. There's no neural model removed from the query path relative to BM25, just relative to full SPLADE.

The cost is entirely upfront. SPLADE-IF models encode at 58–83 docs/sec on an A10G GPU vs BM25's 1,439 docs/sec on CPU, about 17–25× slower. At that rate, a million-document corpus takes 3–5 hours. For a static or slowly-updated corpus this is a one-time cost. For high-churn indexes you'd need an async re-encoding pipeline, the same problem dense vector search already solves.
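
The indexing estimate is simple throughput arithmetic. A quick sketch using the rates quoted above (the helper name is hypothetical):

```python
def index_hours(num_docs: int, docs_per_sec: float) -> float:
    """Wall-clock hours to encode num_docs at a steady throughput."""
    return num_docs / docs_per_sec / 3600.0

corpus_size = 1_000_000
rates = [
    ("SPLADE-IF, slowest model (A10G)", 58),
    ("SPLADE-IF, fastest model (A10G)", 83),
    ("BM25 (CPU)", 1439),
]
for label, rate in rates:
    print(f"{label}: {index_hours(corpus_size, rate):.2f} h")

# Slowdown of SPLADE-IF encoding relative to BM25 indexing.
print(f"slowdown: {1439 / 83:.1f}x to {1439 / 58:.1f}x")
```

This assumes a single GPU at constant throughput; batching across several GPUs would shorten the wall-clock time proportionally.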

Once indexed: encode once, query forever.

Limitations

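Model choice matters. naver/splade-v3-doc was trained for raw-token queries, so its doc encoder compensates for the missing query expansion. A symmetric model like prithivida/Splade_PP_en_v1 gives up about 0.02 NDCG@10 in IF mode on Qdrant (0.6859 vs 0.7068 for naver/splade-v3-doc, roughly 3% relative), a real quality hit for a free speedup. It still edges out BM25 on Qdrant (0.6859 vs 0.6830) but falls below it on Lucene (0.6704 vs 0.6789).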

No query-time adaptation. The doc encoder must anticipate every useful expansion upfront. On truly novel terms (new drug names, recent product launches, breaking-news proper nouns), SPLADE-IF degrades toward BM25 behavior. Full SPLADE handles this better via query-time expansion.

Domain sensitivity. IF works best when vocabulary bridging is the core challenge and domain vocabulary is predictable (scientific, medical, legal). On e-commerce, where product titles already closely match query terms, document-side expansion helps less and the quality gap over BM25 narrows significantly.

Denser index, GPU required at index time. The asymmetric IF model produces ~14% more non-zeros per document than the full naver model (325 vs 286, see sparsity table above). Encoding also requires a GPU. Both are one-time costs for static corpora.

Key Takeaways

  • SPLADE-IF nearly matches full SPLADE quality. Within the same model family, dropping query-side inference costs only 0.0093 NDCG@10 (under 1.3% relative) while cutting query latency 13× (57ms → 4.3ms). Both beat BM25 by +3.5% NDCG@10 at essentially the same latency.
  • Use an asymmetric model. A symmetric model in IF mode still beats BM25 on Qdrant, but gives up about 0.02 NDCG@10 (roughly 3% relative) vs a model trained for it (naver/splade-v3-doc). If you're starting fresh, pick the asymmetric one.
  • The only real tradeoff is index time. Queries need no neural inference once the corpus is indexed. The GPU cost at index time (58–83 docs/sec on A10G) is acceptable for static corpora. High-churn indexes need an async re-encoding pipeline.