FreshStack Evaluation Framework

Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

University of Waterloo · Databricks
Links: arXiv · Code · Dataset · BibTeX

TL;DR: FreshStack is a holistic framework for building realistic & challenging RAG benchmarks from community-asked questions and answers on niche and fast-growing domains.

Abstract

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. Existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality.
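
The three steps can be pictured as a simple pipeline. The sketch below is a hypothetical, minimal Python rendering of that flow; the real framework relies on LLM prompting and a fusion of retrievers for steps (2) and (3), which are reduced here to naive placeholders.

    # Hypothetical sketch of the FreshStack pipeline; each step is a naive placeholder
    # standing in for the LLM- and retrieval-based components described in the paper.

    def collect_corpus(documents: dict[str, str], chunk_size: int = 400) -> dict[str, str]:
        # (1) Corpus collection: split code and technical docs into fixed-size chunks.
        chunks = {}
        for doc_id, text in documents.items():
            for i in range(0, len(text), chunk_size):
                chunks[f"{doc_id}#{i // chunk_size}"] = text[i:i + chunk_size]
        return chunks

    def generate_nuggets(answer: str) -> list[str]:
        # (2) Nugget generation: the paper prompts an LLM to extract atomic facts from
        # the community answer; here each sentence is naively treated as one nugget.
        return [s.strip() for s in answer.split(".") if s.strip()]

    def support_nuggets(nuggets: list[str], chunks: dict[str, str]) -> dict[str, list[str]]:
        # (3) Nugget-level support: the paper fuses several retrieval techniques and
        # judges support with an LLM; here a chunk "supports" a nugget if they share
        # at least three words.
        support = {}
        for nugget in nuggets:
            words = set(nugget.lower().split())
            support[nugget] = [
                cid for cid, text in chunks.items()
                if len(words & set(text.lower().split())) >= 3
            ]
        return support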

The video walks through the step-by-step procedure FreshStack follows to construct evaluation datasets.

FreshStack Leaderboard

We evaluate various retrieval models on FreshStack (October 2024 version). For each model category, we consider both closed- and open-source embedding models. Evaluation is conducted in a zero-shot setting: given a query, the model retrieves the relevant document chunks. We report results on five datasets: LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4. To add your model to the leaderboard, please submit a PR with your results in leaderboard_data.json. For all models, we use a maximum sequence length of 2048 tokens. Queries in FreshStack can be quite long, so make sure to use an embedding model that handles code and long queries well.
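
As a rough illustration of this setup (not the official evaluation harness), the snippet below embeds a query and the corpus chunks, truncates inputs at 2048 tokens, and ranks chunks by cosine similarity; the model name is only an example.

    # Rough illustration of the zero-shot retrieval setup (not the official harness).
    # The model name is an example; any code-capable, long-context embedding model
    # can be substituted.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")  # example open-source embedding model
    model.max_seq_length = 2048                 # match the 2048-token limit used above

    def rank_chunks(query: str, chunks: dict[str, str], k: int = 50) -> list[str]:
        # Embed the query and all chunks, then rank chunks by cosine similarity.
        chunk_ids = list(chunks)
        q_emb = model.encode([query], normalize_embeddings=True)
        d_emb = model.encode([chunks[cid] for cid in chunk_ids], normalize_embeddings=True)
        scores = (q_emb @ d_emb.T)[0]           # cosine similarity on normalized vectors
        top = np.argsort(-scores)[:k]
        return [chunk_ids[i] for i in top]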



Leaderboard table columns: Retriever, Size, and Date, followed by α@10, C@20, and R@50 for each of Avg. (5), LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4.

Results of different retrieval models across datasets on FreshStack. The best-performing model in each category is in bold, and the second best is underlined.
α@10 (alpha-nDCG@10) measures retrieval diversity and relevance, C@20 (Coverage@20) measures the fraction of nuggets supported by the retrieved documents, and R@50 (Recall@50) measures the percentage of relevant documents present in the retrieved documents.
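
For reference, the sketch below gives hedged implementations of Coverage@k and Recall@k matching the descriptions above; alpha-nDCG@10 additionally penalizes nugget redundancy in the ranking and is omitted here for brevity.

    # Hedged reference implementations of Coverage@k and Recall@k as described above.
    # alpha-nDCG@10 additionally discounts redundant nuggets and is not shown here.

    def coverage_at_k(ranked_ids: list[str], nugget_support: dict[str, set[str]], k: int = 20) -> float:
        # Fraction of nuggets supported by at least one document in the top-k ranking.
        if not nugget_support:
            return 0.0
        top_k = set(ranked_ids[:k])
        covered = sum(1 for docs in nugget_support.values() if top_k & set(docs))
        return covered / len(nugget_support)

    def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 50) -> float:
        # Percentage of relevant documents that appear in the top-k ranking.
        if not relevant_ids:
            return 0.0
        hits = len(set(ranked_ids[:k]) & set(relevant_ids))
        return hits / len(relevant_ids)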

BibTeX

@misc{thakur2025freshstack,
      title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents}, 
      author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
      year={2025},
      eprint={2504.13128},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2504.13128}, 
}