FreshStack Evaluation Framework

Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

University of Waterloo · Databricks
Links: arXiv · Code · Dataset · BibTeX

TL;DR: FreshStack is a holistic framework for building realistic & challenging RAG benchmarks from community-asked questions and answers on niche and fast-growing domains.

Abstract

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. Existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality.
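
The three steps can be pictured as a simple pipeline. The sketch below is a hypothetical, minimal Python rendering of that flow; the real framework relies on LLM prompting and a fusion of retrievers for steps (2) and (3), which are reduced here to naive placeholders.

    # Hypothetical sketch of the FreshStack pipeline; each step is a naive placeholder
    # standing in for the LLM- and retrieval-based components described in the paper.

    def collect_corpus(documents: dict[str, str], chunk_size: int = 400) -> dict[str, str]:
        # (1) Corpus collection: split code and technical docs into fixed-size chunks.
        chunks = {}
        for doc_id, text in documents.items():
            for i in range(0, len(text), chunk_size):
                chunks[f"{doc_id}#{i // chunk_size}"] = text[i:i + chunk_size]
        return chunks

    def generate_nuggets(answer: str) -> list[str]:
        # (2) Nugget generation: the paper prompts an LLM to extract atomic facts from
        # the community answer; here each sentence is naively treated as one nugget.
        return [s.strip() for s in answer.split(".") if s.strip()]

    def support_nuggets(nuggets: list[str], chunks: dict[str, str]) -> dict[str, list[str]]:
        # (3) Nugget-level support: the paper fuses several retrieval techniques and
        # judges support with an LLM; here a chunk "supports" a nugget if they share
        # at least three words.
        support = {}
        for nugget in nuggets:
            words = set(nugget.lower().split())
            support[nugget] = [
                cid for cid, text in chunks.items()
                if len(words & set(text.lower().split())) >= 3
            ]
        return support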

The video walks through the step-by-step procedure FreshStack follows to construct evaluation datasets.

FreshStack Leaderboard

We evaluate various retrieval models on FreshStack (October 2024 version). For each model category, we consider both closed- and open-source embedding models. Evaluation is conducted in a zero-shot setting: given a query, the model retrieves the relevant document chunks. We report results on five datasets: LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4. To add your model to the leaderboard, please submit a PR with your results in leaderboard_data.json. For all models, we use a maximum sequence length of 2048 tokens. Queries in FreshStack can be quite long, so make sure to use an embedding model that handles code and long queries well.
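
As a rough illustration of this setup (not the official evaluation harness), the snippet below embeds a query and the corpus chunks, truncates inputs at 2048 tokens, and ranks chunks by cosine similarity; the model name is only an example.

    # Rough illustration of the zero-shot retrieval setup (not the official harness).
    # The model name is an example; any code-capable, long-context embedding model
    # can be substituted.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")  # example open-source embedding model
    model.max_seq_length = 2048                 # match the 2048-token limit used above

    def rank_chunks(query: str, chunks: dict[str, str], k: int = 50) -> list[str]:
        # Embed the query and all chunks, then rank chunks by cosine similarity.
        chunk_ids = list(chunks)
        q_emb = model.encode([query], normalize_embeddings=True)
        d_emb = model.encode([chunks[cid] for cid in chunk_ids], normalize_embeddings=True)
        scores = (q_emb @ d_emb.T)[0]           # cosine similarity on normalized vectors
        top = np.argsort(-scores)[:k]
        return [chunk_ids[i] for i in top]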



Leaderboard table columns: Retriever, Size, and Date, followed by α@10, C@20, and R@50 for each of Avg. (5), LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4.

Results of different retrieval models across datasets on FreshStack. The best-performing model in each category is in bold, and the second best is underlined.
α@10 (alpha-nDCG@10) measures retrieval diversity and relevance, C@20 (Coverage@20) measures the fraction of nuggets supported by the retrieved documents, and R@50 (Recall@50) measures the percentage of relevant documents present in the retrieved documents.
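
For reference, the sketch below gives hedged implementations of Coverage@k and Recall@k matching the descriptions above; alpha-nDCG@10 additionally penalizes nugget redundancy in the ranking and is omitted here for brevity.

    # Hedged reference implementations of Coverage@k and Recall@k as described above.
    # alpha-nDCG@10 additionally discounts redundant nuggets and is not shown here.

    def coverage_at_k(ranked_ids: list[str], nugget_support: dict[str, set[str]], k: int = 20) -> float:
        # Fraction of nuggets supported by at least one document in the top-k ranking.
        if not nugget_support:
            return 0.0
        top_k = set(ranked_ids[:k])
        covered = sum(1 for docs in nugget_support.values() if top_k & set(docs))
        return covered / len(nugget_support)

    def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 50) -> float:
        # Percentage of relevant documents that appear in the top-k ranking.
        if not relevant_ids:
            return 0.0
        hits = len(set(ranked_ids[:k]) & set(relevant_ids))
        return hits / len(relevant_ids)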

BibTeX

@misc{thakur2025freshstack,
      title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents}, 
      author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
      year={2025},
      eprint={2504.13128},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2504.13128}, 
}