FreshStack Evaluation Framework

Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

University of Waterloo · Databricks

TL;DR: FreshStack is a holistic framework for building realistic and challenging RAG benchmarks from community-asked questions and answers on niche, fast-growing domains.

Abstract

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support of documents, retrieved using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. Existing retrieval models, when applied out of the box, significantly underperform oracle approaches on all five topics, indicating substantial headroom to improve IR quality.
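
A minimal sketch of the three stages, in Python, is given below. Everything in it is illustrative: the function names, the prompts, and the llm / retriever callables are assumptions made for exposition, not the released FreshStack implementation.

# Illustrative sketch of the three FreshStack stages described in the abstract.
# The function names, prompts, and the `llm` / `retriever` callables are
# assumptions for exposition, not the released FreshStack code.

def chunk_corpus(documents, max_chars=2000):
    """Step 1: split collected code files and technical docs into fixed-size chunks."""
    return [doc[i:i + max_chars]
            for doc in documents
            for i in range(0, len(doc), max_chars)]

def generate_nuggets(question, answer, llm):
    """Step 2: decompose a community answer into atomic 'nuggets' with an LLM."""
    prompt = ("List the atomic facts (nuggets) required to answer the question.\n"
              f"Question: {question}\nAnswer: {answer}")
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

def support_nuggets(nuggets, retrievers, llm, k=50):
    """Step 3: retrieve candidate chunks per nugget with a pool ('fusion') of
    retrievers, then keep only chunks the LLM judges as supporting the nugget."""
    judgments = {}
    for nugget in nuggets:
        pooled = {chunk for retrieve in retrievers for chunk in retrieve(nugget, k)}
        judgments[nugget] = [c for c in pooled
                             if llm(f"Does this chunk support the nugget?\n"
                                    f"Nugget: {nugget}\nChunk: {c}").lower().startswith("yes")]
    return judgments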

The video shows the step-by-step procedure followed in FreshStack for constructing evaluation datasets.

FreshStack Leaderboard

We evaluate various retrieval models on FreshStack (October 2024 version). In each category of retrieval models, we consider both proprietary and open-source embedding models. Evaluation is conducted in a zero-shot setting: given a query, the model must retrieve the relevant document chunks. We report results on five datasets: LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4.
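
As a concrete illustration of the zero-shot setting, the sketch below embeds a query and the corpus chunks with an off-the-shelf open-source embedding model and ranks chunks by cosine similarity. The model name is only an example and does not correspond to any particular leaderboard entry.

import numpy as np
from sentence_transformers import SentenceTransformer

# Example open-source embedding model; any query/chunk encoder would do here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(query, chunks, k=50):
    """Rank document chunks for a query by cosine similarity of embeddings."""
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(chunks, normalize_embeddings=True)
    scores = (q @ d.T)[0]              # cosine similarity, since embeddings are normalized
    order = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in order]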


[Interactive leaderboard table. Controls switch between results computed using the Stack Overflow answer or the nuggets, and filter open-source vs. proprietary models; tabs select one of the five datasets. For each dataset, the table lists every retriever's size and release date alongside three scores: α-nDCG@10 (α-n@10), Coverage@20 (C@20), and Recall@50 (R@50).]

Results of different retrieval models across datasets on FreshStack. The best-performing model in each category is in bold, and the second best is underlined.
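
For reference, the sketch below implements plausible definitions of the three reported metrics: Coverage@20 (fraction of nuggets supported by at least one chunk in the top 20), Recall@50 (fraction of relevant chunks retrieved in the top 50), and α-nDCG@10 (a nugget-diversified nDCG). These are standard formulations and may differ in detail from the exact leaderboard implementation.

import math

def coverage_at_k(ranked_chunks, nugget_support, k=20):
    """nugget_support: dict mapping each nugget to the set of chunk ids supporting it."""
    top = set(ranked_chunks[:k])
    covered = sum(1 for chunks in nugget_support.values() if top & chunks)
    return covered / max(len(nugget_support), 1)

def recall_at_k(ranked_chunks, relevant_chunks, k=50):
    """Fraction of all relevant chunk ids found within the top-k ranking."""
    top = set(ranked_chunks[:k])
    return len(top & relevant_chunks) / max(len(relevant_chunks), 1)

def alpha_ndcg_at_k(ranked_chunks, nugget_support, k=10, alpha=0.5):
    """Diversity-aware nDCG: a chunk's gain for a nugget is discounted by
    (1 - alpha) for every earlier chunk that already covered that nugget."""
    def gains(ranking):
        seen = {n: 0 for n in nugget_support}
        out = []
        for chunk in ranking[:k]:
            g = 0.0
            for n, chunks in nugget_support.items():
                if chunk in chunks:
                    g += (1 - alpha) ** seen[n]
                    seen[n] += 1
            out.append(g)
        return out

    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs))

    # Greedy approximation of the ideal (diversity-maximizing) ranking.
    pool = {c for chunks in nugget_support.values() for c in chunks}
    ideal, seen = [], {n: 0 for n in nugget_support}
    while pool and len(ideal) < k:
        best = max(pool, key=lambda c: sum((1 - alpha) ** seen[n]
                                           for n, ch in nugget_support.items() if c in ch))
        ideal.append(best)
        pool.remove(best)
        for n, ch in nugget_support.items():
            if best in ch:
                seen[n] += 1

    ideal_dcg = dcg(gains(ideal))
    return dcg(gains(ranked_chunks)) / ideal_dcg if ideal_dcg > 0 else 0.0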

BibTeX

@misc{thakur2025freshstack,
      title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents}, 
      author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
      year={2025},
      eprint={2504.13128},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2504.13128}, 
}