FreshStack Evaluation Framework

Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

University of Waterloo      Databricks

TL;DR: FreshStack is a holistic framework for building realistic and challenging RAG benchmarks from community-asked questions and answers on niche, fast-growing domains.

Abstract

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support of documents, retrieved using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. Existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, indicating plenty of headroom to improve IR quality.
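To make step (3) concrete, below is a minimal sketch of the kind of rank fusion used to pool candidate documents from several retrievers before nugget-level support judgments. It uses reciprocal rank fusion (RRF); the function name, the example retrievers, and the k=60 constant are illustrative assumptions, not the paper's exact configuration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked document lists from several retrievers via RRF.

    ranked_lists: list of lists of doc ids, each ordered best-first
                  (e.g., outputs of BM25 and a dense retriever).
    k: smoothing constant; 60 is a common default, not FreshStack-specific.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score is better; return doc ids best-first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse candidates from two retrievers for one question,
# then pass the top documents on for nugget-level support assessment.
bm25_hits = ["doc7", "doc2", "doc9"]
dense_hits = ["doc2", "doc5", "doc7"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits])[:3])
```

Fusing several retrievers this way broadens the candidate pool, which reduces the bias of judging support only against documents that a single retrieval system happens to surface.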

Video: a step-by-step walkthrough of the procedure FreshStack follows to construct its evaluation datasets.

BibTeX

@misc{thakur2025freshstack,
      title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents}, 
      author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
      year={2025},
      eprint={2504.13128},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2504.13128}, 
}