We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support judgments over documents retrieved using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. Existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, indicating plenty of headroom to improve IR quality.
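To illustrate the retrieval fusion used in step (3), the sketch below combines ranked lists from several retrievers with reciprocal rank fusion, a standard way to fuse runs before judging which nuggets each retrieved chunk supports. The run lists and document IDs are hypothetical and not taken from FreshStack.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of ranked lists, one per retriever (best document first).
    k: smoothing constant from the standard RRF formula.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score = ranked earlier in the combined list.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse a lexical run and a dense run for one question,
# then pass the top fused chunks on for nugget-level support judgment.
bm25_run = ["doc3", "doc1", "doc7"]
dense_run = ["doc1", "doc9", "doc3"]
fused = reciprocal_rank_fusion([bm25_run, dense_run])
print(fused[:5])
```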
We evaluate various retrieval models on FreshStack (October 2024 version). For each model category, we consider both closed- and open-source embedding models. Our evaluation is conducted in a zero-shot setting: given a query, models retrieve the relevant document chunks. We report results on five datasets: LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4.
|  |  |  | LangChain |  |  | Yolo v7 & v8 |  |  | Laravel 10 & 11 |  |  | Angular 16, 17 & 18 |  |  | Godot 4 |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Retriever | Size | Date | α-nDCG@10 | Cov@20 | R@50 | α-nDCG@10 | Cov@20 | R@50 | α-nDCG@10 | Cov@20 | R@50 | α-nDCG@10 | Cov@20 | R@50 | α-nDCG@10 | Cov@20 | R@50 |
Results of different retrieval models across datasets on FreshStack. The best-performing model in each category is in bold, and the second best is underlined.
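The Cov@20 and R@50 columns above are nugget coverage and document recall metrics. The sketch below shows one common way to compute them from a ranked run and nugget-level support judgments; the helper names and data are illustrative, and the paper's official scoring may define the metrics slightly differently (α-nDCG@10 additionally discounts redundant nuggets and is omitted here).

```python
def coverage_at_k(ranked_docs, nugget_support, k=20):
    """Fraction of nuggets supported by at least one document in the top-k.

    nugget_support: dict mapping nugget_id -> set of doc IDs judged to support it.
    """
    top_k = set(ranked_docs[:k])
    covered = sum(1 for docs in nugget_support.values() if docs & top_k)
    return covered / len(nugget_support) if nugget_support else 0.0

def recall_at_k(ranked_docs, relevant_docs, k=50):
    """Fraction of relevant documents (those supporting any nugget) found in the top-k."""
    top_k = set(ranked_docs[:k])
    return len(top_k & relevant_docs) / len(relevant_docs) if relevant_docs else 0.0

# Hypothetical example: two nuggets, each supported by certain chunk IDs.
run = ["c7", "c2", "c9", "c1"]
support = {"n1": {"c2", "c5"}, "n2": {"c8"}}
print(coverage_at_k(run, support, k=2))           # 0.5: only n1 is covered in the top-2
print(recall_at_k(run, {"c2", "c5", "c8"}, k=2))  # ~0.33: one of three relevant chunks found
```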
@misc{thakur2025freshstack,
title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents},
author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
year={2025},
eprint={2504.13128},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2504.13128},
}