Retrieval-Augmented Generation (RAG) uses Large Language Models (LLMs) to generate factual, correct answers to user questions by citing information from ground-truth documents. However, it is currently difficult to evaluate RAG systems both reliably and efficiently across multiple languages, so it remains unclear how well current LLMs generate reasoned answers to questions asked in a non-English language. Traditional RAG benchmarks fall broadly into two categories: heuristic-based benchmarks, which score systems with automatic features, and arena-based benchmarks, which rely on pairwise LLM-as-a-Judge comparisons.
MIRAGE-Bench is designed to combine the best of both worlds: heuristic-based and arena-based evaluation. MIRAGE-Bench helps accurately and efficiently evaluate RAG systems that answer human-generated questions in a monolingual setting, e.g., a user asks a question in Hindi and wishes to read the answer back in Hindi. MIRAGE-Bench is constructed on top of the MIRACL dataset.
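To make the monolingual setting concrete, below is a minimal sketch (not the actual MIRAGE-Bench prompt or code) of how a RAG prompt could be assembled so that the model answers in the question's own language and cites the retrieved passages:

```python
# Illustrative sketch of the monolingual RAG setting: question and retrieved
# passages are in the same language (here Hindi), and the model is asked to
# answer in that language while citing passages with markers like [1].
# The template and helper name are placeholders, not MIRAGE-Bench's own prompt.

def build_rag_prompt(question: str, passages: list[str], language: str) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Answer the question in {language}, citing the passages you use "
        f"with markers like [1].\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    question="ताजमहल किसने बनवाया था?",  # "Who built the Taj Mahal?"
    passages=["ताजमहल का निर्माण मुग़ल सम्राट शाहजहाँ ने करवाया था।"],
    language="Hindi",
)
print(prompt)
```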
MIRAGE-Bench contains a rich diversity of topics in human-generated questions spanning 18 different languages. The method is cost-effective, as it only requires training a surrogate judge, a regression model that scores systems from heuristic-based features, and it is extensible, since the regression model can be retrained within a few minutes on a CPU when presented with newer features, such as nugget recall or precision, or with additional systems for evaluation.
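As a rough illustration of what training such a surrogate judge could look like, the sketch below fits a random forest regressor that maps heuristic features to arena-style scores; the feature names and toy data are placeholders rather than the actual MIRAGE-Bench features, and this is not the project's own training code:

```python
# Minimal sketch of a surrogate judge: a random forest regressor mapping cheap
# heuristic features to arena-style scores derived from LLM-as-a-Judge
# comparisons. Feature names and the toy data below are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

feature_names = ["citation_quality", "fluency", "nugget_recall", "nugget_precision"]

# One row of heuristic features per evaluated (model, language) RAG system.
rng = np.random.default_rng(0)
X = rng.random((40, len(feature_names)))   # placeholder heuristic features
y = rng.random(40)                         # placeholder arena-style scores

surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(surrogate, X, y, cv=5).mean())  # sanity-check fit quality
surrogate.fit(X, y)

# Adding a new feature or new systems only requires rebuilding X, y and
# calling fit() again, which takes minutes (or less) on a CPU.
synthetic_scores = surrogate.predict(X)    # scores for a synthetic leaderboard
```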
Quantitative Analysis: The MIRAGE-Bench synthetic leaderboard, produced by training a surrogate judge, shows a high correlation (Kendall Tau = 0.909) with the GPT-4o-based LLM-as-a-Judge leaderboard, indicating the effectiveness of the random forest model acting as the surrogate judge.
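Leaderboard agreement of this kind can be measured by ranking systems under both judges and computing Kendall's Tau over the two rankings, for example with SciPy (the scores below are made up for illustration):

```python
# Kendall's Tau between a surrogate-judge (synthetic) leaderboard and an
# LLM-as-a-Judge leaderboard: score the same systems under both and correlate.
# The scores here are invented; MIRAGE-Bench reports a correlation of 0.909.
from scipy.stats import kendalltau

systems = ["system-a", "system-b", "system-c", "system-d"]
synthetic_scores = [0.92, 0.81, 0.55, 0.74]  # surrogate-judge predictions
judge_scores = [0.95, 0.78, 0.50, 0.70]      # GPT-4o pairwise-judge scores

tau, p_value = kendalltau(synthetic_scores, judge_scores)
print(f"Kendall Tau = {tau:.3f} (p = {p_value:.3f})")
```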
Heatmaps: We observe that closed-source and large multilingual-focused LLMs dominate multilingual generation: GPT-4 and GPT-4o currently lead at answering questions in MIRACL, and the next-best open-weights model is Meta-Llama-3-70B.
Benchmarking: MIRAGE-Bench evaluates 21 multilingual-focused LLMs across 18 languages. The benchmark is cost-effective and extensible, as it only requires training a surrogate judge on heuristic-based features to evaluate RAG systems; the surrogate judge can be retrained within minutes on a CPU when presented with newer features or additional systems for evaluation.
@article{thakur-mirage:2024,
author = {Nandan Thakur and
Suleman Kazi and
Ge Luo and
Jimmy Lin and
Amin Ahmad},
title = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented
Generation Systems},
journal = {CoRR},
volume = {abs/2410.13716},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2410.13716},
doi = {10.48550/ARXIV.2410.13716},
eprinttype = {arXiv},
eprint = {2410.13716},
}