MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

University of Waterloo · Vectara
NAACL 2025 Main Proceedings
MIRAGE-Bench Evaluation Framework

AI-generated image using Google Gemini showing the mirage effect.

Introduction

Retrieval-Augmented Generation (RAG) uses Large Language Models (LLMs) to generate factual, correct answers to user questions by citing information from ground-truth documents. However, evaluating RAG systems both reliably and efficiently across multiple languages remains a challenge, so it is still unclear how well current LLMs generate well-reasoned answers to questions asked in non-English languages. Traditional RAG benchmarks fall broadly into two categories:

  • Heuristic-focused: The manual, traditional approach to benchmarking RAG systems. It relies on hand-crafted evaluation metrics that demand extensive human effort (research and experimentation) and/or gold-standard answers, which are scarce because human annotations are few.
  • Arena-focused: An automatic approach in which systems are pitted against each other pairwise and judged by an LLM (LLM-as-a-judge), e.g., MT-Bench. Automatic, however, does not mean cheap: exhaustive pairwise comparisons by a strong judge LLM such as GPT-4o quickly become expensive, since comparing n systems requires n(n-1)/2 judgments per query (210 GPT-4o calls per query for the 21 systems evaluated here).

What is MIRAGE-Bench?

MIRAGE-Bench is designed to combine the best of both worlds: heuristic and arena-based evaluation. It accurately and efficiently evaluates RAG systems that answer human-generated questions in a monolingual setting, e.g., a user asks a question in Hindi and wants to read the answer back in Hindi. MIRAGE-Bench is built on top of MIRACL, a popular multilingual retrieval dataset, and extends it to the answer generation task in multilingual RAG.

MIRAGE-Bench covers a rich diversity of topics in human-generated questions across 18 languages. The method is cost-effective, since it only requires training a surrogate judge (a regression model over heuristic-based features), and extensible, since the regression model can be retrained within a few minutes on a CPU whenever new features (such as nugget recall or precision) or additional systems are added.

The video shows the step-by-step procedure followed to construct MIRAGE-Bench.

MIRAGE-Bench produces a synthetic arena leaderboard by training a surrogate judge on heuristic-based features to predict the Bradley-Terry model coefficients distilled from LLM-as-a-judge pairwise comparisons. Below is a step-by-step breakdown of the process:

  • Heuristic-Based Features: MIRAGE-Bench assesses 21 state-of-the-art multilingual-focused LLMs (from OpenAI, Meta, Google, etc.) using seven heuristic RAG evaluation features, including support evaluation, fluency, and answer overlap. These features let MIRAGE-Bench evaluate RAG systems on heuristics without requiring human intervention.
  • Pairwise Comparisons with GPT-4o: MIRAGE-Bench runs exhaustive LLM-as-a-judge pairwise comparisons on a small subset of evaluation queries (at most 100). We found exhaustive comparison necessary: this step provides the high-quality labeled data for training the surrogate model, ensuring that the trained surrogate judge is accurate and reliable.
  • Surrogate Judge (Regression Model): A regression model such as a random forest is trained on the heuristic-based features to predict the LLM-as-a-judge results, expressed as Bradley-Terry coefficients, with bootstrapping used to maximize the training pairs obtained from the small judged subset. At inference time, the surrogate judge produces a synthetic arena-style leaderboard for any set of models without recomputing the expensive pairwise comparisons (see the sketch below).
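
Below is a minimal sketch of this pipeline, assuming a dict features that maps each model to its seven heuristic scores and a list battles of GPT-4o pairwise judgments; these names and the logistic-regression formulation of the Bradley-Terry fit are illustrative, not the paper's exact implementation.

# Minimal sketch of the surrogate-judge pipeline (illustrative, not the paper's exact code).
# Assumed inputs:
#   features: dict mapping model name -> 7-dim array of heuristic scores (support, fluency, ...)
#   battles:  list of (model_i, model_j, winner) tuples from GPT-4o pairwise judgments
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def bradley_terry_scores(models, battles):
    """Fit Bradley-Terry log-strengths from pairwise outcomes via logistic regression."""
    idx = {m: k for k, m in enumerate(models)}
    X, y = [], []
    for i, j, winner in battles:
        row = np.zeros(len(models))
        row[idx[i]], row[idx[j]] = 1.0, -1.0      # +1 for the first model, -1 for the second
        X.append(row)
        y.append(1 if winner == i else 0)         # 1 if the first model won the comparison
    clf = LogisticRegression(fit_intercept=False, C=1e6)  # large C approximates unregularized MLE
    clf.fit(np.array(X), np.array(y))
    return {m: clf.coef_[0][idx[m]] for m in models}

models = list(features)                           # e.g., the 21 multilingual LLMs
rng = np.random.default_rng(0)

# Bootstrap the small set of judged battles to obtain stable Bradley-Terry targets.
boot_coefs = []
for _ in range(100):
    sample = [battles[k] for k in rng.integers(0, len(battles), len(battles))]
    coefs = bradley_terry_scores(models, sample)
    boot_coefs.append([coefs[m] for m in models])
targets = np.mean(boot_coefs, axis=0)             # averaged Bradley-Terry coefficient per model

# Train the surrogate judge: heuristic features -> Bradley-Terry coefficient.
X_train = np.array([features[m] for m in models])
surrogate = RandomForestRegressor(n_estimators=500, random_state=0)
surrogate.fit(X_train, targets)

# Inference: rank models from heuristic features alone, without new pairwise comparisons.
leaderboard = sorted(zip(models, surrogate.predict(X_train)), key=lambda t: -t[1])

Because the random forest is the only trained component, adding a new feature column (say, nugget recall) or new systems simply means refitting surrogate, which takes only minutes on a CPU.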

MIRAGE-Bench Observations

Quantitative Analysis: The synthetic leaderboard produced by the trained surrogate judge correlates highly (Kendall Tau = 0.909) with the GPT-4o LLM-as-a-judge leaderboard, indicating the effectiveness of the random forest model acting as a surrogate judge.
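
As a quick illustration of how that agreement is measured, the sketch below computes Kendall's Tau between two leaderboards; the scores are hypothetical placeholders, not the benchmark's actual numbers.

# Illustrative rank-agreement check between two leaderboards (scores are hypothetical).
from scipy.stats import kendalltau

judge_scores     = [1.31, 0.92, 0.40, -0.25, -0.88]   # GPT-4o LLM-as-a-judge Bradley-Terry scores
surrogate_scores = [1.20, 0.95, 0.31, -0.30, -0.79]   # surrogate-judge predictions, same model order

tau, p_value = kendalltau(judge_scores, surrogate_scores)
print(f"Kendall Tau = {tau:.3f}")                      # 1.0 would mean identical rankings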

Heatmaps: We observe that closed-source and large multilingual-focused LLMs dominate multilingual generation: GPT-4 and GPT-4o currently lead in answering questions from MIRACL, and the best open-weights model is Meta-Llama-3-70B.

Benchmarking: MIRAGE-Bench evaluates 21 multilingual-focused LLMs across 18 languages. The benchmark is cost-effective and extensible: it only requires training a surrogate judge on heuristic-based features, and the surrogate judge can be retrained within minutes on a CPU when new features or additional systems are added.

BibTeX

@article{thakur-mirage:2024,
        author       = {Nandan Thakur and
                        Suleman Kazi and
                        Ge Luo and
                        Jimmy Lin and
                        Amin Ahmad},
        title        = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented
                        Generation Systems},
        journal      = {CoRR},
        volume       = {abs/2410.13716},
        year         = {2024},
        url          = {https://doi.org/10.48550/arXiv.2410.13716},
        doi          = {10.48550/ARXIV.2410.13716},
        eprinttype   = {arXiv},
        eprint       = {2410.13716},
        timestamp    = {Wed, 27 Nov 2024 09:01:16 +0100},
        biburl       = {https://dblp.org/rec/journals/corr/abs-2410-13716.bib},
        bibsource    = {dblp computer science bibliography, https://dblp.org}
      }