MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

University of Waterloo · Vectara
NAACL 2025 Main Proceedings
MIRAGE-Bench Evaluation Framework

AI-generated image using Google Gemini showing the mirage effect.

Introduction

Retrieval-Augmented Generation (RAG) uses Large Language Models (LLMs) to generate factual, correct answers to user questions by citing information from ground-truth documents. However, evaluating RAG systems both reliably and efficiently across multiple languages remains a challenge, so it is still unclear how well current LLMs generate well-reasoned answers to questions asked in non-English languages. Traditional RAG benchmarks fall broadly into two categories:

  • Heuristic-focused: The manual, traditional approach to benchmarking RAG systems. It relies on hand-crafted evaluation metrics that demand extensive human effort (research and experimentation) and/or gold-standard answers, which are scarce because human annotations are few.
  • Arena-focused: An automatic approach in which systems are pitted against each other pairwise and judged by an LLM (LLM-as-a-judge), e.g., MT-Bench. Automatic, however, does not mean cheap: exhaustive pairwise comparisons by a strong judge LLM such as GPT-4o quickly become expensive, since comparing n systems requires n(n-1)/2 judgments per query (210 GPT-4o calls per query for the 21 systems evaluated here).

What is MIRAGE-Bench?

MIRAGE-Bench is designed to combine the best of both worlds: heuristic and arena-based evaluation. It accurately and efficiently evaluates RAG systems that answer human-generated questions in a monolingual setting, e.g., a user asks a question in Hindi and wants to read the answer back in Hindi. MIRAGE-Bench is built on top of MIRACL, a popular multilingual retrieval dataset, and extends it to the answer generation task in multilingual RAG.

MIRAGE-Bench covers a rich diversity of topics in human-generated questions across 18 languages. The method is cost-effective, since it only requires training a surrogate judge (a regression model over heuristic-based features), and extensible, since the regression model can be retrained within a few minutes on a CPU whenever new features (such as nugget recall or precision) or additional systems are added.

The video shows the step-by-step procedure followed to construct MIRAGE-Bench.

MIRAGE-Bench produces a synthetic arena leaderboard by training a surrogate judge on heuristic-based features to predict the Bradley-Terry model coefficients distilled from LLM-as-a-judge pairwise comparisons. Below is a step-by-step breakdown of the process:

  • Heuristic-Based Features: MIRAGE-Bench assesses 21 state-of-the-art multilingual-focused LLMs (from OpenAI, Meta, Google, etc.) using seven heuristic RAG evaluation features, including support evaluation, fluency, and answer overlap. These features let MIRAGE-Bench evaluate RAG systems on heuristics without requiring human intervention.
  • Pairwise Comparisons with GPT-4o: MIRAGE-Bench runs exhaustive LLM-as-a-judge pairwise comparisons on a small subset of evaluation queries (at most 100). We found exhaustive comparison necessary: this step provides the high-quality labeled data for training the surrogate model, ensuring that the trained surrogate judge is accurate and reliable.
  • Surrogate Judge (Regression Model): A regression model such as a random forest is trained on the heuristic-based features to predict the LLM-as-a-judge results, expressed as Bradley-Terry coefficients, with bootstrapping used to maximize the training pairs obtained from the small judged subset. At inference time, the surrogate judge produces a synthetic arena-style leaderboard for any set of models without recomputing the expensive pairwise comparisons (see the sketch below).
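
Below is a minimal sketch of this pipeline, assuming a dict features that maps each model to its seven heuristic scores and a list battles of GPT-4o pairwise judgments; these names and the logistic-regression formulation of the Bradley-Terry fit are illustrative, not the paper's exact implementation.

# Minimal sketch of the surrogate-judge pipeline (illustrative, not the paper's exact code).
# Assumed inputs:
#   features: dict mapping model name -> 7-dim array of heuristic scores (support, fluency, ...)
#   battles:  list of (model_i, model_j, winner) tuples from GPT-4o pairwise judgments
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def bradley_terry_scores(models, battles):
    """Fit Bradley-Terry log-strengths from pairwise outcomes via logistic regression."""
    idx = {m: k for k, m in enumerate(models)}
    X, y = [], []
    for i, j, winner in battles:
        row = np.zeros(len(models))
        row[idx[i]], row[idx[j]] = 1.0, -1.0      # +1 for the first model, -1 for the second
        X.append(row)
        y.append(1 if winner == i else 0)         # 1 if the first model won the comparison
    clf = LogisticRegression(fit_intercept=False, C=1e6)  # large C approximates unregularized MLE
    clf.fit(np.array(X), np.array(y))
    return {m: clf.coef_[0][idx[m]] for m in models}

models = list(features)                           # e.g., the 21 multilingual LLMs
rng = np.random.default_rng(0)

# Bootstrap the small set of judged battles to obtain stable Bradley-Terry targets.
boot_coefs = []
for _ in range(100):
    sample = [battles[k] for k in rng.integers(0, len(battles), len(battles))]
    coefs = bradley_terry_scores(models, sample)
    boot_coefs.append([coefs[m] for m in models])
targets = np.mean(boot_coefs, axis=0)             # averaged Bradley-Terry coefficient per model

# Train the surrogate judge: heuristic features -> Bradley-Terry coefficient.
X_train = np.array([features[m] for m in models])
surrogate = RandomForestRegressor(n_estimators=500, random_state=0)
surrogate.fit(X_train, targets)

# Inference: rank models from heuristic features alone, without new pairwise comparisons.
leaderboard = sorted(zip(models, surrogate.predict(X_train)), key=lambda t: -t[1])

Because the random forest is the only trained component, adding a new feature column (say, nugget recall) or new systems simply means refitting surrogate, which takes only minutes on a CPU.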

MIRAGE-Bench Observations

Quantitative Analysis: The synthetic leaderboard produced by the trained surrogate judge correlates highly (Kendall Tau = 0.909) with the GPT-4o LLM-as-a-judge leaderboard, indicating the effectiveness of the random forest model acting as a surrogate judge.
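
As a quick illustration of how that agreement is measured, the sketch below computes Kendall's Tau between two leaderboards; the scores are hypothetical placeholders, not the benchmark's actual numbers.

# Illustrative rank-agreement check between two leaderboards (scores are hypothetical).
from scipy.stats import kendalltau

judge_scores     = [1.31, 0.92, 0.40, -0.25, -0.88]   # GPT-4o LLM-as-a-judge Bradley-Terry scores
surrogate_scores = [1.20, 0.95, 0.31, -0.30, -0.79]   # surrogate-judge predictions, same model order

tau, p_value = kendalltau(judge_scores, surrogate_scores)
print(f"Kendall Tau = {tau:.3f}")                      # 1.0 would mean identical rankings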

Heatmaps: We observe that closed-source and large multilingual-focused LLMs dominate multilingual generation: GPT-4 and GPT-4o currently lead in answering questions from MIRACL, and the best open-weights model is Meta-Llama-3-70B.

Benchmarking: MIRAGE-Bench evaluates 21 multilingual-focused LLMs across 18 languages. The benchmark is cost-effective and extensible: it only requires training a surrogate judge on heuristic-based features, and the surrogate judge can be retrained within minutes on a CPU when new features or additional systems are added.

BibTeX

@article{thakur-mirage:2024,
        author       = {Nandan Thakur and
                        Suleman Kazi and
                        Ge Luo and
                        Jimmy Lin and
                        Amin Ahmad},
        title        = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented
                        Generation Systems},
        journal      = {CoRR},
        volume       = {abs/2410.13716},
        year         = {2024},
        url          = {https://doi.org/10.48550/arXiv.2410.13716},
        doi          = {10.48550/ARXIV.2410.13716},
        eprinttype   = {arXiv},
        eprint       = {2410.13716},
        timestamp    = {Wed, 27 Nov 2024 09:01:16 +0100},
        biburl       = {https://dblp.org/rec/journals/corr/abs-2410-13716.bib},
        bibsource    = {dblp computer science bibliography, https://dblp.org}
      }