Speculative RAG is a framework that uses a larger generalist language model to efficiently verify multiple RAG drafts produced in parallel by a smaller, specialised distilled language model.
Each draft is based on a distinct subset of retrieved documents, providing diverse perspectives and reducing input token counts per draft.
According to the research, this method enhances comprehension and mitigates position bias over long contexts. By delegating drafting to the smaller model and having the larger model perform a single verification pass, Speculative RAG accelerates the RAG process.
Experiments show that Speculative RAG achieves state-of-the-art performance, improving accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems.
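To make the flow concrete, below is a minimal Python sketch of the draft-then-verify idea. The helper names `draft_fn` and `verify_fn` are assumptions for illustration, and the random subset sampling is a simplification of the paper's diversity-oriented document sampling; treat this as a sketch of the technique, not the paper's implementation.

```python
import random

def speculative_rag(question, documents, draft_fn, verify_fn, m=5, k=2):
    # Generate m drafts, each grounded in a distinct subset of k documents,
    # so every draft sees a small, diverse slice of the retrieved evidence.
    drafts = []
    for _ in range(m):
        subset = random.sample(documents, k)            # distinct subset per draft
        answer, rationale = draft_fn(question, subset)  # small specialist drafter
        drafts.append((answer, rationale))

    # Single verification pass by the larger generalist LM: score each
    # draft and return the most promising one as the final answer.
    scores = [verify_fn(question, answer, rationale)
              for answer, rationale in drafts]
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best][0]
```

In practice the `m` drafting calls would run concurrently; the loop is kept sequential here for readability.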
This new RAG framework uses a smaller specialist RAG drafter to generate high-quality draft answers.
The generalist LM works with the RAG drafter without needing additional tuning.
It verifies and integrates the most promising draft into the final answer, enhancing comprehension of each subset and mitigating the lost-in-the-middle phenomenon.
Google believes this method significantly accelerates RAG by having the smaller specialist LM handle drafting, while the larger generalist LM performs a single, unbiased verification pass over the drafts in parallel.
Extensive experiments on four free-form question-answering and closed-set generation benchmarks demonstrate the superior effectiveness and efficiency of this method.
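As a hedged illustration of what that single verification pass might look like: the generalist LM can score each draft using nothing but forward passes, for instance by combining how probable the draft is given the question with how probable a "Yes" is when the model is asked whether the rationale supports the response. `logprob_fn` is an assumed helper, not an API from the paper.

```python
import math

def verify_draft(question, answer, rationale, logprob_fn):
    # logprob_fn(prompt, continuation) is assumed to return the generalist
    # LM's log-probability of `continuation` given `prompt`.

    # Self-containment: how plausible is the full draft given the question?
    rho_sc = math.exp(logprob_fn(question, f"{rationale} {answer}"))

    # Self-reflection: how likely is the verifier to agree that the
    # rationale actually supports the response?
    reflection_prompt = (
        f"Question: {question}\n"
        f"Response: {answer}\n"
        f"Rationale: {rationale}\n"
        "Does the rationale support the response? Answer Yes or No:"
    )
    rho_sr = math.exp(logprob_fn(reflection_prompt, " Yes"))

    # Higher is better; the top-scoring draft becomes the final answer.
    return rho_sc * rho_sr
```

Because both scores are plain forward passes of an off-the-shelf LM, the verifier needs no additional tuning, consistent with the point above.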
- This study is a good example of how Small Language Models are being used within a larger framework that employs model orchestration.
- SLMs are leveraged for their reasoning capabilities, for which they can be specifically trained.
- SLMs are ideal in this scenario, as they are not required to be knowledge intensive for this implementation; relevant and contextual knowledge is injected at inference (see the sketch after this list).
- The aim of this framework is to optimise token count and hence save cost.
- Reduces latency by 51% compared to conventional RAG systems.
- Enhances accuracy by up to 12.97%.
- Avoids additional fine-tuning of the large generalist LM; only the small drafter is tuned.
- Multiple RAG drafts are produced in parallel by a smaller, specialised language model.
- This smaller, specialised RAG model excels at reasoning over retrieved documents and can rapidly produce accurate responses. It is reminiscent of SLMs like Orca-2 and Phi-3, which were trained for exceptional reasoning capabilities.
- The best results were achieved with Mistral 7B as the RAG drafter and Mixtral 8x7B as the verifier.
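To illustrate the knowledge-injection point from the list above: the sketch below shows how one draft's subset of retrieved documents could be folded into the drafter's prompt at inference time, so the SLM itself never has to be knowledge intensive. The prompt template is an assumption for illustration, not the paper's exact prompt.

```python
def build_drafter_prompt(question, doc_subset):
    # Inject the retrieved knowledge directly into the prompt, so the
    # small drafter only has to reason over it, not memorise it.
    context = "\n\n".join(
        f"[Document {i + 1}] {doc}" for i, doc in enumerate(doc_subset)
    )
    return (
        "Answer the question using only the evidence below. "
        "Give a brief rationale, then the answer.\n\n"
        f"{context}\n\nQuestion: {question}\nRationale and answer:"
    )

# Example: one draft's subset of two documents, injected at inference.
print(build_drafter_prompt(
    "What does the RAG drafter produce?",
    ["The drafter generates a draft answer with a supporting rationale.",
     "Each draft is grounded in a distinct subset of retrieved documents."],
))
```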