Salesforce propose using summarisation as a testbed for evaluating long-context models and RAG systems.
Summarisation requires reasoning over a long context and a careful understanding of the relative importance of content.
The Problem Identified:
Prior work on summarisation evaluation, particularly in evaluating the relevance of summaries, has focused on single-document summarisation or tasks in which the input content is on the order of 1,000–2,000 tokens.
Even longer-input settings, such as conversational and multi-document news summarisation, are still typically limited to around 10k tokens.
A major problem in summarisation evaluation is the reliance on low-quality reference summaries and automatic metrics that poorly correlate with human judgments.
Traditional evaluations compare candidate summaries to gold-standard references, assuming higher overlap indicates better quality. This approach is unreliable, especially for long-context settings where high-quality references are expensive to obtain. Even the best automatic metrics for content coverage often fail to correlate well with human judgments.
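To make the weakness concrete, here is a minimal sketch of the kind of reference-overlap scoring described above, a simplified ROUGE-1 recall. The function and example texts are illustrative, not from the SummHay paper: a summary that faithfully paraphrases the reference can still score near zero on lexical overlap.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / sum(ref.values())

# Hypothetical example: a faithful paraphrase scores poorly on overlap alone.
reference = "the summary must cite the source documents for every insight"
paraphrase = "each insight needs a citation pointing back to where it came from"
print(rouge1_recall(paraphrase, reference))  # ~0.1 despite matching meaning
```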
To address these issues, Salesforce turn to synthetic data generation.
As illustrated in the image below, their approach involves creating a large corpus of documents (the “Haystack”) on a given topic, while ensuring that specific insights repeat across documents.
By controlling which insights appear in which documents, Salesforce can automatically determine the relevant insights for a search query. The SummHay task requires systems to summarise these insights and cite their sources. Summaries are evaluated based on coverage of expected insights and accuracy in citing source documents.
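Because insight placement is controlled at generation time, the oracle answer for any query, and the coverage and citation scores of a system summary, follow mechanically. Below is a hedged sketch of how such a setup could be wired together; all names (`Insight`, `oracle_for_query`, `score_summary`) and data are illustrative assumptions, not Salesforce's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    insight_id: str
    text: str
    doc_ids: set[str]  # Haystack documents that mention this insight

# Insight-to-document placement is recorded when the Haystack is generated.
insights = [
    Insight("i1", "Remote work improves reported job satisfaction", {"d1", "d4", "d7"}),
    Insight("i2", "Hybrid schedules reduce office costs", {"d2", "d4"}),
]

def oracle_for_query(query_insight_ids: set[str]) -> dict[str, set[str]]:
    """Map each insight relevant to the query to the documents it must cite."""
    return {i.insight_id: i.doc_ids for i in insights if i.insight_id in query_insight_ids}

def score_summary(covered: dict[str, set[str]], oracle: dict[str, set[str]]):
    """covered: insight_id -> doc_ids the system cited for that insight."""
    coverage = len(covered.keys() & oracle.keys()) / len(oracle)
    cited_correct = sum(len(covered[i] & oracle[i]) for i in covered if i in oracle)
    cited_total = sum(len(docs) for docs in covered.values()) or 1
    citation_precision = cited_correct / cited_total
    return coverage, citation_precision

oracle = oracle_for_query({"i1", "i2"})
system_output = {"i1": {"d1", "d7", "d9"}}  # found i1 only, cited one wrong doc
print(score_summary(system_output, oracle))  # (0.5, 0.666...)
```

The key design point this sketch tries to capture is that the evaluation needs no gold reference summary at all: the expected insights and their source documents are known by construction, so coverage and citation accuracy can be checked automatically.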