Text generation has become a critical component of modern AI applications such as chatbots and agentic assistants. To combat hallucinations and produce trustworthy output, retrieval-augmented generation (RAG) is the current norm for information-dense tasks. Report Generation is a long-form RAG task with strict attestation requirements, which makes it well suited to exploring questions of RAG evaluation and multilingual generation. In this task, a long-form report summarizing the relevant information in a corpus is generated in response to a report request, which consists of a user background and an information need. The generated report should provide proper attribution to the source documents to establish trust.
RAG4Reports aims to draw the attention of the ACL community to developing systems and evaluation methods for Multilingual Report Generation, which can lead to more general solutions for other long-form RAG problems as well.
Strong focus on groundedness and citation support.
Incorporating information in different languages is critical for effective RAG.
Paragraph-level, coherent, and information-dense responses.
RAG4Reports is interested in many aspects of the Report Generation problem. The following is a non-exhaustive list of topics. Please reach out to us if you are unsure whether your work is within the scope of this workshop.
Papers can either be submitted directly to RAG4Reports through OpenReview or committed through ACL Rolling Review (ARR). Submissions should use the ACL format and be at most 8 pages, excluding references. Reviewers will be asked to consider the paper's length when writing their reviews.
Direct submissions will undergo a single-blind peer-review process (i.e., reviewers will see the identities of the authors), and each paper will receive at least three reviews and one meta-review. For submissions proposing new resources, such as datasets or software, please provide the reviewers with access to the resource so they can assess its merit.
RAG4Reports will host two tasks:
As input for shared task participants, we will provide system-generated reports from 2025 TREC RAGTIME submissions that have been judged by human annotators. The task is to produce a system ranking for each report request (a long-form query with a description of the user's background), as well as an overall ranking across all report requests. Submitted rankings will be evaluated by their correlation with the rankings derived from human annotations. We will accept two types of submissions:
To study the effect of document language on evaluation, we will accept submissions that use an English translation of the corpus (provided by the organizers) or the multilingual corpus with documents in their original languages. We will use Auto-ARGUE as the baseline for this task.
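The exact correlation measure is not specified here; as an illustration only, the sketch below compares a submitted system ranking against a hypothetical human-derived ranking using Kendall's tau from SciPy. The system IDs and both rankings are purely hypothetical.

```python
# Minimal sketch (not the official scorer): comparing a submitted system
# ranking against a human-derived ranking with Kendall's tau.
# The system IDs and both rankings below are hypothetical.
from scipy.stats import kendalltau

# Rankings are expressed as ordered lists of system IDs, best first.
human_ranking = ["sys_A", "sys_C", "sys_B", "sys_D"]
submitted_ranking = ["sys_A", "sys_B", "sys_C", "sys_D"]

# Convert each ranking into a rank position per system so the two
# sequences can be compared pairwise.
systems = sorted(human_ranking)
human_ranks = [human_ranking.index(s) for s in systems]
submitted_ranks = [submitted_ranking.index(s) for s in systems]

tau, p_value = kendalltau(human_ranks, submitted_ranks)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```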
Participants will receive a set of report-generation responses to evaluate. Each generation system corresponds to a JSONL file in which each line is the response to one request; the file name is the generation system ID. Please see the submission format of the Multilingual Report Generation task for details.
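As an illustration only, the sketch below loads such per-system JSONL files. The directory name is hypothetical, and the field names follow the Multilingual Report Generation submission format described later on this page.

```python
# Minimal sketch: loading the per-system response files for the evaluation task.
# The directory name "generated_reports/" is hypothetical; each file is named
# after a generation system ID and contains one JSON response per line.
import json
from pathlib import Path

responses_by_system = {}
for path in Path("generated_reports").glob("*.jsonl"):
    system_id = path.stem  # file name is the generation system ID
    with path.open(encoding="utf-8") as f:
        # Each line is one report, keyed here by its topic_id.
        responses_by_system[system_id] = {
            report["metadata"]["topic_id"]: report
            for report in map(json.loads, f)
        }

print(f"Loaded {len(responses_by_system)} systems")
```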
The output format should be a TSV with the columns:
There will be a field in the submission portal to indicate which metric you would like used for the shared task. Your submission may contain multiple metrics; pick one for the official evaluation.
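As an illustration only (the official column names are defined by the task, not by this sketch), the snippet below writes a TSV with hypothetical columns for metric name, system ID, topic ID, and score.

```python
# Minimal sketch: writing a metric TSV for the evaluation task.
# The official column names are specified by the task guidelines; the
# columns used here (metric, system_id, topic_id, score) are hypothetical.
import csv

rows = [
    # (metric, system_id, topic_id, score) -- all values hypothetical
    ("my_metric", "sys_A", "topic_001", 0.82),
    ("my_metric", "sys_B", "topic_001", 0.47),
]

with open("evaluation_submission.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["metric", "system_id", "topic_id", "score"])
    writer.writerows(rows)
```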
This task involves generating long-form reports in response to a request, using information retrieved from a multilingual corpus. Report requests consist of background information about the user and a statement describing their information need in English. In contrast to other RAG tasks, reports should contain only information that is grounded in the corpus. Generated reports should consist of sentences with citations and will be subject to a length limit. Reports should be written in the same language as the report request. The corpus consists of four million English, Chinese, Russian, and Arabic documents drawn from Common Crawl News, sampled evenly from 2021 to 2024. The organizers will provide search services accessible through an API in addition to the corpus itself. Submitted reports will be judged automatically with the Auto-ARGUE framework, which scores reports based on whether nuggets of related information are present and correctly cited in the report. We plan to score reports using a range of LLMs to understand their agreement.
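Auto-ARGUE itself is not reproduced here; the sketch below only illustrates the general shape of nugget-based scoring, assuming an upstream judge has already decided, for each nugget, whether it is present in the report and whether it is supported by a citation. Both the data structure and the aggregation are illustrative only.

```python
# Minimal sketch of nugget-style scoring in the spirit of Auto-ARGUE
# (not the official implementation). It assumes an upstream judge has
# already decided whether each information nugget is present and whether
# it is attributed to a correct citation.

def nugget_score(judged_nuggets):
    """judged_nuggets: list of dicts with boolean 'present' and 'cited' keys."""
    if not judged_nuggets:
        return 0.0
    # A nugget only counts if it appears in the report AND carries a
    # citation judged to support it.
    supported = sum(1 for n in judged_nuggets if n["present"] and n["cited"])
    return supported / len(judged_nuggets)

# Hypothetical judgments for one report.
example = [
    {"nugget": "event date", "present": True, "cited": True},
    {"nugget": "key actor", "present": True, "cited": False},
    {"nugget": "location", "present": False, "cited": False},
]
print(nugget_score(example))  # 1 of 3 nuggets present and correctly cited
```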
Report requests will be distributed in JSONL format as a list of individual requests, one per line. Each request will contain the following JSON fields:
topic_id (string): A unique ID for this report request
title (string): A short description of the report request
background (string): Describes the context in which the report is being written
problem_statement (string): Describes what should and should not be included in the report
limit (int): Maximum number of NFKC-normalized Unicode characters the report may include

The submission format is a sequence of JSONL entries, each representing one report. Each report is a JSON object containing three main objects:
metadata (dictionary)
topic_id (string): The unique ID of the input report request
Other metadata fields may be present but will be ignored.
responses (array): A list of sentence dictionaries.
references (array): A list of reference document IDs (strings). This should be the union of all cited documents.
Sentences must appear in report order. Each sentence dictionary has the following fields:
text (string): The text of the sentence
citations (dictionary): A dictionary of zero or more document IDs (strings) mapped to floating-point scores. The higher the score, the more confidence the system has in the validity of that citation.

Please submit your runs through the Google Form that will be announced later. Each team can submit an unlimited number of runs, but only the last three submissions from a team for each task will be evaluated and considered in the competition.
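To make the format concrete, here is a minimal sketch that assembles one report entry with the metadata, responses, and references objects described above and appends it to a JSONL run file. The topic ID, sentence text, and document IDs are hypothetical.

```python
# Minimal sketch: assembling one report entry in the submission format
# described above. All IDs and text below are hypothetical.
import json

report = {
    "metadata": {
        "topic_id": "topic_001",  # must match the input report request
    },
    "responses": [  # sentences in report order
        {
            "text": "Example sentence summarizing a finding from the corpus.",
            # document IDs mapped to citation confidence scores (floats)
            "citations": {"doc_12345": 0.91, "doc_67890": 0.55},
        },
    ],
    # union of all documents cited anywhere in the report
    "references": ["doc_12345", "doc_67890"],
}

# One JSON object per line in the run file.
with open("my_run.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(report, ensure_ascii=False) + "\n")
```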
Each participating team is expected to submit a system paper after the results are announced. During the conference, the winner of each task will receive a slot for an oral presentation; other teams will be invited to present at the poster session. We strongly encourage every team to participate in the poster session to share their findings.
HLTCOE, Johns Hopkins University
University of Glasgow
HLTCOE, Johns Hopkins University
Allen Institute for AI
HLTCOE, Johns Hopkins University
HLTCOE, Johns Hopkins University