Text generation has become a critical component of modern AI applications such as chatbots and agentic assistants. To combat hallucinations and provide trustworthy output, retrieval-augmented generation (RAG) is the current norm for information-dense tasks. Report Generation is a long-form RAG task with strict attestation requirements, which makes it well suited to exploring questions of RAG evaluation and multilingual generation. In this task, a long-form report summarizing the relevant information in a corpus is generated in response to a report request, which consists of a user background and an information need. The generated report should provide proper attribution to the source documents to establish trust. This task is also frequently referred to as "deep research" or "agentic search," where the system needs to acquire a deep understanding of the topic before responding to users. Here, we use the term "report generation" because it describes the task itself rather than the approaches applied to it, which we find more accurate.
RAG4Reports would like to draw the attention of the ACL community to developing systems and evaluation methods for Multilingual Report Generation, which can also lead to more general solutions for other long-form RAG problems.
Strong focus on groundedness and citation support.
Incorporating information in different languages is critical for effective RAG.
Paragraphs on coherent and information-dense responses.
LLM-as-a-judge has become the default for evaluating open-ended generation, but the approach is riddled with silent failure modes, including position bias, verbosity bias, criterion conflation, sycophancy, and run-to-run inconsistency, that corrupt judgments without any visible signal. Mitigations exist, scattered across the LM-as-judge literature and decades of work in psychometrics and educational measurement, but every research group ends up paying a “Reinvention Tax,” reimplementing option shuffling, ensemble voting, calibration, and reliability metrics from scratch.
I will present Autorubric, an open-source framework that consolidates these best practices into a single library with opinionated defaults: analytic per-criterion decomposition, mixed criterion types, ensemble judging, length penalties, and a full suite of psychometric reliability metrics. Beyond measurement, Autorubric’s mandatory per-criterion explanations function as “textual gradients” for two downstream applications: rubric-guided prompt induction and RL with rubric rewards. Autorubric is available at autorubric.org.
Chris Callison-Burch is the Raj and Neera Singh Professor of Artificial Intelligence at the University of Pennsylvania, where he directs the online Master’s in AI and teaches Penn Engineering’s flagship AI course to more than 500 students each fall. In 2026 he received the Lindback Award for Distinguished Teaching, Penn’s highest teaching honor. He chairs the advisory board for the Human Language Technology Center of Excellence at Johns Hopkins University. He testified before Congress in 2023 on generative AI and copyright law, and in 2026 participated in the Isaac Asimov Memorial Debate at the American Museum of Natural History, moderated by Neil deGrasse Tyson. He has authored more than 200 publications with over 36,000 citations, and is a Sloan Research Fellow with research support from DARPA, IARPA, NSF, and industry partners including Google, Microsoft, and Amazon.
RAG4Reports is interested in various aspects of the Report Generation problem, a deep research task that research agents can be used to tackle. The following is an incomplete list of topics. Please reach out to us if you are unsure whether your work is within the scope of this workshop.
Papers can either be submitted directly to RAG4Reports through OpenReview or committed through ARR. Submissions should use the ACL format with at most 8 pages excluding references. Reviewers will be asked to consider the paper’s length when writing their reviews.
Direct submissions will undergo a single-blind peer-review process (i.e., reviewers will see the identity of the authors), and each paper will receive at least 3 reviews and one meta-review. For submissions proposing new resources, such as datasets or software, please provide access to the resource so that the reviewers can assess its merit.
Authors of each direct submission can choose between archival and non-archival publication after acceptance. If the authors decide to make the paper non-archival, we will provide a slot for presentation, but not publish the paper under our workshop. If the authors decide to make the submission archival, we will publish the paper in the ACL Anthology.
RAG4Reports will host two tasks:
We will provide system-generated reports from 2025 TREC RAGTIME submissions that have been judged by human annotators as the input for the shared task participants. The task is to provide a system ranking for each report request (a long-form query with a description of the user background) as well as an overall ranking across all report requests. The submitted rankings will be evaluated by their correlation with the ranking derived from human annotations. We will accept two types of submissions:
To study the effect of document languages on the evaluation, we will accept submissions using an English translation (provided by the organizers) of the corpus or using the multilingual corpus with documents in their original languages. We will use Auto-ARGUE as the baseline for Task a2.
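As an illustration of how a submitted ranking might be compared against the human-derived one, the following is a minimal sketch that computes Kendall's tau between two sets of per-system scores for a single report request. The choice of Kendall's tau (tau-a, with ties counted as neither concordant nor discordant), the function name `kendall_tau`, and the example system IDs and scores are all assumptions made for illustration; the text above does not name the correlation measure the organizers will use.

```python
from itertools import combinations


def kendall_tau(human_scores: dict[str, float], predicted_scores: dict[str, float]) -> float:
    """Kendall's tau-a over the systems present in both score dictionaries."""
    systems = sorted(set(human_scores) & set(predicted_scores))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems to compare rankings")
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        human_diff = human_scores[a] - human_scores[b]
        predicted_diff = predicted_scores[a] - predicted_scores[b]
        product = human_diff * predicted_diff
        if product > 0:
            concordant += 1  # both rankings order the pair the same way
        elif product < 0:
            discordant += 1  # the rankings disagree on this pair
        # A tie in either ranking counts as neither concordant nor discordant.
    total_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for three systems on one report request.
human = {"sys_a": 0.9, "sys_b": 0.5, "sys_c": 0.1}
predicted = {"sys_a": 0.8, "sys_b": 0.2, "sys_c": 0.4}
tau = kendall_tau(human, predicted)
```

In this example the two rankings agree on two of the three system pairs and disagree on one, so `tau` is one third. An overall ranking could be compared the same way after aggregating scores across report requests.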
Participants will receive a set of report generation responses that need to be evaluated. Each generation system will map to a JSONL file where each line is the response to a request. The file name will be the generation system ID. Please see the submission format of the Multilingual Report Generation task for details.
The output format should be a TSV with the columns:
topic_id (string): the topic ID that this line reports on
generation_system_id (string): the generation system that this line reports on
metric_name (string): the name of the metric
score (float): the numerical score of the metric for this generation system on this topic

There will be a field in the submission portal to indicate which metric you would like to use for the shared task. You may include multiple metrics in the submission and pick one for the evaluation.
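The TSV output described above could be produced along these lines. This is a sketch under stated assumptions: the helper name `write_scores_tsv`, the metric name `my_metric`, the four-decimal score formatting, and the decision to omit a header row by default are illustrative choices, not requirements stated by the organizers.

```python
import csv
import io

# Column order as listed in the task description.
COLUMNS = ["topic_id", "generation_system_id", "metric_name", "score"]


def write_scores_tsv(rows, handle, include_header=False):
    """Write one tab-separated line per (topic, system, metric) score."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if include_header:  # assumption: whether a header row is expected is not specified
        writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([
            row["topic_id"],
            row["generation_system_id"],
            row["metric_name"],
            f"{float(row['score']):.4f}",
        ])


# Hypothetical scores; several metrics may share one file.
rows = [
    {"topic_id": "101", "generation_system_id": "sys_a", "metric_name": "my_metric", "score": 0.75},
    {"topic_id": "101", "generation_system_id": "sys_b", "metric_name": "my_metric", "score": 0.5},
]
buffer = io.StringIO()
write_scores_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

Writing to a file instead only requires passing a handle opened with `open(path, "w", newline="", encoding="utf-8")`.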
This task involves generating long-form reports in response to a request using information retrieved from a multilingual corpus. Report requests consist of background information about the user and a statement describing their information need in English. In contrast to other RAG tasks, reports should contain only information that is grounded in the corpus. Generated reports should consist of sentences with citations and will be given a length limit. Reports should be written in the same language as the report request. The corpus consists of four million English, Chinese, Russian, and Arabic documents sampled from Common Crawl News, evenly sampled from 2021 to 2024. The organizers will provide search services accessible through an API in addition to the corpus itself. Submitted reports will be judged automatically based on the Auto-ARGUE framework, which scores reports based on whether nuggets of related information are present and correctly cited in the report. We plan to score reports using a range of LLMs to understand their agreement.
Report requests will be distributed in JSONL format as a list of individual requests, one per line. Each request will contain the following JSON fields:
topic_id (string): A unique ID for this report request
title (string): A short description of the report request
background (string): Describes the context in which the report is being written
problem_statement (string): Describes what should and should not be included in the report
limit (int): Maximum number of NFKC-normalized Unicode characters the report may include

The submission format is a sequence of JSONL entries, each representing one report. Each report is a JSON object containing three main objects:
metadata (dictionary)
  topic_id (string): The unique ID of the input report request
  run_id (string): An arbitrary string to identify the run. It is recommended to include your team name as part of the run_id

Other metadata fields may be present but will be ignored.
responses (array): a list of sentence dictionaries.
references (array): a list of reference document IDs (strings). This should be the union of all cited documents.

Sentences must appear in report order. Each sentence dictionary has the following fields:
text (string): a string containing the text of the sentence
citations (dictionary): a dictionary of zero or more document IDs (strings) mapped to scores that are floating point numbers. The higher the number, the more confidence the system has in the validity of that citation.

Please submit your runs to TIRA. Each team can submit an unlimited number of submissions, but only the last three submissions from the team for each task will be evaluated and considered in the competition.
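The request and submission formats described above can be sketched as follows. The helper names `report_length` and `build_report`, the example request, the run ID, and the document IDs are hypothetical. The sketch also assumes that the length limit is checked by summing the NFKC-normalized length of each sentence's text; how the organizers count characters between sentences is not stated here.

```python
import json
import unicodedata


def report_length(sentences):
    """Count NFKC-normalized characters summed over all sentence texts."""
    return sum(len(unicodedata.normalize("NFKC", s["text"])) for s in sentences)


def build_report(request, run_id, sentences):
    """Assemble one submission entry and check it against the request's limit."""
    responses = [
        {"text": s["text"], "citations": dict(s.get("citations", {}))}
        for s in sentences  # sentences are kept in report order
    ]
    length = report_length(responses)
    if length > request["limit"]:
        raise ValueError(f"report has {length} characters, limit is {request['limit']}")
    # references is the union of every document ID cited by any sentence.
    references = sorted({doc_id for s in responses for doc_id in s["citations"]})
    return {
        "metadata": {"topic_id": request["topic_id"], "run_id": run_id},
        "responses": responses,
        "references": references,
    }


# Hypothetical request line and sentences, for illustration only.
request_line = (
    '{"topic_id": "101", "title": "Example request", '
    '"background": "Hypothetical user background", '
    '"problem_statement": "Hypothetical information need", "limit": 2000}'
)
request = json.loads(request_line)
sentences = [
    {"text": "First claim.", "citations": {"doc_b": 0.9, "doc_a": 0.4}},
    {"text": "Second claim.", "citations": {"doc_b": 0.7}},
    {"text": "A transition sentence without citations.", "citations": {}},
]
report = build_report(request, "myteam_run1", sentences)
# One JSON object per line; ensure_ascii=False keeps non-English text readable.
jsonl_line = json.dumps(report, ensure_ascii=False)
```

Checking the limit before serialization catches over-length reports early, and deriving `references` from the citations keeps the two fields from drifting apart.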
Each participating team is expected to submit a system paper after the results are announced. During the conference, the winner in each task will receive a slot for an oral presentation. Other teams will be invited to present at the poster session. We strongly encourage each team to participate in the poster session to share the knowledge.
HLTCOE, Johns Hopkins University
University of Glasgow
HLTCOE, Johns Hopkins University
Allen Institute for AI
HLTCOE, Johns Hopkins University
HLTCOE, Johns Hopkins University