RAG4Reports 2026

Keynote

Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

Chris Callison-Burch

University of Pennsylvania

LLM-as-a-judge has become the default for evaluating open-ended generation, but the approach is riddled with silent failure modes, including position bias, verbosity bias, criterion conflation, sycophancy, and run-to-run inconsistency, that corrupt judgments without any visible signal. Mitigations exist, scattered across the LLM-as-a-judge literature and decades of work in psychometrics and educational measurement, but every research group ends up paying a “Reinvention Tax,” reimplementing option shuffling, ensemble voting, calibration, and reliability metrics from scratch.

I will present Autorubric, an open-source framework that consolidates these best practices into a single library with opinionated defaults: analytic per-criterion decomposition, mixed criterion types, ensemble judging, length penalties, and a full suite of psychometric reliability metrics. Beyond measurement, Autorubric’s mandatory per-criterion explanations function as “textual gradients” for two downstream applications: rubric-guided prompt induction and RL with rubric rewards. Autorubric is available at autorubric.org.

Biography

Chris Callison-Burch is the Raj and Neera Singh Professor of Artificial Intelligence at the University of Pennsylvania, where he directs the online Master’s in AI and teaches Penn Engineering’s flagship AI course to more than 500 students each fall. In 2026 he received the Lindback Award for Distinguished Teaching, Penn’s highest teaching honor. He chairs the advisory board for the Human Language Technology Center of Excellence at Johns Hopkins University. He testified before Congress in 2023 on generative AI and copyright law, and in 2026 participated in the Isaac Asimov Memorial Debate at the American Museum of Natural History, moderated by Neil deGrasse Tyson. He has authored more than 200 publications with over 36,000 citations, and is a Sloan Research Fellow with research support from DARPA, IARPA, NSF, and industry partners including Google, Microsoft, and Amazon.

Program

July 4, 2026 · San Diego, CA, USA · Room Harbor A&B
Co-located with ACL 2026

9:00 – 9:15

Opening Remarks and Shared Task Introduction

9:15 – 10:00

Keynote: Chris Callison-Burch (University of Pennsylvania)

10:00 – 10:30

Research Paper Orals

Decompose, Retrieve, Cite: A RAG Pipeline for Structured Report Generation from Technical Documentation · Himanshu Dhurve, Sreedath Panat, Rajat Dandekar, Raj Dandekar
REFSafE: A RAG-Enabled Framework for Predictive Risk Analysis and Automated Safety Report Generation in Mission-Critical Environments · Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Gavin Wiggins, Dana Hewit, Sudarshan Srinivasan, Prasanna Balaprakash, Robert Patton, Todd Thomas, Tirthankar Ghosal

10:30 – 11:00

Coffee Break

11:00 – 11:30

Shared Task Orals

GenAIus at RAG4Reports 2026: Citation-Aware Compression for Multilingual Report Generation (Shared Task) · Reyyan Yeniterzi, Suveyda Yeniterzi
UNH @ Rag4Reports: A Broad Exploration of LLM-Judges for RAG (Shared Task) · Minna Tran, Ryan McCarthy, Aiden Parsons, Jaren Unzen, Laura Dietz
Crucible @ Rag4Reports: Generating Nuggets for Report Generation and Evaluation (Shared Task) · Laura Dietz, Eugene Yang

11:40 – 12:30

Poster Session

StructSurvey: Structured Agentic Retrieval for Automated Survey Paper Generation · Paolo Pedinotti, Enrico Santus
DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval · Taegyeong Lee, Jiwon Park, Seunghyun Hwang, JooYoung Jang
A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems · Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
Decompose, Retrieve, Cite: A RAG Pipeline for Structured Report Generation from Technical Documentation · Himanshu Dhurve, Sreedath Panat, Rajat Dandekar, Raj Dandekar
EncouRAGe: Evaluating RAG Local, Reliable, and Efficient · Jan Strich, Martin Semmann, Chris Biemann
REFSafE: A RAG-Enabled Framework for Predictive Risk Analysis and Automated Safety Report Generation in Mission-Critical Environments · Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Gavin Wiggins, Dana Hewit, Sudarshan Srinivasan, Prasanna Balaprakash, Robert Patton, Todd Thomas, Tirthankar Ghosal
ORCHID: Orchestrated Retrieval-Augmented Classification of High-Risk Property with Intelligent Decision-Making · Sanjay Das, Maria Mahbub, Vanessa Lama, Brian Starks, Christopher Polchek, Saffell Silvers, Lauren Deck, Prasanna Balaprakash, Robert Patton, Tirthankar Ghosal
A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science · Pierre Achkar, Tim Gollub, Arno Simons, Harrisen Scells, Maik Fröbe, Martin Potthast
UNH @ Rag4Reports: A Broad Exploration of LLM-Judges for RAG (Shared Task) · Minna Tran, Ryan McCarthy, Aiden Parsons, Jaren Unzen, Laura Dietz
Crucible @ Rag4Reports: Generating Nuggets for Report Generation and Evaluation (Shared Task) · Laura Dietz, Eugene Yang
GenAIus at RAG4Reports 2026: Citation-Aware Compression for Multilingual Report Generation (Shared Task) · Reyyan Yeniterzi, Suveyda Yeniterzi
AMU at RAG4Reports 2026 Task B: A Practical Multilingual RAG Pipeline for Citation-Grounded Reports (Shared Task) · Maciej Czajka, Piotr Jabłoński, Mateusz Czajka, Konrad Pierzyński, Krzysztof Jassem
Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation (Shared Task) · Mann Bajpai, Pulkit Chatwal, Priyanshu Deswal, Harish Pratap Singh, Santosh Kumar Mishra
EFSG: Evidence-First Structured Generation for Multilingual RAG Report Generation (Shared Task) · Shaurya Gupta, Jatin Bedi
Adapting AutoARGUE for Automatic Report Evaluation under Missing Citation Annotations (Shared Task) · Divrose Kaur, Jatin Bedi, Jasmeet Singh
JU-NLP-PG at RAG4Reports 2026: Memory-Efficient Multilingual Report Generation with 4-bit Quantized LLMs (Shared Task) · Swayam Chatterjee, Dipankar Das

Call for Papers

Submission on OpenReview (Closed)

RAG4Reports is interested in all aspects of report generation, a deep research task that research agents can be used to tackle. The following is a non-exhaustive list of topics. Please reach out to us if you are unsure whether your work is within the scope of this workshop.

Report generation and long-form RAG systems
Multilingual report generation or RAG datasets
Agentic deep research systems
Analysis of components in report generation systems, such as multilingual retrieval and multilingual generative models
Evaluation methods for report generation

Submission Guidelines and Review Process

Papers can either be submitted directly to RAG4Reports through OpenReview or committed through ARR. Submissions should use the ACL format with at most 8 pages excluding references. Reviewers will be asked to consider the paper’s length when writing their reviews.

Direct submissions will undergo a single-blind peer-review process (i.e., reviewers will see the identity of the authors), and each paper will receive at least 3 reviews and one meta-review. For submissions proposing new resources, such as datasets or software, please provide access to the resource so that reviewers can assess its merit.

Authors of each direct submission can choose between archival and non-archival publication after acceptance. If the authors decide to make the paper non-archival, we will provide a presentation slot but will not publish the paper as part of our workshop. If the authors decide to make the submission archival, we will publish the paper in the ACL Anthology.

Important Dates

~~First call for papers: December 10, 2025~~
~~Second call for papers: January 15, 2026~~
~~Third call for papers: February 20, 2026~~
~~Paper submission deadline: March 5, 2026~~
~~Pre-reviewed ARR commitment deadline: March 24, 2026~~
~~Notification of acceptance: April 28, 2026~~
~~Camera-ready paper due: May 12, 2026~~
Workshop date: July 4, 2026, San Diego, CA, USA

Shared Tasks

Submission Portal on TIRA (Closed) · Data

Email Eugene Yang to join the RAG4Reports Slack Channel!

Task Description and Evaluation

RAG4Reports will host two tasks:

Task A: Automatic Report Evaluation
Task B: Multilingual Report Generation

Automatic Report Evaluation

We will provide system-generated reports from 2025 TREC RAGTIME submissions that have been judged by human annotators as the input for the shared task participants. The task is to provide a system ranking for each report request (a long-form query with a description of the user’s background) as well as an overall ranking across all report requests. The submitted rankings will be evaluated by their correlation with the ranking derived from human annotations. We will accept two types of submissions:

a1. fully automatic evaluators: without any additional human input;
a2. semi-automatic evaluators: with an additional input of human-curated essential facts (provided by the organizers) that should be included in a useful report.

To study the effect of document languages on the evaluation, we will accept submissions using an English translation (provided by the organizers) of the corpus or using the multilingual corpus with documents in their original languages. We will use AutoARGUE as the baseline for Task a2.
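The text above says rankings are scored by correlation with the human-derived ranking but does not name the coefficient. As a minimal sketch, assuming Kendall's tau (an assumption, not the organizers' stated metric), the comparison for a single report request might look like the following. The system IDs and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system_id: score} dicts.

    Only systems present in both dicts are compared. Tied pairs count as
    neither concordant nor discordant.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system scores for one report request.
human = {"sys_a": 0.82, "sys_b": 0.64, "sys_c": 0.41, "sys_d": 0.30}
metric = {"sys_a": 0.75, "sys_b": 0.50, "sys_c": 0.55, "sys_d": 0.20}

tau = kendall_tau(human, metric)  # 5 concordant pairs, 1 discordant pair
```

Here the automatic metric swaps `sys_b` and `sys_c` relative to the human scores, so one of the six system pairs is discordant.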

Data and Submission Format

Participants will receive a set of report generation responses that need to be evaluated. Each generation system will map to a JSONL file where each line is the response to a request. The file name will be the generation system ID. Please see the submission format of the Multilingual Report Generation task for details.
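A minimal sketch of loading this input is shown below. It assumes the files carry a `.jsonl` extension and that each line follows the report format of the Multilingual Report Generation task (with `metadata.topic_id`). The function name and the sample file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_system_reports(directory):
    """Return {generation_system_id: {topic_id: report}}.

    The generation system ID is taken from the file name (without the
    assumed `.jsonl` extension); each non-empty line is one report.
    """
    reports = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        by_topic = {}
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    report = json.loads(line)
                    by_topic[report["metadata"]["topic_id"]] = report
        reports[path.stem] = by_topic
    return reports


# Demonstration with a throwaway directory and one hypothetical system.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "metadata": {"topic_id": "topic-1", "run_id": "demo"},
        "responses": [],
        "references": [],
    }
    Path(tmp, "system_x.jsonl").write_text(
        json.dumps(sample) + "\n", encoding="utf-8"
    )
    loaded = load_system_reports(tmp)
```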

The output format should be a TSV with the columns:

topic_id (string): the topic ID that this line is reporting
generation_system_id (string): the generation system that this line is reporting
metric_name (string): the metric name
score (float): the numerical score of the metric for this generation system on this topic

There will be a field in the submission portal to indicate which metric you would like to use for the shared task. You may include multiple metrics in the submission and pick one for the evaluation.
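The four-column output could be produced along the following lines. This is a sketch only: the text does not say whether a header row is expected, so the header is optional and off by default here, and the metric name and IDs are hypothetical.

```python
import csv
import io

COLUMNS = ["topic_id", "generation_system_id", "metric_name", "score"]


def write_scores(rows, handle, header=False):
    """Write one tab-separated line per (topic, system, metric) score."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if header:  # assumption: whether a header is wanted is not specified
        writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([
            row["topic_id"],
            row["generation_system_id"],
            row["metric_name"],
            f"{float(row['score']):.6f}",
        ])


# Hypothetical scores; a real run would have one row per topic and system.
rows = [
    {"topic_id": "topic-1", "generation_system_id": "system_x",
     "metric_name": "my_metric", "score": 0.5},
]
buffer = io.StringIO()
write_scores(rows, buffer)
tsv_text = buffer.getvalue()
```

Several metrics can share one file by repeating the same topic and system with different `metric_name` values.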

Multilingual Report Generation

This task involves generating long-form reports in response to a request using information retrieved from a multilingual corpus. Report requests consist of background information about the user and a statement describing their information need in English. In contrast to other RAG tasks, reports should contain only information that is grounded in the corpus. Generated reports should consist of sentences with citations and will be given a length limit. Reports should be written in the same language as the report request. The corpus consists of four million English, Chinese, Russian, and Arabic documents sampled from Common Crawl News, evenly sampled from 2021 to 2024. The organizers will provide search services accessible through an API in addition to the corpus itself. Submitted reports will be judged automatically based on the Auto-ARGUE framework, which scores reports based on whether nuggets of related information are present and correctly cited in the report. We plan to score reports using a range of LLMs to understand their agreement.

Request and Submission Format

Report requests will be distributed in JSONL format as a list of individual requests, one per line. Each request will contain the following JSON fields:

topic_id (string): A unique ID for this report request
title (string): A short description of the report request
background (string): Describes the context in which the report is being written
problem_statement (string): Describes what should and should not be included in the report
limit (int): Maximum number of NFKC-normalized Unicode characters the report may include
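The `limit` field can be checked before submitting. The sketch below parses one request line and counts NFKC-normalized characters with Python's standard `unicodedata` module. It assumes the count covers only the sentence texts; whether separators between sentences count toward the limit is not stated above. All field values are hypothetical.

```python
import json
import unicodedata


def report_length(sentences):
    """Total number of characters after NFKC normalization."""
    return sum(len(unicodedata.normalize("NFKC", text)) for text in sentences)


# One hypothetical line from the requests JSONL file.
request_line = json.dumps({
    "topic_id": "topic-1",
    "title": "Example request",
    "background": "Example background",
    "problem_statement": "Example problem statement",
    "limit": 20,
})
request = json.loads(request_line)

# "\ufb01" is the single-character ligature "fi"; NFKC expands it to "f" + "i",
# so the normalized text is one character longer than the raw string.
sentences = ["\ufb01ne print."]
length = report_length(sentences)
within_limit = length <= request["limit"]
```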

The submission format is a sequence of JSONL entries, each representing one report. Each report is a JSON object containing three main objects:

metadata (dictionary)
- topic_id (string): The unique ID of the input report request
- run_id (string): An arbitrary string to identify the run. It is recommended to include your team name as part of the run_id
Other metadata fields may be present but will be ignored.
responses (array): a list of sentence dictionaries.
references (array): a list of reference document IDs (strings). This should be the union of all cited documents.

Sentences must appear in report order. Each sentence dictionary has the following fields:

text (string): a string containing the text of the sentence
citations (dictionary): a dictionary of zero or more document IDs (strings) mapped to scores that are floating point numbers. The higher the number, the more confidence the system has in the validity of that citation.
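Putting the fields above together, a report entry could be assembled as in this sketch. The helper name, document IDs, and confidence scores are hypothetical; sorting `references` is a convenience choice here, not a stated requirement.

```python
import json


def build_report(topic_id, run_id, sentences):
    """Build one report object.

    `sentences` is a list of (text, {doc_id: confidence}) pairs, already
    in report order.
    """
    responses = [
        {"text": text, "citations": dict(citations)}
        for text, citations in sentences
    ]
    # `references` is the union of every cited document ID.
    references = sorted(
        {doc_id for _, citations in sentences for doc_id in citations}
    )
    return {
        "metadata": {"topic_id": topic_id, "run_id": run_id},
        "responses": responses,
        "references": references,
    }


report = build_report(
    "topic-1",
    "myteam-run1",
    [
        ("First claim.", {"doc-2": 0.9, "doc-1": 0.4}),
        ("Second claim.", {"doc-2": 0.7}),
        ("An uncited connective sentence.", {}),
    ],
)
jsonl_line = json.dumps(report, ensure_ascii=False)  # one line per report
```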

Submission Instructions

Please submit your runs to TIRA. Each team can submit an unlimited number of submissions, but only the last three submissions from the team for each task will be evaluated and considered in the competition.

Each participating team is expected to submit a system paper after the results are announced. During the conference, the winner of each task will receive a slot for an oral presentation. Other teams will be invited to present at the poster session. We strongly encourage each team to participate in the poster session to share their findings.

Important Dates

~~Data release: December 10, 2025~~
~~Task A and B submission deadline: April 15, 2026~~
~~Result announcement: April 28, 2026~~
~~System papers due: May 12, 2026~~
Workshop date: July 4, 2026, San Diego, CA, USA

Workshop Features

Grounded Generation

Multilingual RAG

Long Generation

Organizing Committee

Dawn Lawrie

Sean MacAvaney

James Mayfield

Luca Soldaini

Eugene Yang

Andrew Yates