Schedule Released
Program schedule is live; acceptance notifications have been sent.

Papers

Acceptance notifications have been sent. Accepted papers are listed below in two categories: Proceedings/Archival and Non-Archival. Presentation format is shown on each paper as a pill.

Authors can see decision details in OpenReview.
Author lists and final anthology links will be added as publication metadata is finalized.
42 Accepted Papers
25 Proceedings/Archival
17 Non-Archival

Proceedings/Archival

Accepted papers selected for archival publication in the workshop proceedings.

25 papers

TabBridge: Bridging Structure and Context for Accurate Table Reasoning

Oral + Poster
Abstract

Table reasoning remains challenging for Large Language Models (LLMs) as it requires integrating structured tabular information with natural language questions. Previous SQL-based approaches rely on surface-level alignment between question keywords and column headers, often generating queries with spurious or missing column mappings. We introduce TabBridge, a framework that incorporates both structural and contextual information for accurate table reasoning. TabBridge first generates a unified textual representation called Table Specification (TabSpec), preserving the structural information through row and column analysis. In order to ensure accuracy and consistency, we also employ a reconstruction-based evaluation mechanism to verify and refine the generated TabSpec. TabSpec is subsequently used to generate SQL aligned with the contextual intent of the question, enabling accurate interpretation of column semantics that are often overlooked by previous approaches. Across three public benchmarks, TabBridge shows consistent improvements over previous SQL-based methods, achieving 73.94% accuracy on WikiTableQuestions (+5.3 pp over the previous state of the art). TabBridge also demonstrates robust performance across diverse LLM backbones, confirming its generalizability across model architectures. Our code is available at https://github.com/raylee0519/TabBridge.

Multimodal Narrative Synthesis in Complex Documents via Omni-Parser Transformer

Oral + Poster
Abstract

Traditional vision-language models treat documents as fragmented collections of image-text pairs, losing the connective tissue that defines professional media. We propose the Omni-Parser Transformer (OPT), a paradigm shift that moves from cross-modal matching to Document-Scale Narrative Reasoning. At its core, OPT leverages the high-dimensional latent space of Llama 3 to act as a multimodal reasoning hub, capable of interpreting how a specific visual asset supports, contradicts, or expands upon non-adjacent textual segments. By introducing a Neural Layout Graph, OPT preserves the spatial intent of document creators. Experimental benchmarks on our newly curated Wiki-Global Structure dataset reveal that OPT does not just "find" relevant text; it understands the functional role of images within a long-form argument, achieving a 15% improvement in complex zero-shot document reasoning over traditional VLP architectures.

Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

Poster
Abstract

We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.

More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval

Oral + Poster
Abstract

Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation of query-document pairs and retraining. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though PCA is traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.

TabFaith: Benchmarking and Improving Structural Faithfulness in LLM Table Summarization

Oral + Poster
Abstract

When large language models (LLMs) summarize tabular data, they produce fluent but systematically unfaithful text: hallucinating numerical values, misattributing entities to rows or columns, fabricating comparative rankings, and conflating temporal references. Existing faithfulness metrics (BLEU, PARENT, BERTScore) are poorly correlated with human judgments of structural faithfulness (r ≤ 0.60) because they are agnostic to the table's schema and cell structure. We introduce TabFaith, a benchmark of 2,400 (table, summary, error annotation) triples across five structural error types, built from ToTTo and a new enterprise table summarization dataset (TabSum-Ent) covering financial reports, clinical notes, and operational dashboards. We further propose STAF (Structural Table-Aware Faithfulness), a reference-free metric that decomposes faithfulness verification into cell-level claim alignment using natural language inference over table cells. STAF achieves r = 0.94 with human faithfulness judgments, a +0.34 improvement over PARENT (r = 0.60) and +0.70 over BLEU (r = 0.24). Guided by STAF's fine-grained signal, we design CAVE (Cell-Anchored Verification and Editing), a training-free post-processing method that identifies unfaithful claims, traces them to specific table cells, and re-generates the offending spans. CAVE improves STAF scores by +0.14 on average across five LLMs on both ToTTo and TabSum-Ent, with the largest gains for numerical errors (+0.17), the dominant error type for smaller models.

RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

Oral + Poster
Abstract

When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1–8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families, Qwen2.5 (1.5B/3B/7B) and Llama3 (1B/3B/8B), RSAT improves faithfulness 3.7× over SFT alone (0.224 → 0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.

DSMentor: Curriculum-Guided Inference with Online Memory for Data-Science LLM Agents

Oral + Poster
Abstract

Large language model (LLM) agents have shown strong capabilities in generating code to solve complex data science problems, yet they often overlook the impact of task order during inference. We present DSMentor, an inference-time optimization framework that applies curriculum learning—progressing from easier to harder tasks—to enhance LLM performance on challenging data science tasks. Guided by a mentor and supported by a growing long-term memory, DSMentor organizes problems by difficulty, retains prior experiences, and leverages them to guide subsequent reasoning. Extensive experiments on DSEval and QRData benchmarks show that DSMentor with Claude-3.5-Sonnet improves pass rates by up to 5.2% over baseline agents and achieves an 8.8% gain over GPT-4 with Program-of-Thoughts prompting. These results highlight the effectiveness of curriculum-based inference strategies in advancing LLM agents.

Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs

Poster
Abstract

Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT) – a general-purpose foundation framework for implementing and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes – Tree of Thoughts, Graph of Thoughts, and ProbTree – within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.

StructSurvey: Structured Agentic Retrieval for Automated Survey Paper Generation

Poster
Abstract

The rapid growth of scientific publications makes it increasingly difficult to track and synthesize research progress. While Large Language Models (LLMs) can support automated survey generation, existing methods retrieve unstructured data and require models to infer conceptual, methodological, and taxonomic relations from raw text at generation time. We introduce StructSurvey, a hierarchical multi-agent framework that shifts structural reasoning from generation to retrieval by dynamically constructing graph-based representations of entities, relations, and topical taxonomies. We evaluate StructSurvey on a new reference-grounded benchmark of ACL survey papers for reproducible long-form scientific summarization. Compared with embedding-only retrieval baselines, StructSurvey improves ROUGE-1 recall by +2.9 and ROUGE-2 recall by +1.0 on average, without reducing precision. It also improves LLM-as-a-Judge ratings for logical structure, depth, and synthesis, showing that explicit structural retrieval yields surveys closer to human-written organization and reasoning.

Map-of-Actions: Deliberate Reasoning over Multi-Labeled Graphs

Poster
Abstract

Multi-step reasoning in large language models (LLMs) is typically expressed as unstructured text, making intermediate states difficult to organize, verify, and revise explicitly. This limitation often leads to redundant reasoning paths, error accumulation, and limited controllability in complex tasks. We propose Map-of-Actions (MoA), a neuro-symbolic reasoning framework that treats reasoning as operations over an explicit structured state space. MoA represents intermediate states as a multi-labeled graph, in which each node corresponds to a semantically labeled reasoning unit. This representation provides LLMs with structured memory, explicit state transitions, and flexible interfaces to external tools. Experiments on multiple complex question answering (QA) benchmarks show that MoA consistently outperforms strong baselines, improving accuracy by up to 17.9 percentage points.

Mixed-Policy GRPO for Text-to-SQL with Off-Policy Data Generation

Poster
Abstract

Recent advances in text-to-SQL have shown that methods such as Group Relative Policy Optimization (GRPO) can substantially improve reasoning performance, but these approaches remain inherently on-policy, limiting their ability to incorporate novel reasoning patterns. In this work, we address this limitation by leveraging existing datasets to generate high-quality off-policy rollouts, enabling mixed-policy training that exposes models to diverse and informative reasoning trajectories. We present the first application of mixed-policy GRPO to the text-to-SQL domain and introduce a systematic study of off-policy data generation strategies for this setting, including a novel method, Iterative Error Correction (IEC), which iteratively refines model outputs through targeted feedback. Our experiments show that mixed-policy GRPO outperforms both base models and on-policy GRPO, yielding average improvements of +4.7% over base models and +4.1% over on-policy GRPO across the Spider and BIRD benchmarks. Gains are particularly strong on BIRD, reaching up to +7.3% over base models and +4.5% over on-policy GRPO.

Asking language models how to represent data for fine-tuning

Poster
Abstract

Language models are often used for tasks involving structured data like tables and graphs, but there is no principled approach for choosing the best format to represent such data for fine-tuning. We address this in three steps. First, we show that format choice remains important even after fine-tuning; models learn more efficiently with specific formats rather than adapting to any format. Second, we show that a pre-trained model can suggest its own candidate formats by auto-completing partial prompts, reducing reliance on developer intuition. Third, and most importantly, we demonstrate that base model performance across formats reliably predicts post-fine-tuning performance: the format that performs best before fine-tuning remains among the top candidates after fine-tuning in 16 out of 18 settings across three data structure types, three models, and six tasks. This finding allows format selection to be done via inference alone, avoiding costly trial-and-error fine-tuning runs.

Routing End User Queries to Enterprise Databases

Poster
Abstract

We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven re-ranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.

TreeDiff: AST-Guided Code Generation with Diffusion LLMs

Poster
Abstract

Code generation is increasingly critical for real-world applications. Still, diffusion-based large language models continue to struggle with this demand. Unlike free-form text, code requires syntactic precision; even minor structural inconsistencies can render a program non-executable. Existing diffusion-based large language models rely on random token masking for corruption, leading to two key failures: they lack awareness of syntactic boundaries during the iterative denoising process, and they fail to capture the long-range hierarchical dependencies essential for program correctness. To address both issues, we propose TreeDiff, a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the corruption process. Instead of masking individual tokens at random, we selectively mask tokens belonging to key AST nodes. By aligning the corruption process with the underlying structure of code, our method encourages the model to internalize the compositional nature of programming languages, enabling it to reconstruct programs that respect grammatical boundaries and capture long-range dependencies. Our method achieves a 13.3% relative improvement over the random masking training method, demonstrating its effectiveness on code generation tasks by leveraging the underlying structure of code.

Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards

Poster
Abstract

Accurate chart comprehension represents a critical challenge in advancing multimodal learning systems, as extensive information is compressed into structured visual representations. However, existing vision-language models (VLMs) frequently struggle to generalize to unseen charts because doing so requires abstract, symbolic, and quantitative reasoning over structured visual representations. In this work, we introduce Chart-RL, an effective reinforcement learning (RL) method that employs mathematically verifiable rewards to enhance chart question answering in VLMs. Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA and 11.5% on ChartInsights. We conduct a robustness analysis, in which Chart-RL achieves enhanced performance in 18 of 25 perturbed chart categories, demonstrating strong consistency and reasoning capability across visual variations. Furthermore, we demonstrate that task difficulty and inherent complexity are more critical than data quantity in RL training. For instance, Chart-RL trained on merely 10 complex chart-query examples significantly outperforms models trained on over 6,000 simple examples. Additionally, training on challenging reasoning tasks not only improves in-domain generalization relative to simpler tasks, but also facilitates strong transfer to out-of-domain visual mathematical problems.

SchemaScope: How Join-Hop Depth Breaks Text-to-SQL in Large Language Models, and a Decomposition-Based Remedy

Poster
Abstract

Large language models (LLMs) achieve impressive accuracy on standard Text-to-SQL benchmarks such as Spider and BIRD, yet enterprise databases, with hundreds of tables and complex foreign key graphs, remain a practical bottleneck. We hypothesize that a single, measurable property drives most of this gap: the join-hop depth (h) of the query, defined as the number of foreign key edges that must be traversed to gather all required columns. We introduce the Join-Hop Depth (JHD) benchmark, 410 human-annotated questions stratified by h ∈ {1, ..., 6} over 12 enterprise-scale schemas. Experiments on five frontier LLMs confirm a sharp accuracy cliff: all models exceed 80% at h = 1 but fall below 40% at h = 4 and below 25% at h = 6, the typical depth of real enterprise analytics queries. To address this, we propose SchemaScope, a decomposition framework that partitions deep queries into a sequence of sub-queries with h ≤ 2, executes them independently, and merges the results. SchemaScope raises execution accuracy from 46.8% to 67.3% on JHD (GPT-4o, h ≥ 3) and improves execution accuracy by +9.3 percentage points on the BIRD development set. Error analysis shows that decomposition eliminates "wrong join path" errors, the dominant failure mode at high h, and shifts the residual error budget toward condition and aggregation mistakes that are amenable to existing post-processing methods.
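To make the join-hop depth definition concrete, the following is a minimal sketch that counts the foreign key edges needed to connect a set of required tables. It is an illustration only, not the authors' benchmark code: the function name `join_hop_depth`, the example schema, and the choice to take the union of breadth-first shortest paths from the first required table (an upper bound on the true minimal connecting tree) are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def join_hop_depth(fk_edges: Iterable[Tuple[str, str]],
                   required_tables: List[str]) -> int:
    """Count foreign key edges used to connect all required tables.

    Assumption: edges are treated as undirected, and the connecting edge set
    is the union of BFS shortest paths from the first required table.
    """
    # Build an undirected adjacency map from the foreign key edges.
    graph: Dict[str, Set[str]] = {}
    for a, b in fk_edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    if not required_tables:
        return 0

    # BFS from the first required table, recording each node's parent.
    root = required_tables[0]
    parent: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    # Walk back from every other required table and collect the edges used.
    used = set()
    for table in required_tables[1:]:
        if table not in parent:
            raise ValueError(f"{table!r} is not reachable from {root!r}")
        while parent[table] is not None:
            used.add(frozenset((table, parent[table])))
            table = parent[table]
    return len(used)


# Hypothetical five-table chain schema, used only for illustration.
EXAMPLE_EDGES = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]
```

Under this sketch, a question needing columns from `customers` and `suppliers` in the hypothetical schema has depth 4, which is the regime where the abstract reports accuracy falling below 40%.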

Can LLMs Self-Correct Table Reasoning Errors?

Poster
Abstract

Self-correction—the ability of LLMs to detect and fix their own errors—has been studied extensively for mathematical and code reasoning, with limited prior work on table reasoning (primarily multi-agent pipelines such as Table-Critic, ACL 2025, rather than single-model structured prompting). Tables present unique challenges: errors arise from wrong cell retrieval, incorrect computation, flawed logic, and hallucination of values not present in the data. We conduct the first cross-provider single-model self-correction analysis for table reasoning across five providers (Google, Moonshot AI, Zhipu, Alibaba, MiniMax), testing five models (Gemini 3.1 Pro, Kimi K2.5, GLM 5, Qwen 3.5+, MiniMax M2.5) on WikiTableQuestions and TabFact with a multi-seed paired protocol. We propose Structured Self-Correction (SSC), a table-specific verification chain that guides models through cell verification, computation checking, logic validation, and completeness assessment. We confirm that the Accuracy-Correction Paradox (terminology from Li 2025) previously observed in math extends to tables: models with base accuracy in the mid-60s–mid-70s region benefit modestly from self-correction (multi-seed mean SCG up to +1.3% with within-seed point estimates as high as +3.4%), while stronger models above this region are systematically harmed by over-correction (multi-seed mean SCG down to -1.3%, with 95% bootstrap CIs significantly below zero). SSC reduces over-correction rates in 9 of 10 conditions, with reductions of 38–69% on TabFact. An inference-mode-controlled probe shows that SSC's qualitative direction is robust for Qwen 3.5+ across reasoning-ON and reasoning-OFF settings, while GLM 5 exhibits a substantial mode-dependent shift, indicating that mode robustness itself is model-dependent. Stronger baselines (self-consistency, self-critic, tool-augmented arithmetic verification, majority voting, and a same-family scaling probe) further characterize where SSC helps. 
Ablation studies reveal that answer-aware review is essential, reasoning traces aid error detection, and iterative correction shows diminishing returns. A FinQA domain transfer probe confirms a capability floor: self-correction fails when base task competence is very low (21.5% accuracy). Our primary contribution is empirical: we characterize the conditions under which self-correction helps or harms table reasoning, providing actionable guidance for practitioners.

Generalization in Graph Reasoning: A Systematic Comparison of LLM Training Approaches

Poster
Abstract

For large language models (LLMs), reasoning over graphs can help solve many problems. Prior work has tried to improve LLM graph reasoning through different training methods, but the merits of such approaches remain unclear and the limitations of each approach with respect to generalizability of reasoning are often not thoroughly explored. In this paper we systematically compare the ability of LLMs to learn fundamental graph tasks across a variety of training methods and their ability to generalize out of distribution across various dimensions. We highlight key tradeoffs between training methods, e.g., training specialized graph encoders and fusing their embeddings with LLMs consistently collapses in terms of generalizability; however, no single method shows clear superiority across all dimensions of generalizability, regardless of the size of the model.

Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness on Tax Law

Poster
Abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System

Poster
Abstract

In this work, we explore Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategies that exploit our Elo system without improving review effort. These findings show how the Elo system affects peer review and offer insights for improving AI conference evaluation. Our code is available at https://github.com/hsiangwei0903/EloReview.
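For readers unfamiliar with Elo ratings, here is a minimal sketch of the standard two-player Elo update. The abstract does not say how reviewer ratings are computed, so this is the textbook formula and not necessarily the paper's variant; the K-factor of 32 and the 400-point scale are conventional defaults assumed here.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    The K-factor k is an assumed default, not a value from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's result and expectation are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

With equal starting ratings of 1500, a win moves the winner to 1516 and the loser to 1484, so the total rating mass is conserved.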

StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs

Poster
Abstract

Large Language Models (LLMs) are increasingly used to generate structured outputs—JSON objects, SQL queries, and structured records—from formal schemas. While recent advances in constrained decoding and schema-aware prompting have improved syntactic compliance, the semantic reliability of these outputs remains poorly characterized. We investigate this gap through the lens of schema drift—the inevitable evolution of database schemas in production environments through column renamings, type changes, and constraint modifications. We introduce StructHallu-Drift, a benchmark and evaluation framework for studying structured hallucinations under schema evolution. We contribute: (1) a six-category hallucination taxonomy that disentangles syntactic validity from semantic fidelity; (2) a controlled evaluation suite applying realistic schema mutations at three severity levels to established NL-to-structure datasets; and (3) a systematic evaluation of four LLMs spanning 7B to 70B parameters across three structured output tasks. Experiments on 1,200 schema–model evaluation instances reveal four key findings: (i) 39–54% of structured outputs contain at least one semantic hallucination; (ii) schema drift severity has surprisingly minimal effect on hallucination rates (∼44% across all levels, p = 0.59), suggesting imperfect schema conditioning under our prompting setup; (iii) output format is the dominant factor in generation reliability, with SQL achieving ∼85% semantic validity while schema-grounded record generation drops to 7–24%; (iv) each model exhibits a distinct hallucination fingerprint, implying that mitigation strategies must be model-specific rather than universal. We publicly release our benchmark and evaluation toolkit.

TabGuard: Agentic LLM Orchestration for Adaptive Tabular Anomaly Detection via Dynamic Validator Selection and Generation

Poster
Abstract

Tabular anomaly detection is challenging because real-world tables contain heterogeneous columns, ranging from structured identifiers to free-form text. Existing methods face a fundamental trilemma: rule-based systems require extensive manual configuration and fail on novel schemas; statistical methods scale efficiently but miss semantic errors; and LLM-based approaches understand semantics but incur prohibitive per-cell inference costs. No prior method simultaneously addresses semantic heterogeneity, domain-specific validation rules, and enterprise-scale processing. We introduce TabGuard, an agentic framework that resolves this trilemma through semantic routing. Using LLM function calling, the system analyzes a small sample of each column and dynamically selects the most effective validation strategy, routing to a regex-based validator for syntactic patterns, a code-generation validator for domain-specific rules (such as Luhn checksums for credit cards), or an embedding-based validator for distributional outliers. This architecture decouples expensive cognitive reasoning (O(m) LLM calls for m columns) from scalable programmatic execution, enabling deployment on enterprise datasets without per-cell inference.
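As an illustration of the kind of domain-specific rule a generated validator might encode, here is a minimal sketch of the Luhn checksum mentioned above. It is a standard implementation written for this page, not output from TabGuard; the function name `luhn_is_valid` and the decision to ignore spaces and hyphens are assumptions.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    # Assumption: spaces and hyphens are formatting and can be ignored.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def flag_invalid_cells(column):
    """Return the row indices whose values fail the Luhn check."""
    return [i for i, value in enumerate(column) if not luhn_is_valid(str(value))]
```

A validator like this runs once per cell without any model call, which is the point of routing: the LLM is consulted per column, and cheap code handles the rows.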

UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

Poster
Abstract

Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, the task remains challenging in multi-table databases due to complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information and requires neither data access nor fine-tuning, making it scalable and adaptable across databases. Our code is available at: https://github.com/coral-lab-asu/unjoin
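The first-stage schema flattening can be sketched in a few lines. This is an illustrative reading of the abstract, not code from the UNJOIN repository: the function names, the `.` separator, and the example schema are assumptions.

```python
from typing import Dict, List, Tuple


def flatten_schema(schema: Dict[str, List[str]], sep: str = ".") -> List[str]:
    """Merge all tables into one column list, prefixing columns by table name."""
    return [f"{table}{sep}{column}"
            for table, columns in schema.items()
            for column in columns]


def split_flat_column(flat_name: str, sep: str = ".") -> Tuple[str, str]:
    """Recover (table, column) from a prefixed name for the map-back stage."""
    table, _, column = flat_name.partition(sep)
    return table, column


# Hypothetical two-table schema, used only for illustration.
EXAMPLE_SCHEMA = {
    "singer": ["id", "name"],
    "concert": ["id", "singer_id"],
}
```

Prefixing keeps same-named columns such as `singer.id` and `concert.id` distinct in the merged view, and the prefix carries enough information to identify which tables the second stage must join.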

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs

Poster
Abstract

Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. We further find that no single table format consistently yields superior performance. However, evaluating models across multiple formats is essential for a reliable assessment of their capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that reasoning over table tasks remains a significant challenge. The leaderboard, data and code are publicly available.

Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model

Poster
Abstract

Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness relies on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages Large Language Models (LLMs) and Data-QuestEval. Our dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without reliance on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.

Non-Archival

Accepted presentation-only papers that will be presented at the workshop without archival publication.

17 papers

When Verification Fails: Composed Inference Breaks Structured Reasoning

Poster
Abstract

Existing verification benchmarks largely consist of cases where falsification can be achieved through direct evidence retrieval. In this regime, models perform well and often appear over-rejective, reflecting a strategy that approximates closed-world verification under limited reasoning ability. We expose what these benchmarks conceal: when falsification instead requires composed reasoning over structural relationships rather than direct retrieval, this behavior degrades and models exhibit increased acceptance of infeasible claims, revealing an asymmetric verification routing that existing evaluation cannot detect. We construct adversarial hard negatives across three modalities (clinical text, scientific tables, and charts) that preserve all local observations while embedding violations only at the level of composed inference. Across multiple model families, performance degrades consistently on these examples despite saturation on existing benchmarks, and the gap is not resolved by scaling. Through structured prompting interventions, we show models can be shifted along a ROC curve from Open-World toward Closed-World behavior without weight changes, demonstrating that the failure reflects a behavioral default rather than a capability deficit, and can be corrected through test-time routing of the verification mode.

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Poster
Abstract

We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Poster
Abstract

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
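
As a rough illustration of the idea behind an entropy over functional-equivalence clusters, the sketch below groups sampled programs into clusters and computes the entropy of the cluster frequencies. It is not the authors' implementation: the abstract describes an LLM-based equivalence assessment, whereas `behaves_the_same` here is a hypothetical stand-in that compares outputs on a few probe inputs, and all function names are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence

Program = Callable[[int], int]


def behaves_the_same(f: Program, g: Program,
                     probes: Sequence[int] = (-2, -1, 0, 1, 2, 3)) -> bool:
    """Hypothetical stand-in for an equivalence judge: compare outputs on probe inputs."""
    return all(f(x) == g(x) for x in probes)


def cluster(samples: Sequence[Program],
            equivalent: Callable[[Program, Program], bool]) -> List[List[Program]]:
    """Greedily place each sample in the first cluster whose representative it matches."""
    clusters: List[List[Program]] = []
    for sample in samples:
        for members in clusters:
            if equivalent(members[0], sample):
                members.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def cluster_entropy(clusters: Sequence[Sequence[Program]]) -> float:
    """Shannon entropy (in nats) of the empirical cluster frequencies."""
    total = sum(len(members) for members in clusters)
    return -sum((len(m) / total) * math.log(len(m) / total) for m in clusters)


# Four sampled "solutions": three compute x squared, one computes 2x.
samples: List[Program] = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: abs(x) * abs(x),
    lambda x: x * 2,
]
clusters = cluster(samples, behaves_the_same)
print([len(m) for m in clusters], cluster_entropy(clusters))
```

Low entropy means the samples agree functionally; higher entropy signals disagreement and therefore more uncertainty. The sketch weights every sample equally, which is one simple choice among several.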

Theory of Space: Benchmarking Structured Spatial Belief Construction and Revision in Foundation Models

Poster
Abstract

Spatial embodied intelligence under partial observability requires agents to actively acquire missing information rather than passively consume complete observations. While multimodal foundation models excel at passive perception and reasoning, their ability to support self-directed exploration for building and maintaining coherent spatial beliefs remains understudied. We propose Theory of Space, defined as an agent’s ability to construct, revise, and exploit a spatial belief through active exploration under partial observability. We implement Theory of Space as a benchmark in textual and visual environments, where the goal is curiosity-driven exploration to build a complete and accurate spatial belief. A key innovation is spatial belief probing, which prompts agents to externalize their internal spatial belief as a cognitive map at each step, enabling direct measurement of belief quality. Evaluating state-of-the-art models on downstream tasks reveals three bottlenecks: (1) the Active-Passive Gap, where performance drops when agents must autonomously gather information (e.g., GPT-5.2: 57.1 → 46.0); (2) Inefficiency, with redundant and unsystematic exploration; and (3) unstable global beliefs, where spatial knowledge degrades over time. A false-belief paradigm further reveals Belief Inertia, especially severe in vision-based models.

Synthetic Contrastive Reasoning for Multi-Table Q&A

Poster
Abstract

The ability to understand and reason over tabular data is critical for the enablement of Large Language Models (LLMs) in enterprise settings, where domain-specific information is often stored in relational databases. While LLMs have made progress in understanding tabular data in recent years, they still struggle to retrieve the right evidence, reason across multiple tables, and find correct answers. A promising direction for improving LLMs in this setting is training models on multi-table question-answer (Q&A) pairs with associated reasoning traces. However, there is a lack of existing resources with reasoning traces annotated for multi-table Q&A. In this work, we address this gap by generating synthetic reasoning traces for multi-table Q&A and show that they lead to better performance than using Q&A pairs alone. Through analyses and ablations, we show that performance can be further improved by adding contrastive negative traces and training with contrastive preference optimization.

Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Poster
Abstract

Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
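
To make one of the compared strategies concrete, here is a minimal sketch of uniform binning for inter-event times. The token format `<t_i>`, the clipping rule, and the function names are assumptions made for illustration; the abstract does not specify how the study implements binning.

```python
def uniform_bin_token(dt: float, max_dt: float, num_bins: int) -> str:
    """Map an inter-event time to one of `num_bins` equal-width bin tokens."""
    clipped = min(max(dt, 0.0), max_dt)  # clip into [0, max_dt]
    index = min(int(clipped / max_dt * num_bins), num_bins - 1)
    return f"<t_{index}>"


def bin_center(token: str, max_dt: float, num_bins: int) -> float:
    """Decode a bin token back to the midpoint of its interval."""
    index = int(token[len("<t_"):-1])
    return (index + 0.5) * max_dt / num_bins


token = uniform_bin_token(55.0, max_dt=100.0, num_bins=10)
print(token, bin_center(token, max_dt=100.0, num_bins=10))
```

Equal-width bins spend the same number of tokens on every part of the range, which suits roughly uniform gaps but wastes resolution on heavy-tailed data. That trade-off is consistent with the abstract's point that the tokenizer should match the data's statistical properties.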

Do Structured Data Comprehension Skills Transfer Across Representation Types? A Systematic Study with Frontier LLMs

Poster
Abstract

Large language models (LLMs) are increasingly evaluated on structured data tasks—table question answering, chart comprehension, graph reasoning, and time series analysis—yet these benchmarks operate in isolation. We ask: does competence in one structured data format predict competence in another when the underlying data and questions are identical? We construct a controlled benchmark of 1,724 programmatically generated questions over 250 sub-tables drawn from 10 datasets spanning 7 domains. Each sub-table is rendered in five formats: Markdown table, chart image, chart text-description, entity-relationship graph, and unlabeled time series. We evaluate six frontier March 2026 LLMs and measure cross-format transfer via tetrachoric correlation on binary (correct/incorrect) outcomes. Our key findings: (1) structured data comprehension skills transfer strongly across text-based formats, with mean tetrachoric r=0.84 between table, graph, and time series representations (r=0.85 overall; r=0.84 for non-ceiling models, confirming the result is not an artifact of high accuracy); (2) chart text-descriptions show consistently lower transfer (r=0.65) despite being length-matched to tables; (3) chart image comprehension is largely siloed, with a 20-50% accuracy gap compared to equivalent text descriptions; and (4) enabling chain-of-thought reasoning improves accuracy by 24-38% for DeepSeek V3.2 but only 0-0.5% for Gemini 3.1 Pro, suggesting reasoning benefits may vary substantially across models. On hard questions (multi-hop and conditional aggregation), transfer remains substantial at r=0.78.
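
For readers unfamiliar with tetrachoric correlation on binary outcomes, the sketch below uses the classical cosine-pi approximation on a 2x2 table of correct/incorrect counts for two formats. The abstract does not say which estimator the study uses (maximum-likelihood estimation is common in practice), so treat this as an illustrative approximation; the function name and cell labels are assumptions.

```python
import math


def tetrachoric_cos_pi(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation.

    a: correct in both formats      b: correct only in format A
    c: correct only in format B     d: incorrect in both formats
    """
    concordant = a * d
    discordant = b * c
    if concordant == 0 and discordant == 0:
        raise ValueError("correlation is undefined for this table")
    if discordant == 0:
        return 1.0  # limit as the odds ratio goes to infinity
    return math.cos(math.pi / (1.0 + math.sqrt(concordant / discordant)))


# 100 questions: the two formats agree on 80 of them.
print(tetrachoric_cos_pi(40, 10, 10, 40))
```

The approximation depends only on the odds ratio ad/bc, so it returns 0 when the two outcomes are independent and approaches 1 or -1 as agreement or disagreement dominates.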

Graph-Regularized Agentic Context Evolution

Poster
Abstract

Deployed LLM agents rely on textual policies that must evolve from operational experience, yet over long horizons flat-text representations make verification increasingly difficult to sustain as policies grow. We propose Graph-Regularized Agentic Context Evolution (GRACE), which represents policy content as a typed semantic graph and performs scoped structural validation in the local neighborhood of modified nodes, rather than through document-wide consistency analysis. We evaluate GRACE within a fixed operational harness on the telecom domain of τ²-bench under a controlled distribution-shift protocol. GRACE improves strict reliability (pass^3) from 0.227 to 0.833 and is the only method that sustains improvement over the full trajectory. The three baselines each exhibit a different failure mode, revealing that contradiction avoidance alone is insufficient. Our results suggest that sustained policy evolution may require both a structural substrate for scoped verification and an active mechanism that consolidates accumulated knowledge.
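
A pass^k score is usually read as the chance that an agent solves a task in all of k independent attempts. The sketch below shows one common unbiased estimator from n recorded trials per task, C(c, k) / C(n, k), averaged over tasks. The abstract does not state how pass^3 was computed in this work, so the estimator choice and the function names are assumptions.

```python
from math import comb
from typing import Sequence, Tuple


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k trials drawn without replacement all succeed."""
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Three tasks with four trials each: 4, 3, and 2 successes.
print(mean_pass_hat_k([(4, 4), (4, 3), (4, 2)], k=3))
```

Because every one of the k attempts must succeed, a task solved in three of four trials contributes only 0.25 at k = 3, which is why the metric is stricter than average success rate.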

Test-Time Verification for Text-to-SQL via Outcome Reward Models

Poster
Abstract

Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs. In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries. Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.
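
The difference between the two selection rules can be sketched in a few lines. This is a toy illustration rather than the GradeSQL pipeline: the `Candidate` type, the `toy_scores` table standing in for a learned reward model, and the function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    sql: str
    result: Tuple[Tuple[int, ...], ...]  # rows returned when the query is executed


def majority_vote(candidates: Sequence[Candidate]) -> Candidate:
    """Pick a candidate whose execution result is the most frequent one."""
    winner, _ = Counter(c.result for c in candidates).most_common(1)[0]
    return next(c for c in candidates if c.result == winner)


def best_of_n(candidates: Sequence[Candidate],
              score: Callable[[Candidate], float]) -> Candidate:
    """Pick the candidate that a scoring function rates highest."""
    return max(candidates, key=score)


candidates = [
    Candidate("SELECT COUNT(*) FROM orders", ((12,),)),
    Candidate("SELECT COUNT(id) FROM orders", ((12,),)),
    Candidate("SELECT COUNT(*) FROM orders WHERE year = 2024", ((5,),)),
]
# Hypothetical scores standing in for a reward model's output.
toy_scores = {candidates[0].sql: 0.31, candidates[1].sql: 0.28, candidates[2].sql: 0.91}

print(majority_vote(candidates).sql)
print(best_of_n(candidates, lambda c: toy_scores[c.sql]).sql)
```

In this toy case two candidates agree on a result, so majority voting follows frequency, while the scored rule can prefer a minority candidate when the scorer judges it a better match for the question.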

SParK-Eval: Evaluating Structure-Aware Knowledge Acquisition in LLMs for Domain Adaptation to Industrial Records

Poster
Abstract

Large Language Models (LLMs) often underperform in domain adaptation for industrial settings, where available corpora are limited and structurally diverse. These corpora frequently include non-natural formats such as tables, entity lists, or bullet-point instructions that hinder effective learning. To understand and improve domain-adaptive pretraining (DAPT) on such data, we introduce SParK-Eval (Structure-aware Parametric Knowledge Evaluation), a framework that constructs question–answer pairs from pretraining data and annotates each with its input structure (e.g., natural sentence, table, list). This enables fine-grained analysis of how input structure affects parametric knowledge acquisition during DAPT. Additionally, we propose a prompt-based input normalization method that converts diverse inputs into coherent natural sentences, providing a reference for isolating structural effects. Our experiments show that LLMs acquire substantially more knowledge from natural sentences than from their structurally non-standard counterparts. These findings underscore the importance of structure-aware evaluation in diagnosing learning challenges and guiding effective domain adaptation strategies.

SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation

Poster
Abstract

We introduce SciTaRC, an expert-authored benchmark for question answering over scientific tables that targets composite, multi-step reasoning. To enable fine-grained diagnostic analysis beyond end-task accuracy, SciTaRC pairs each question with a manually constructed reasoning plan and explicit complexity metrics. State-of-the-art models fail on at least 23% of these questions, while highly capable open-weight models like Llama-3.3-70B collapse on 65.5% of the benchmark. Error analysis shows that, in zero-shot settings, failures are driven primarily by question comprehension, where models misinterpret the scientific query and derive the wrong reasoning objective. To determine whether overcoming this gap is sufficient, we use structured plans to decouple strategy formulation from execution. Surprisingly, providing step-by-step oracle plans yields only limited gains and fails to eliminate the performance gap. This reveals a substantial execution bottleneck: both natural language and code-based methods struggle to reliably carry out long-horizon computational chains over structured data. Ultimately, SciTaRC serves as a rigorous diagnostic testbed for studying both planning and execution in scientific table reasoning.

DSL-R1: From SQL to DSL for Training Retrieval Agents across Structured and Unstructured Data with Reinforcement Learning

Poster
Abstract

Effective retrieval in complex domains requires bridging the gap between structured metadata and unstructured content. Existing systems typically isolate these capabilities, relying on either symbolic filtering or vector similarity, failing to capture their interplay. In this work, we propose DSL-R1, a unified framework that synergizes logical reasoning with semantic matching via a novel Domain-Specific Language (DSL). By embedding vector primitives within SQL-style operators, our approach leverages the complementary strengths of symbolic precision and semantic coverage. We further introduce a reinforcement learning mechanism where rule-based execution feedback and retrieval quality rewards jointly optimize the DSL generation, balancing structural correctness and semantic alignment. Evaluations on a large-scale industrial email benchmark demonstrate that DSL-R1 achieves a +12.3% improvement in Hit@1/3, consistently outperforming decoupled baselines and establishing a robust paradigm for hybrid retrieval.

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Poster
Abstract

We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 16.8 minutes per workflow yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

Poster
Abstract

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graph-structured knowledge poses two key challenges: (1) navigating structured, schema-defined relations requires precise function calls rather than similarity-based retrieval, and (2) answering complex questions often demands multi-hop evidence aggregation through iterative information seeking. We propose GraphDancer, a reinforcement learning (RL) framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. To make RL effective for moderate-sized LLMs, we introduce a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories using an easy-to-hard biased sampler. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with either a 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills.

SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction

Poster
Abstract

The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models’ capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short-term (e.g., Best Paper Award) and long-term (e.g., Nobel Prize) settings. We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models vary substantially across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction.

Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

Poster
Abstract

In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss and errors in presenting tabular results in NL remain largely unexplored. This paper introduces a novel evaluation method, Combo-Eval, for judging LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and reducing LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.

Spatial Reasoning via Modality Switching Between Language and Symbolic Visualization

Poster
Abstract

Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often externalize our reasoning by sketching diagrams or drawing grids to better understand the underlying conceptual structure and avoid mistakes. Building upon this, our research investigates a) whether grounding multi-hop textual–spatial stories into other modalities that capture the geometry of the scene, like visualization or grids, improves reasoning abilities compared to natural language text-based inference; and b) whether the model is able to decide effectively when it should rely on natural language text-based reasoning and when it should switch between modalities. To assess this, we introduce a switching metric based on trustworthiness and complexity signals, which estimates when grounding a spatial story into structure is likely to improve performance, taking a first step towards principled modality selection in large language models' reasoning.