arXiv:2605.18932v2 Announce Type: replace-cross Abstract: In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We […]
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
arXiv:2605.02010v2 Announce Type: replace Abstract: This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value — yet AI learns from it […]
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
arXiv:2605.25252v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, verifiers are rarely perfect. Recent theoretical work predicts that verifier noise affects the rate of learning but not its final outcome, implying that sufficient compute should close any gap induced by imperfect […]
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
arXiv:2406.09250v4 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content […]
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
arXiv:2605.18840v2 Announce Type: replace-cross Abstract: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses […]
EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization
arXiv:2508.12479v2 Announce Type: replace-cross Abstract: Min-max optimization arises in many domains, such as game theory and adversarial machine learning. For these problems, gradient-based methods are well understood and enjoy strong guarantees. However, in the absence of convexity or concavity, existing approaches study convergence to an approximate saddle point or first-order stationary points, which may be […]
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
arXiv:2605.25240v1 Announce Type: cross Abstract: Two methodologies dominate current benchmarking practice: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores […]
Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches
arXiv:2512.12677v3 Announce Type: replace-cross Abstract: We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM’s final-token embedding as a sequence representation, and (2) instruction-tuning […]
Simply Stabilizing the Loop via Fully Looped Transformer
arXiv:2605.18797v2 Announce Type: replace-cross Abstract: Scaling model performance typically requires increasing model size. The Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural […]
Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems
arXiv:2602.03695v2 Announce Type: replace-cross Abstract: While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them […]
Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies
arXiv:2605.25235v1 Announce Type: cross Abstract: We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility-decision model), and (iii) bounds the size of a PAC-sufficient explanation with a Bonferroni-corrected Hoeffding sufficient-subset test […]
UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates
arXiv:2603.29897v2 Announce Type: replace-cross Abstract: Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image […]