arXiv:2601.17982v1 Announce Type: cross Abstract: Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) […]
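The abstract is truncated before the reward definition, so the following is only an illustrative sketch of the general idea, not SD-E$^2$'s exact formulation: a semantic diversity reward over frozen sentence embeddings, scored as the mean pairwise cosine distance among sampled reasoning trajectories.

```python
import numpy as np

def semantic_diversity_reward(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among trajectory embeddings.

    embeddings: (n, d) array with one row per reasoning trajectory,
    produced by a frozen sentence-embedding model (hypothetical setup).
    """
    # Normalize rows to unit length so dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T                      # (n, n) cosine similarities
    n = len(embeddings)
    # Average similarity over distinct pairs, excluding the diagonal.
    off_diag = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - off_diag                     # higher = more semantically diverse

# Identical trajectories earn zero diversity reward.
print(semantic_diversity_reward(np.ones((4, 8))))   # → 0.0
# Mutually orthogonal trajectories earn the maximum reward of 1.
print(semantic_diversity_reward(np.eye(3)))         # → 1.0
```

In an RL loop this scalar would typically be mixed into the task reward with a weighting coefficient; the mixing scheme here is an assumption, not something stated in the abstract.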
TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
arXiv:2601.17332v1 Announce Type: new Abstract: The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: statement formalization, proof generation, premise selection, proof correction and […]
LungCRCT: Causal Representation based Lung CT Processing for Lung Cancer Treatment
arXiv:2601.18118v1 Announce Type: cross Abstract: Because it is often asymptomatic in its early stages, lung cancer remains one of the leading causes of cancer mortality worldwide. Moreover, its major symptoms are hard to differentiate from those of other respiratory diseases such as COPD, further leading patients to overlook cancer progression in early stages. […]
BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation
arXiv:2601.18253v1 Announce Type: cross Abstract: Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike […]
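The abstract is cut off before BoRP's method is described, so the sketch below shows only the generic idea its name suggests: fit linear regression probes on bootstrap resamples of human-labeled data, and use the spread of predictions as an uncertainty signal. The feature matrix `X` (e.g. conversation embeddings) and satisfaction labels `y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_probe(X: np.ndarray, y: np.ndarray, n_boot: int = 200):
    """Fit a linear probe on each bootstrap resample; return the
    per-sample mean prediction and its bootstrap standard deviation."""
    n = len(X)
    Xb = np.hstack([X, np.ones((n, 1))])       # append a bias column
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        w, *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        preds.append(Xb @ w)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Toy data: satisfaction driven by the first feature plus small noise.
X = rng.normal(size=(100, 4))
y = 0.8 * X[:, 0] + 0.1 * rng.normal(size=100)
mean_pred, pred_std = bootstrap_probe(X, y)
```

The bootstrap standard deviation gives a rough per-example confidence band, which is the usual motivation for bootstrapping a probe rather than fitting it once.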
Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction
arXiv:2601.10132v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and […]
Auditing Disability Representation in Vision-Language Models
arXiv:2601.17348v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed in socially sensitive applications, yet their behavior with respect to disability remains underexplored. We study disability-aware descriptions of person-centric images, where models often shift from evidence-grounded factual description to interpretation, introducing unsupported inferences beyond observable visual evidence. To […]
Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models
arXiv:2504.04372v3 Announce Type: replace-cross Abstract: Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a model's ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from […]
A Syllogistic Probe: Tracing the Evolution of Logic Reasoning in Large Language Models
arXiv:2601.17426v1 Announce Type: new Abstract: Human logic has gradually shifted from intuition-driven inference to rigorous formal systems. Motivated by recent advances in large language models (LLMs), we explore whether LLMs exhibit a similar evolution in the underlying logical framework. Using existential import as a probe, we evaluate syllogisms under traditional and modern logic. Through […]
A Markov Categorical Framework for Language Modeling
arXiv:2507.19247v4 Announce Type: replace-cross Abstract: Autoregressive language models achieve remarkable performance, yet a unified theory of their internal mechanisms, one explaining how training shapes their representations and enables complex behaviors, remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This […]
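For readers unfamiliar with the framing, morphisms in a Markov category are (in the discrete case) conditional probability kernels, and "composition of information-processing stages" reduces to the standard kernel composition rule; this is background notation, not the paper's specific construction:

```latex
(g \circ f)(z \mid x) \;=\; \sum_{y \in Y} g(z \mid y)\, f(y \mid x),
\qquad f : X \to Y,\; g : Y \to Z
```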
$\beta$-diversity and Graph Sheaf Laplacians
arXiv:2601.17466v1 Announce Type: new Abstract: We suggest a new approach to $\beta$-diversity in ecological systems, based on the energy of the graph sheaf Laplacian associated with the sample data. This scalar quantity is easily computable using methods of linear algebra. We show using simple examples that the energy is much more informative than the generally […]
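The sheaf construction is not spelled out in the truncated abstract; as a simpler stand-in, here is the energy of an ordinary (non-sheaf) graph Laplacian, conventionally defined as the sum of absolute deviations of its eigenvalues from the average vertex degree. The sheaf version would replace this Laplacian with one built from the sample data.

```python
import numpy as np

def laplacian_energy(adj: np.ndarray) -> float:
    """Energy of the graph Laplacian L = D - A: sum of |lambda_i - 2m/n|,
    where 2m/n is the average degree of the graph."""
    deg = adj.sum(axis=1)
    L = np.diag(deg) - adj                 # combinatorial Laplacian
    eigvals = np.linalg.eigvalsh(L)        # L is symmetric, so eigvalsh applies
    avg_deg = deg.sum() / len(adj)
    return float(np.abs(eigvals - avg_deg).sum())

# Single edge between two vertices: eigenvalues {0, 2}, average degree 1,
# so the energy is |0 - 1| + |2 - 1| = 2.
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
print(laplacian_energy(adj))   # → 2.0
```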
DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning
arXiv:2510.06913v2 Announce Type: replace-cross Abstract: Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key […]
Lattice: Generative Guardrails for Conversational Agents
arXiv:2601.17481v1 Announce Type: new Abstract: Conversational AI systems require guardrails to prevent harmful outputs, yet existing approaches use static rules that cannot adapt to new threats or deployment contexts. We introduce Lattice, a framework for self-constructing and continuously improving guardrails. Lattice operates in two stages: construction builds initial guardrails from labeled examples through iterative simulation […]