arXiv:2601.08258v2 Announce Type: replace Abstract: We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl’s Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we […]
Training Tensor Attention Efficiently: From Cubic to Almost Linear Time
arXiv:2405.16411v3 Announce Type: replace-cross Abstract: Tensor Attention, a multi-view attention that is able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $O(n^3)$ time complexity of tensor attention poses a significant obstacle to its utilization in transformers, where $n$ is the input sequence length. In this […]
MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings
arXiv:2503.03008v3 Announce Type: replace-cross Abstract: Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This […]
SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
arXiv:2505.20347v2 Announce Type: replace-cross Abstract: Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose […]
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
arXiv:2506.18951v4 Announce Type: replace-cross Abstract: Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging […]
Forecasting Energy Consumption using Recurrent Neural Networks: A Comparative Analysis
arXiv:2601.17110v1 Announce Type: cross Abstract: Accurate short-term energy consumption forecasting is essential for efficient power grid management, resource allocation, and market stability. Traditional time-series models often fail to capture the complex, non-linear dependencies and external factors affecting energy demand. In this study, we propose a forecasting approach based on Recurrent Neural Networks (RNNs) and their […]
PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction
arXiv:2601.17074v1 Announce Type: cross Abstract: The accurate estimation of Arctic snow depth ($h_s$) remains a critical time-varying inverse problem due to the extreme scarcity and noise inherent in associated sea ice parameters. Existing process-based and data-driven models are either highly sensitive to sparse data or lack the physical interpretability required for climate-critical applications. To address […]
Beyond Pairwise Comparisons: A Distributional Test of Distinctiveness for Machine-Generated Works in Intellectual Property Law
arXiv:2601.18156v1 Announce Type: cross Abstract: Key doctrines, including novelty (patent), originality (copyright), and distinctiveness (trademark), turn on a shared empirical question: whether a body of work is meaningfully distinct from a relevant reference class. Yet analyses typically operationalize this set-level inquiry using item-level evidence: pairwise comparisons among exemplars. That unit-of-analysis mismatch may be manageable for […]
PC-MCL: Patient-Consistent Multi-Cycle Learning with multi-label bias correction for respiratory sound classification
arXiv:2601.17080v1 Announce Type: cross Abstract: Automated respiratory sound classification supports the diagnosis of pulmonary diseases. However, many deep models still rely on cycle-level analysis and suffer from patient-specific overfitting. We propose PC-MCL (Patient-Consistent Multi-Cycle Learning) to address these limitations by utilizing three key components: multi-cycle concatenation, a 3-label formulation, and a patient-matching auxiliary task. Our […]
Emergent Cooperation in Quantum Multi-Agent Reinforcement Learning Using Communication
arXiv:2601.18419v1 Announce Type: cross Abstract: Emergent cooperation in classical Multi-Agent Reinforcement Learning has gained significant attention, particularly in the context of Sequential Social Dilemmas (SSDs). While classical reinforcement learning approaches have demonstrated capability for emergent cooperation, research on extending these methods to Quantum Multi-Agent Reinforcement Learning remains limited, particularly through communication. In this paper, we […]
Linguistic and Argument Diversity in Synthetic Data for Function-Calling Agents
arXiv:2601.17829v1 Announce Type: cross Abstract: The construction of function-calling agents has emerged as a promising avenue for extending model capabilities. A major challenge for this task is obtaining high-quality, diverse data for training. Prior work emphasizes diversity in functions, invocation patterns, and interaction turns, yet linguistic diversity of requests and coverage of arguments […]
Coding-Enforced Resilient and Secure Aggregation for Hierarchical Federated Learning
arXiv:2601.17995v1 Announce Type: cross Abstract: Hierarchical federated learning (HFL) has emerged as an effective paradigm to enhance link quality between clients and the server. However, ensuring model accuracy while preserving privacy under unreliable communication remains a key challenge in HFL, as the coordination among privacy noise can be randomly disrupted. To address this limitation, we […]