arXiv:2512.08121v1 Announce Type: cross Abstract: Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. […]
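The abstract notes that prevalence estimates come from an imperfect classifier (an LLM judge or human annotators). One standard way to see why the classifier matters is the Rogan–Gladen correction, which recovers true prevalence from the raw positive rate given the classifier's sensitivity and specificity. This is a minimal illustrative sketch of that classical correction, not necessarily the method proposed in the paper:

```python
def corrected_prevalence(p_obs, sensitivity, specificity):
    """Rogan-Gladen correction: recover the true prevalence of a behavior
    from the raw positive rate of an imperfect classifier."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must be better than chance")
    p = (p_obs + specificity - 1.0) / denom
    return min(1.0, max(0.0, p))  # clip to the valid range [0, 1]

# A judge flags 12% of outputs as violations, but has 90% sensitivity
# and 95% specificity; the corrected violation prevalence is lower:
print(round(corrected_prevalence(0.12, 0.90, 0.95), 4))  # → 0.0824
```

With a near-chance classifier the denominator shrinks toward zero and the estimate becomes unstable, which is one concrete sense in which the choice of classifier is "central to trustworthy evaluation."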
Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
arXiv:2512.08130v1 Announce Type: cross Abstract: Both model developers and policymakers seek to quantify and mitigate the risk of rapidly evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk […]
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
arXiv:2512.08153v1 Announce Type: cross Abstract: Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise […]
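The abstract describes recasting denoising as a search tree rooted in shared initial noise, combined with GRPO-style (group-relative) advantages. A minimal sketch of that idea, assuming a toy `denoise_step` stand-in and a branching rollout where siblings share their prefix computation (all names here are illustrative, not the paper's API):

```python
import random
from collections import defaultdict

def denoise_step(state, seed):
    # Hypothetical stand-in for one stochastic denoising step.
    rng = random.Random(seed)
    return state + rng.uniform(-1.0, 1.0)

def rollout_tree(root_state, branch=2, depth=3):
    """Expand a tree of denoising trajectories from shared initial noise.
    Each node reuses its parent's partial trajectory, so branch**depth
    samples cost far fewer steps than that many independent chains."""
    frontier = [(root_state, ())]
    for _ in range(depth):
        nxt = []
        for state, path in frontier:
            for b in range(branch):
                child_path = path + (b,)
                nxt.append((denoise_step(state, hash(child_path)), child_path))
        frontier = nxt
    return frontier  # list of (final_state, path) leaves

def tree_advantages(leaves, reward):
    """GRPO-style group-relative advantage computed within each sibling
    group: advantage = reward - mean(sibling rewards)."""
    groups = defaultdict(list)
    for state, path in leaves:
        groups[path[:-1]].append(reward(state))
    adv = {}
    for state, path in leaves:
        rs = groups[path[:-1]]
        adv[path] = reward(state) - sum(rs) / len(rs)
    return adv
```

By construction the advantages in each sibling group sum to zero, which is the baseline-subtraction property group-relative methods rely on.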
Information-Dense Reasoning for Efficient and Auditable Security Alert Triage
arXiv:2512.08169v1 Announce Type: cross Abstract: Security Operations Centers face massive, heterogeneous alert streams under minute-level service windows, creating the Alert Triage Latency Paradox: verbose reasoning chains ensure accuracy and compliance but incur prohibitive latency and token costs, while minimal chains sacrifice transparency and auditability. Existing solutions fail: signature systems are brittle, anomaly methods lack actionability, […]
Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
arXiv:2512.08188v1 Announce Type: cross Abstract: World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in […]
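The abstract frames world models as letting an agent predict future states and weigh action consequences before execution. A minimal sketch of deliberate tree-of-thoughts-style planning over a world model, using a toy additive `world_model` and a scalar goal score as obvious stand-ins for the paper's embodied components:

```python
def world_model(state, action):
    # Hypothetical stand-in: predicts the next state for an action.
    return state + action

def goal_score(state, goal=10):
    # Higher is better; 0 means the goal state is reached exactly.
    return -abs(goal - state)

def tree_of_thoughts_plan(init_state, actions, depth=3, beam=2):
    """Deliberate planning: expand candidate action sequences, score the
    *predicted* states with the world model, and keep only the best
    `beam` partial plans at each depth (breadth-limited tree search)."""
    frontier = [([], init_state)]
    for _ in range(depth):
        children = [(plan + [a], world_model(s, a))
                    for plan, s in frontier for a in actions]
        children.sort(key=lambda c: goal_score(c[1]), reverse=True)
        frontier = children[:beam]
    return max(frontier, key=lambda c: goal_score(c[1]))

plan, final = tree_of_thoughts_plan(0, actions=[1, 2, 4])
# The returned plan reaches the goal state 10 without ever executing
# an action in the real environment.
```

The point of the sketch is the ordering: every candidate is evaluated against predicted states, so a physically grounded world model directly determines plan quality.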
PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning
arXiv:2512.08218v1 Announce Type: cross Abstract: Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real-world graphs poorly in fixed-curvature spaces due to inherent geodesic disconnectedness issues, leading to suboptimal performance. Recent works find that non-Euclidean pseudo-Riemannian manifolds provide specific inductive biases […]
MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
arXiv:2512.08228v1 Announce Type: cross Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to […]
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
arXiv:2512.08240v1 Announce Type: cross Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce […]
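The abstract contrasts continuous compression (keeps coarse semantics, dilutes detail) with discrete quantization (keeps compact symbols, loses texture). A minimal sketch of what a hybrid of the two could look like on toy patch vectors; `mean_pool`, `quantize`, and `hybrid_compress` are illustrative names, not the paper's method:

```python
def mean_pool(tokens, window):
    """Continuous compression: average each window of patch tokens,
    preserving coarse semantics in far fewer tokens."""
    pooled = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        pooled.append([sum(dim) / len(chunk) for dim in zip(*chunk)])
    return pooled

def quantize(tokens, codebook):
    """Discrete compression: replace each token by the index of its
    nearest codebook entry (fine detail as compact symbols)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(t, codebook[k]))
            for t in tokens]

def hybrid_compress(tokens, codebook, window=4):
    # Hybrid stream: a few continuous pooled tokens plus discrete codes.
    return {"continuous": mean_pool(tokens, window),
            "discrete": quantize(tokens, codebook)}
```

Feeding both streams to the LLM trades a small token overhead for keeping high-level semantics and fine-grained detail simultaneously, which is the trade-off the abstract targets.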
Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation
arXiv:2511.12779v2 Announce Type: replace-cross Abstract: We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in […]
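The abstract seeks a partition of $n$ objectives into $k \ll n$ groups of related objectives. One natural notion of "related" is gradient alignment: objectives whose gradient estimates point in similar directions can be trained together without interfering. This is a simple greedy illustration of that idea (seed-based assignment by cosine similarity), not the paper's optimal-partition algorithm:

```python
def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def group_objectives(gradients, k):
    """Greedily partition n objectives into k groups of gradient-aligned
    objectives: seed k groups with the first k gradient estimates, then
    assign each remaining objective to the group whose seed gradient it
    is most cosine-aligned with."""
    seeds = list(range(k))
    groups = {s: [s] for s in seeds}
    for i in range(k, len(gradients)):
        best = max(seeds, key=lambda s: cosine(gradients[i], gradients[s]))
        groups[best].append(i)
    return list(groups.values())
```

In practice the gradient estimates themselves would come from short rollouts per objective; here they are just given vectors.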
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
arXiv:2512.08894v1 Announce Type: cross Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a […]
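The abstract proposes modeling benchmark performance directly as a function of training budget. A minimal sketch of one common functional form for such fits, a saturating power law where the benchmark *error* decays as $b \cdot C^{-\alpha}$ with compute $C$, fitted by least squares in log-log space. The functional form is an illustrative assumption, not necessarily the one the paper uses:

```python
import math

def fit_power_law(budgets, errors):
    """Fit error = b * C**(-alpha) by ordinary least squares on
    log(error) = log(b) - alpha * log(C)."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha, b = -slope, math.exp(my - slope * mx)
    return alpha, b

def predict_accuracy(alpha, b, budget):
    # Extrapolate benchmark accuracy at a larger training budget.
    return 1.0 - b * budget ** (-alpha)
```

Given a few (budget, accuracy) points from smaller runs, the fit extrapolates to larger budgets; whether such extrapolation is reliable is exactly the question the paper revisits.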