arXiv:2512.08121v1 Announce Type: cross Abstract: Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. […]
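The abstract notes that prevalence estimates come from an imperfect classifier (an LLM judge or human annotators). One standard way to see why the classifier matters is the Rogan–Gladen correction, which recovers true prevalence from the raw positive rate given the classifier's sensitivity and specificity. This is a minimal illustrative sketch of that classical correction, not necessarily the method proposed in the paper:

```python
def corrected_prevalence(p_obs, sensitivity, specificity):
    """Rogan-Gladen correction: recover the true prevalence of a behavior
    from the raw positive rate of an imperfect classifier."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must be better than chance")
    p = (p_obs + specificity - 1.0) / denom
    return min(1.0, max(0.0, p))  # clip to the valid range [0, 1]

# A judge flags 12% of outputs as violations, but has 90% sensitivity
# and 95% specificity; the corrected violation prevalence is lower:
print(round(corrected_prevalence(0.12, 0.90, 0.95), 4))  # → 0.0824
```

With a near-chance classifier the denominator shrinks toward zero and the estimate becomes unstable, which is one concrete sense in which the choice of classifier is "central to trustworthy evaluation."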
Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
arXiv:2512.08130v1 Announce Type: cross Abstract: Both model developers and policymakers seek to quantify and mitigate the risk of rapidly evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk […]
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
arXiv:2512.08153v1 Announce Type: cross Abstract: Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise […]
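The abstract describes recasting denoising as a search tree rooted in shared initial noise, combined with GRPO-style (group-relative) advantages. A minimal sketch of that idea, assuming a toy `denoise_step` stand-in and a branching rollout where siblings share their prefix computation (all names here are illustrative, not the paper's API):

```python
import random
from collections import defaultdict

def denoise_step(state, seed):
    # Hypothetical stand-in for one stochastic denoising step.
    rng = random.Random(seed)
    return state + rng.uniform(-1.0, 1.0)

def rollout_tree(root_state, branch=2, depth=3):
    """Expand a tree of denoising trajectories from shared initial noise.
    Each node reuses its parent's partial trajectory, so branch**depth
    samples cost far fewer steps than that many independent chains."""
    frontier = [(root_state, ())]
    for _ in range(depth):
        nxt = []
        for state, path in frontier:
            for b in range(branch):
                child_path = path + (b,)
                nxt.append((denoise_step(state, hash(child_path)), child_path))
        frontier = nxt
    return frontier  # list of (final_state, path) leaves

def tree_advantages(leaves, reward):
    """GRPO-style group-relative advantage computed within each sibling
    group: advantage = reward - mean(sibling rewards)."""
    groups = defaultdict(list)
    for state, path in leaves:
        groups[path[:-1]].append(reward(state))
    adv = {}
    for state, path in leaves:
        rs = groups[path[:-1]]
        adv[path] = reward(state) - sum(rs) / len(rs)
    return adv
```

By construction the advantages in each sibling group sum to zero, which is the baseline-subtraction property group-relative methods rely on.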
Information-Dense Reasoning for Efficient and Auditable Security Alert Triage
arXiv:2512.08169v1 Announce Type: cross Abstract: Security Operations Centers face massive, heterogeneous alert streams under minute-level service windows, creating the Alert Triage Latency Paradox: verbose reasoning chains ensure accuracy and compliance but incur prohibitive latency and token costs, while minimal chains sacrifice transparency and auditability. Existing solutions fail: signature systems are brittle, anomaly methods lack actionability, […]
Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
arXiv:2512.08188v1 Announce Type: cross Abstract: World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in […]
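The abstract frames world models as letting an agent predict future states and weigh action consequences before execution. A minimal sketch of deliberate tree-of-thoughts-style planning over a world model, using a toy additive `world_model` and a scalar goal score as obvious stand-ins for the paper's embodied components:

```python
def world_model(state, action):
    # Hypothetical stand-in: predicts the next state for an action.
    return state + action

def goal_score(state, goal=10):
    # Higher is better; 0 means the goal state is reached exactly.
    return -abs(goal - state)

def tree_of_thoughts_plan(init_state, actions, depth=3, beam=2):
    """Deliberate planning: expand candidate action sequences, score the
    *predicted* states with the world model, and keep only the best
    `beam` partial plans at each depth (breadth-limited tree search)."""
    frontier = [([], init_state)]
    for _ in range(depth):
        children = [(plan + [a], world_model(s, a))
                    for plan, s in frontier for a in actions]
        children.sort(key=lambda c: goal_score(c[1]), reverse=True)
        frontier = children[:beam]
    return max(frontier, key=lambda c: goal_score(c[1]))

plan, final = tree_of_thoughts_plan(0, actions=[1, 2, 4])
# The returned plan reaches the goal state 10 without ever executing
# an action in the real environment.
```

The point of the sketch is the ordering: every candidate is evaluated against predicted states, so a physically grounded world model directly determines plan quality.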
PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning
arXiv:2512.08218v1 Announce Type: cross Abstract: Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real-world graphs poorly in fixed-curvature spaces due to inherent geodesic disconnectedness issues, leading to suboptimal performance. Recent works find that non-Euclidean pseudo-Riemannian manifolds provide specific inductive biases […]
MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
arXiv:2512.08228v1 Announce Type: cross Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to […]
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
arXiv:2512.08240v1 Announce Type: cross Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce […]
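The abstract contrasts continuous compression (keeps coarse semantics, dilutes detail) with discrete quantization (keeps compact symbols, loses texture). A minimal sketch of what a hybrid of the two could look like on toy patch vectors; `mean_pool`, `quantize`, and `hybrid_compress` are illustrative names, not the paper's method:

```python
def mean_pool(tokens, window):
    """Continuous compression: average each window of patch tokens,
    preserving coarse semantics in far fewer tokens."""
    pooled = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        pooled.append([sum(dim) / len(chunk) for dim in zip(*chunk)])
    return pooled

def quantize(tokens, codebook):
    """Discrete compression: replace each token by the index of its
    nearest codebook entry (fine detail as compact symbols)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(t, codebook[k]))
            for t in tokens]

def hybrid_compress(tokens, codebook, window=4):
    # Hybrid stream: a few continuous pooled tokens plus discrete codes.
    return {"continuous": mean_pool(tokens, window),
            "discrete": quantize(tokens, codebook)}
```

Feeding both streams to the LLM trades a small token overhead for keeping high-level semantics and fine-grained detail simultaneously, which is the trade-off the abstract targets.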
Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation
arXiv:2511.12779v2 Announce Type: replace-cross Abstract: We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in […]
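The abstract seeks a partition of $n$ objectives into $k \ll n$ groups of related objectives. One natural notion of "related" is gradient alignment: objectives whose gradient estimates point in similar directions can be trained together without interfering. This is a simple greedy illustration of that idea (seed-based assignment by cosine similarity), not the paper's optimal-partition algorithm:

```python
def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def group_objectives(gradients, k):
    """Greedily partition n objectives into k groups of gradient-aligned
    objectives: seed k groups with the first k gradient estimates, then
    assign each remaining objective to the group whose seed gradient it
    is most cosine-aligned with."""
    seeds = list(range(k))
    groups = {s: [s] for s in seeds}
    for i in range(k, len(gradients)):
        best = max(seeds, key=lambda s: cosine(gradients[i], gradients[s]))
        groups[best].append(i)
    return list(groups.values())
```

In practice the gradient estimates themselves would come from short rollouts per objective; here they are just given vectors.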
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
arXiv:2512.08894v1 Announce Type: cross Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a […]
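The abstract proposes modeling benchmark performance directly as a function of training budget. A minimal sketch of one common functional form for such fits, a saturating power law where the benchmark *error* decays as $b \cdot C^{-\alpha}$ with compute $C$, fitted by least squares in log-log space. The functional form is an illustrative assumption, not necessarily the one the paper uses:

```python
import math

def fit_power_law(budgets, errors):
    """Fit error = b * C**(-alpha) by ordinary least squares on
    log(error) = log(b) - alpha * log(C)."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha, b = -slope, math.exp(my - slope * mx)
    return alpha, b

def predict_accuracy(alpha, b, budget):
    # Extrapolate benchmark accuracy at a larger training budget.
    return 1.0 - b * budget ** (-alpha)
```

Given a few (budget, accuracy) points from smaller runs, the fit extrapolates to larger budgets; whether such extrapolation is reliable is exactly the question the paper revisits.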