arXiv:2604.18177v2 Announce Type: replace-cross Abstract: Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into LLMs' compositional skill gaps and how to close them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark […]
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv:2604.19533v1 Announce Type: cross Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps […]
Care Trajectories Are Linked to Mental Health and Mortality in Cancer Patients
arXiv:2604.18431v2 Announce Type: replace-cross Abstract: Treatment of cancer involves heterogeneous, complex care pathways. The relationship between these longitudinal trajectories, baseline mental health, and prognostic outcomes remains poorly understood. We introduce an interpretable temporal analysis framework that leverages these dynamics, analyzing care patterns spanning up to 37 years for more than 8,000 patients. Using Dynamic Time Warping (DTW) and […]
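The DTW alignment at the core of this framework can be sketched in a few lines. This is a minimal pure-Python illustration of the standard DTW recurrence, not the paper's implementation; the sequences and cost function (absolute difference) are hypothetical stand-ins for encoded care trajectories.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    Allows non-linear alignment, so trajectories that follow the same
    pattern at different speeds still score as similar.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = cost of the best alignment of a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # step in a only
                                  dp[i][j - 1],      # step in b only
                                  dp[i - 1][j - 1])  # step in both
    return dp[n][m]
```

Pairwise DTW distances like this are what clustering methods then operate on to group patients with similar care patterns.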
Right for the Wrong Reasons: Epistemic Regret Minimization for LLM Causal Reasoning
arXiv:2602.11675v3 Announce Type: replace Abstract: Large language models may answer causal questions correctly for the wrong reasons, substituting associational shortcuts P(Y|X) for the interventional query P(Y|do(X)). Current RL methods reward what the model answers but not why, reinforcing these shortcuts until distribution shift exposes them. We introduce Epistemic Regret Minimization (ERM), a framework that identifies […]
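The gap between P(Y|X) and P(Y|do(X)) that this abstract describes can be made concrete with a toy structural causal model. The sketch below is illustrative only (it is not the ERM method); the variables and probabilities are hypothetical, chosen so that a confounder Z drives both X and Y while Y is independent of X given Z.

```python
import random

random.seed(0)
N = 100_000

def sample(do_x=None):
    """One draw from a toy SCM with confounder Z -> X and Z -> Y."""
    z = random.random() < 0.5
    # do_x=None: X follows Z; otherwise the Z -> X edge is cut
    x = (random.random() < (0.9 if z else 0.1)) if do_x is None else do_x
    y = random.random() < (0.7 if z else 0.2)   # Y depends only on Z
    return x, y

# Observational: condition on X=1, which inherits Z's bias
obs = [y for x, y in (sample() for _ in range(N)) if x]
p_obs = sum(obs) / len(obs)        # estimates P(Y=1 | X=1), ~0.65

# Interventional: force X=1, breaking the Z -> X edge
intv = [y for _, y in (sample(do_x=True) for _ in range(N))]
p_do = sum(intv) / len(intv)       # estimates P(Y=1 | do(X=1)), ~0.45
```

An associational shortcut that learns P(Y|X) here would overstate X's effect by roughly 0.2, exactly the kind of error that only surfaces under distribution shift.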
LEPO: Latent Reasoning Policy Optimization for Large Language Models
arXiv:2604.17892v2 Announce Type: replace-cross Abstract: Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge this gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring […]
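The Gumbel-Softmax trick mentioned here is a standard way to draw differentiable, near-one-hot samples from a categorical distribution. The following is a generic pure-Python sketch of the technique, not LEPO's actual latent-reasoning code; where the logits come from in the model is an assumption not shown here.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed one-hot vector: softmax((logits + Gumbel noise) / tau).

    Lower tau -> closer to a discrete one-hot sample; higher tau -> smoother.
    The noise makes repeated calls stochastic, yet the map stays differentiable.
    """
    # Gumbel(0, 1) noise via inverse transform sampling
    g = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(l + n) / tau for l, n in zip(logits, g)]
    # Numerically stable softmax
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]
```

Because the sample is a smooth function of the logits, gradients can flow through it during policy optimization, which is what makes the injected stochasticity "controllable" rather than a black-box sampler.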
Album: executable building blocks for scientific imaging routines, from sharing to LLM-assisted orchestration
arXiv:2110.00601v2 Announce Type: replace-cross Abstract: Open-source scientific software is a major driver of scientific progress, yet its development and reuse remain difficult in collaborative settings. Researchers repeatedly face four recurring challenges: discovering and reproducing existing routines, adapting them for new use cases, sharing and scaling them across collaborators, and stabilizing them with reproducible execution environments. […]
When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift
arXiv:2604.19514v1 Announce Type: cross Abstract: The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random […]
Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
arXiv:2509.18831v2 Announce Type: replace-cross Abstract: Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or […]
DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
arXiv:2604.17789v2 Announce Type: replace-cross Abstract: The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier […]
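The outlier sensitivity described in this abstract is easy to reproduce with a toy model of block-wise microscaling. The sketch below is an illustrative simplification, not NVIDIA's or the paper's implementation: the FP4 (E2M1) magnitude grid is standard, but the scale-selection rule is an assumption chosen for clarity.

```python
import math

# FP4 (E2M1) representable magnitudes
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mx4(block):
    """Quantize a 32-element block with one shared power-of-two (E8M0) scale,
    then round each element to the nearest FP4 magnitude."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * 32, 1.0
    # Power-of-two scale that maps the block maximum into FP4's range [0, 6]
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))
    q = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        q.append(math.copysign(mag * scale, v))
    return q, scale
```

With 31 elements equal to 1.0 and a single outlier of 100.0, the shared scale inflates to 32, every non-outlier rounds to 0.0, and all block information except the outlier is destroyed; this is the failure mode that motivates rotating outliers away before quantization.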
Investigating the structure of emotions by analyzing similarity and association of emotion words
arXiv:2602.06430v2 Announce Type: replace-cross Abstract: In the field of natural language processing, some studies have attempted sentiment analysis on text by handling emotions as explanatory or response variables. One of the most popular emotion models used in this context is the wheel of emotion proposed by Plutchik. This model schematizes human emotions in a circular […]
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
arXiv:2604.19485v1 Announce Type: cross Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. […]
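The explained-variance statistic in the title is a standard diagnostic of critic quality. Below is a generic pure-Python sketch of that metric, not EVPO's code; how the method adapts critic usage based on this value is not shown in the truncated abstract and is not assumed here.

```python
def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns).

    EV near 1: the critic tracks returns well (a useful baseline).
    EV <= 0: the critic predicts no better than the mean return.
    """
    n = len(returns)
    mean_r = sum(returns) / n
    var_r = sum((r - mean_r) ** 2 for r in returns) / n
    if var_r == 0.0:
        return float("nan")          # returns are constant; EV undefined
    resid = [r - v for r, v in zip(returns, values)]
    mean_e = sum(resid) / n
    var_e = sum((e - mean_e) ** 2 for e in resid) / n
    return 1.0 - var_e / var_r
```

A metric like this gives a concrete handle on the PPO-versus-GRPO design choice the abstract raises: when EV is high the learned baseline reduces variance, and when it is low a critic-free group baseline loses little.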
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
arXiv:2603.01455v3 Announce Type: replace-cross Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual […]