arXiv:2605.10980v1 Announce Type: cross Abstract: Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve […]
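The decoding scheme the abstract describes can be sketched as follows: under the conditional-independence assumption, every masked position gets a marginal distribution per step, and all positions whose top-1 probability clears a confidence threshold are committed in parallel. This is a minimal toy sketch of that loop; the random marginals stand in for a real dLLM forward pass, and the names (`parallel_decode`, `tau`) are illustrative, not from the paper.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position

def parallel_decode(length=8, tau=0.5, vocab=5, seed=0):
    """Confidence-thresholded parallel unmasking (toy stand-in for a dLLM).

    At each step, sample per-position marginals (here random Dirichlet draws
    instead of a model forward pass), then commit every masked position whose
    top-1 probability is >= tau. If none clears tau, commit the single most
    confident position so the loop always makes progress.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    steps = 0
    while (tokens == MASK).any():
        probs = rng.dirichlet(np.ones(vocab), size=length)  # fake marginals
        conf = probs.max(axis=1)
        masked = tokens == MASK
        commit = masked & (conf >= tau)
        if not commit.any():
            commit[np.flatnonzero(masked)[conf[masked].argmax()]] = True
        tokens[commit] = probs[commit].argmax(axis=1)
        steps += 1
    return tokens, steps
```

A higher `tau` keeps the marginal/joint discrepancy small but commits fewer positions per step, which is the parallelism-vs-fidelity tension the abstract points at.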
Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
arXiv:2605.10916v2 Announce Type: replace-cross Abstract: Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In […]
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
arXiv:2605.10978v1 Announce Type: new Abstract: Protein design aims to compose amino-acid sequences that fold into stable three-dimensional structures while satisfying targeted functional properties. The field is increasingly shifting toward vibe protein design, where a single model is expected to generate novel sequences, engineer existing proteins, and reason about protein characteristics through flexible natural-language constraints. Large […]
MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
arXiv:2605.06940v1 Announce Type: cross Abstract: Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated […]
Dispatch-Aware Ragged Attention for Pruned Vision Transformers
arXiv:2604.15408v2 Announce Type: replace-cross Abstract: Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet standard variable-length attention APIs — including FlashAttention-2’s varlen and PyTorch’s NestedTensor SDPA — fail to translate these savings into proportional wall-clock gains at the short post-pruning sequence lengths typical of ViTs ($\leq 197$ […]
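For readers unfamiliar with the varlen layout the abstract refers to: pruned sequences of different lengths are concatenated into one flat tensor, with a cumulative-lengths array marking segment boundaries, and attention is computed within each segment. The per-segment loop below gives the reference semantics only, not the fused kernel the paper is concerned with; the function name and layout details are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ragged_attention(q, k, v, cu_seqlens):
    """Attention over ragged (packed) sequences.

    q, k, v: (total_tokens, d) arrays with all sequences concatenated.
    cu_seqlens: cumulative lengths, e.g. [0, 3, 7] for lengths 3 and 4.
    Each segment attends only within itself, so no padding is materialized.
    """
    d = q.shape[-1]
    out = np.empty_like(q)
    for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        scores = q[s:e] @ k[s:e].T / np.sqrt(d)
        out[s:e] = softmax(scores, axis=-1) @ v[s:e]
    return out
```

The packing avoids padding FLOPs, but the per-segment dispatch overhead is exactly what dominates at the short post-pruning lengths the abstract highlights.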
How Useful Is Cross-Domain Generalization for Training LLM Monitors?
arXiv:2605.12265v1 Announce Type: new Abstract: Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We […]
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
arXiv:2605.12061v1 Announce Type: new Abstract: Long-term memory is becoming a central bottleneck for language agents. Existing RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structural roles, and improve the memory itself through downstream feedback. We introduce SAGE, […]
MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling
arXiv:2605.12154v1 Announce Type: new Abstract: Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance […]
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
arXiv:2605.11882v1 Announce Type: new Abstract: Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving […]
Counterfactual Trace Auditing of LLM Agent Skills
arXiv:2605.11946v1 Announce Type: new Abstract: Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework […]
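The contrast the abstract draws, between black-box pass-rate deltas and trace-level auditing, can be illustrated with a toy example where a skill changes the agent's tool-call trace without changing the pass rate at all. The agent, skill, and trace format below are hypothetical stand-ins, not the CTA framework itself.

```python
def run_agent(task, skill=None):
    """Toy agent: returns (passed, trace), where trace is the tool-call list."""
    trace = [f"search({task})"]
    if skill:
        trace.append(f"{skill}({task})")  # the skill injects an extra call
    trace.append(f"answer({task})")
    return True, trace  # toy: every run "passes"

def pass_rate_delta(tasks, skill):
    """Black-box evaluation: pass rate with the skill minus without it."""
    base = sum(run_agent(t)[0] for t in tasks) / len(tasks)
    with_skill = sum(run_agent(t, skill)[0] for t in tasks) / len(tasks)
    return with_skill - base

def trace_diff(task, skill):
    """Counterfactual comparison: steps present only in the with-skill trace."""
    _, without = run_agent(task)
    _, with_s = run_agent(task, skill)
    return [step for step in with_s if step not in without]
```

Here `pass_rate_delta` reports zero effect while `trace_diff` surfaces the behavioral change the skill actually introduced, which is the kind of signal a trace-level audit is after.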
Continuous Flood Nowcasting in South Asia: A Multi-Sensor Ensemble Remote Sensing Framework for Flood Extent
arXiv:2605.10950v1 Announce Type: cross Abstract: Pakistan experienced an unusually severe flood season between June and December 2025, with cascading impacts on population, infrastructure, and agriculture. Existing operational flood products (e.g., UNOSAT) provide valuable episode-level snapshots but rarely deliver spatially and temporally continuous inundation maps at near-real-time latency within the country. We present a multi-sensor, ensemble-based […]
OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling
arXiv:2605.11738v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as optimization-modeling hallucination […]
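The failure mode the abstract names is easy to exhibit concretely: two formulations can share an optimal objective value, so a value-matching check passes, while their feasible regions and optimal solutions differ. The toy models and brute-force grid below are illustrative, not from the paper.

```python
import itertools

def solve(objective, feasible, grid):
    """Brute-force a small integer program: return (optimum, set of argmaxes)."""
    pts = [p for p in grid if feasible(p)]
    best = max(objective(p) for p in pts)
    return best, {p for p in pts if objective(p) == best}

grid = list(itertools.product(range(11), repeat=2))  # (x, y) in {0..10}^2

# Reference model: maximize x + y subject to x + y <= 10.
ref = solve(lambda p: p[0] + p[1], lambda p: p[0] + p[1] <= 10, grid)

# Semantically different model: maximize 2x subject to x <= 5 (y free).
alt = solve(lambda p: 2 * p[0], lambda p: p[0] <= 5, grid)

assert ref[0] == alt[0] == 10  # objective values match: value check passes
assert ref[1] != alt[1]        # but the optimal solution sets differ
```

Detecting this kind of disagreement requires inspecting the formulation or its solution set, not just the reported optimum, which motivates the hallucination-detection framing.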