LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

arXiv:2601.16504v3 Announce Type: replace-cross Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements […]
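
A minimal sketch of how such pairwise items could be represented, assuming a hypothetical schema; the scenario text, label names, and field names below are illustrative and not drawn from the benchmark:

```python
from dataclasses import dataclass
from enum import Enum

class JointLabel(Enum):
    """Logical relations between two atomic statements
    (illustrative names; the benchmark's label set may differ)."""
    JOINTLY_PLAUSIBLE = "both statements can hold together"
    MUTUALLY_EXCLUSIVE = "at most one statement can hold"
    JOINTLY_IMPLAUSIBLE = "neither statement is plausible"

@dataclass
class StatementPair:
    context: str   # shared commonsense scenario
    s1: str        # first atomic statement
    s2: str        # second atomic statement
    label: JointLabel

# Hypothetical example item
item = StatementPair(
    context="Alice left her ice cream on a bench in the summer sun.",
    s1="The ice cream melted.",
    s2="The ice cream stayed frozen for hours.",
    label=JointLabel.MUTUALLY_EXCLUSIVE,
)
```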

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

arXiv:2601.15690v2 Announce Type: replace Abstract: While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional shift in how this challenge is addressed: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how […]
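
The passive-to-active framing can be illustrated with a generic pattern: instead of merely reporting uncertainty after generation, a runtime gate acts on it. The sketch below uses next-token entropy with a placeholder `model_step` function and an arbitrary threshold; it is one simple instance of an active control signal, not a method from the survey:

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def generate_with_abstention(model_step, prompt, max_tokens=64, tau=2.5):
    """Uncertainty as an *active* signal: stop and abstain when the
    model's next-token entropy exceeds threshold tau, instead of only
    logging it after the fact. `model_step` is a placeholder returning
    (token, next_token_probs) for the running text."""
    text = prompt
    for _ in range(max_tokens):
        token, probs = model_step(text)
        if token_entropy(probs) > tau:
            return text, "ABSTAIN: uncertainty too high"
        text += token
    return text, "OK"
```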

Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values

arXiv:2506.00079v2 Announce Type: replace-cross Abstract: The rapid integration of Large Language Models (LLMs) in high-stakes decision-making — such as allocating scarce resources like donor organs — raises critical questions about their alignment with human moral values. We systematically evaluate the behavior of several prominent LLMs against human preferences in kidney allocation scenarios and show that […]

AgileLog: A Forkable Shared Log for Agents on Data Streams

arXiv:2604.14590v2 Announce Type: replace-cross Abstract: In modern data-streaming systems, alongside traditional programs, a new type of entity has emerged that can interact with streaming data: AI agents. Unlike traditional programs, AI agents use LLM reasoning to accomplish high-level tasks specified in natural language over streaming data. Unfortunately, current streaming systems cannot fully support agents: they […]
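
The abstract is truncated before it defines forking, but a plausible reading of a "forkable shared log" is copy-on-write branching over an append-only record sequence: forks share the parent's immutable prefix and append privately after the fork point. The following sketch is hypothetical and not AgileLog's actual API:

```python
class ForkableLog:
    """Minimal forkable append-only log (hypothetical interface)."""
    def __init__(self, parent=None):
        self._parent = parent
        self._fork_len = len(parent) if parent is not None else 0
        self._entries = []

    def append(self, record):
        self._entries.append(record)

    def fork(self):
        # New branch sharing this log's current prefix
        return ForkableLog(parent=self)

    def __len__(self):
        return self._fork_len + len(self._entries)

    def read(self, i):
        if i < self._fork_len:
            return self._parent.read(i)
        return self._entries[i - self._fork_len]

main = ForkableLog()
main.append({"event": "sensor", "value": 1})
branch = main.fork()          # an agent explores a side branch
branch.append({"event": "agent_note", "text": "hypothesis A"})
assert main.read(0) == branch.read(0)   # shared prefix
assert len(main) == 1 and len(branch) == 2
```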

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

arXiv:2604.18510v1 Announce Type: cross Abstract: Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), […]

STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing

arXiv:2604.14927v2 Announce Type: replace-cross Abstract: Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face […]
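
The "retained source-face" transfer suggests a simple mechanism: if each triangle records which analytic B-Rep face it was tessellated from, face-level instance labels map onto the mesh by lookup. A minimal illustration, assuming hypothetical inputs (the real toolchain is certainly more involved):

```python
def transfer_partition_to_mesh(face_to_instance, tri_source_faces):
    """Label-transfer sketch: map instance labels defined on B-Rep
    faces onto mesh triangles via retained source-face IDs.

    face_to_instance: dict, B-Rep face ID -> instance ID
    tri_source_faces: list, triangle index -> source B-Rep face ID
    """
    return [face_to_instance[f] for f in tri_source_faces]

# Hypothetical example: faces 10 and 11 grouped into one instance
face_to_instance = {10: 0, 11: 0, 12: 1}
tri_source_faces = [10, 10, 11, 12, 12]   # 5 triangles
print(transfer_partition_to_mesh(face_to_instance, tri_source_faces))
# -> [0, 0, 0, 1, 1]
```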

Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals

arXiv:2604.17714v1 Announce Type: cross Abstract: LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We adapt the validity screening principle from clinical personality assessment (PAI, MMPI-3) into a portable protocol for benchmark-based LLM confidence data. The […]
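
The abstract is cut off before the protocol itself, but the screening idea can be sketched generically: test whether confidence separates correct from incorrect items (here via rank AUROC) before building on the signal. The Valid/Indeterminate/Invalid labels follow the companion paper below; the cutoffs here are placeholders, not the authors' values:

```python
def auroc(confidences, correct):
    """Rank AUROC: probability a correct item gets higher
    confidence than an incorrect one (ties count half)."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        return None  # degenerate sample: cannot screen
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def validity_screen(confidences, correct, lo=0.55, hi=0.60):
    """Classify a confidence signal before interpreting it.
    Cutoffs lo/hi are illustrative placeholders."""
    a = auroc(confidences, correct)
    if a is None or a < lo:
        return "Invalid"
    return "Valid" if a >= hi else "Indeterminate"
```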

Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

arXiv:2604.17716v1 Announce Type: cross Abstract: The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = […]
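
Type 2 AUROC measures how well confidence ranks correct responses above incorrect ones; selective prediction is the downstream use that the screen is validated against. A generic sketch of the latter, not the paper's exact evaluation:

```python
def selective_accuracy(confidences, correct, coverage=0.8):
    """Selective prediction sketch: answer only the `coverage`
    fraction of items the model is most confident on, and measure
    accuracy over that answered subset. A valid confidence signal
    should raise selective accuracy as coverage shrinks."""
    ranked = sorted(zip(confidences, correct), key=lambda t: -t[0])
    k = max(1, int(coverage * len(ranked)))
    answered = ranked[:k]
    return sum(y for _, y in answered) / k

# Hypothetical run where confidence is informative
conf = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
ok   = [1,   1,   1,   0,   0,   0]
print(selective_accuracy(conf, ok, coverage=0.5))  # -> 1.0
```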

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

arXiv:2604.15271v2 Announce Type: replace-cross Abstract: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a […]
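
The perturbation-energy score itself is not described in the truncated abstract. For context, here is the standard single-forward-pass baseline such methods are typically compared against: a per-pixel softmax-entropy map from one set of logits. This is a generic illustration, not SegWithU:

```python
import numpy as np

def single_pass_uncertainty(logits):
    """Per-pixel softmax entropy over class logits of shape
    (C, H, W): one forward pass, no sampling or ensembling."""
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=0)     # (H, W) map

# Hypothetical 3-class logits on a 2x2 image
logits = np.random.randn(3, 2, 2)
print(single_pass_uncertainty(logits))
```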

GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion

arXiv:2604.17721v1 Announce Type: cross Abstract: We address the challenge of point cloud registration using color information, where traditional methods relying solely on geometric features often struggle in low-overlap and incomplete scenarios. To overcome these limitations, we propose GeGS-PCR, a novel two-stage method that combines geometric, color, and Gaussian information for robust registration. Our approach incorporates […]
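
As a point of reference for why color helps, here is a generic color-aided rigid registration sketch: nearest-neighbour matching in a joint position-color space so that color disambiguates geometrically similar regions, followed by a Kabsch fit. GeGS-PCR's two-stage Gaussian-splatting pipeline is a different and more robust construction:

```python
import numpy as np

def register_with_color(src_xyz, src_rgb, tgt_xyz, tgt_rgb, w=0.5):
    """Baseline color-aided rigid registration (illustrative only).
    w weights color against position in the matching space."""
    src = np.hstack([src_xyz, w * src_rgb])
    tgt = np.hstack([tgt_xyz, w * tgt_rgb])
    # Nearest-neighbour correspondences in the joint feature space
    d = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    nn = d.argmin(axis=1)
    A, B = src_xyz, tgt_xyz[nn]
    # Kabsch: optimal rotation/translation between matched sets
    ca, cb = A.mean(0), B.mean(0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = cb - R @ ca
    return R, t
```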

Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval

arXiv:2604.17906v1 Announce Type: cross Abstract: While Large Language Models (LLMs) exhibit exceptional zero-shot relevance modeling, their high computational cost necessitates framing passage retrieval as a budget-constrained global optimization problem. Existing approaches passively rely on first-stage dense retrievers, which leads to two limitations: (1) failing to retrieve relevant passages in semantically distinct clusters, and (2) failing […]
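
A budget-constrained loop of this general shape can be sketched under assumptions: an upper-confidence-bound acquisition, a placeholder `llm_score` callable standing in for the LLM relevance judge, and scikit-learn's default GP kernel. The paper's actual acquisition rule may differ:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_guided_retrieval(embeddings, llm_score, budget=20, kappa=1.0, seed=0):
    """Fit a GP on passages scored so far, then spend each LLM call
    on the passage with the highest mean + kappa * std, so the budget
    covers both promising and unexplored regions of embedding space."""
    rng = np.random.default_rng(seed)
    scored_idx = list(rng.choice(len(embeddings), 3, replace=False))
    scores = [llm_score(i) for i in scored_idx]
    for _ in range(budget - len(scored_idx)):
        gp = GaussianProcessRegressor().fit(embeddings[scored_idx], scores)
        mu, sd = gp.predict(embeddings, return_std=True)
        ucb = mu + kappa * sd
        ucb[scored_idx] = -np.inf          # never re-score a passage
        i = int(ucb.argmax())
        scored_idx.append(i)
        scores.append(llm_score(i))
    return scored_idx, scores
```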

Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

arXiv:2604.17707v1 Announce Type: cross Abstract: Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), […]
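
The parenthetical glosses admit a simple illustrative operationalisation; the formulas below are assumptions derived from those glosses alone, and the paper's actual index definitions may differ:

```python
def validity_indices(items):
    """Two of the six indices, operationalised from the abstract's
    glosses: L = mean stated confidence on items answered wrongly;
    K = fraction of wrong items the model nevertheless bet on.
    `items`: list of dicts with keys 'correct' (bool),
    'confidence' (0-1), and 'bet' (bool)."""
    errors = [it for it in items if not it["correct"]]
    if not errors:
        return None, None
    L = sum(it["confidence"] for it in errors) / len(errors)
    K = sum(it["bet"] for it in errors) / len(errors)
    return L, K

items = [
    {"correct": True,  "confidence": 0.9, "bet": True},
    {"correct": False, "confidence": 0.8, "bet": True},   # confident error
    {"correct": False, "confidence": 0.3, "bet": False},
]
print(validity_indices(items))  # -> (0.55, 0.5)
```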
