Adaptation to free-living drives loss of beneficial endosymbiosis through metabolic trade-offs

Symbioses are widespread (1) and underpin the function of diverse ecosystems (2-6), but their evolutionary stability is challenging to explain (7,8). Fitness trade-offs between contrasting

Gradient-specified optimization based on muscle surface mesh and moment arm as an effect-oriented approach of automated musculotendon path modeling

There is more to musculotendon path modeling than aligning a cable to reflect the geometric features of a muscle-tendon unit. From the perspective of simulation

TREM2 deficiency causes region-specific brain effects in a mouse model of cerebral amyloid angiopathy

Cerebral amyloid angiopathy (CAA), a major vascular contributor to cognitive decline, is present in 85-95% of Alzheimer disease (AD) patients. Despite its high prevalence, the

Frontal Brain Injury Reduces Sensitivity to Reward-Predictive Cues and Remodels the Nucleus Accumbens

Traumatic brain injuries (TBIs) are more than mere lesions and generate a persistent secondary pathology. This, combined with functional reorganization of circuits post-injury, may explain

Highly replicable multisite patterns of adolescent white matter maturation

The Adolescent Brain Cognitive Development (ABCD) Study is the largest U.S.-based neuroimaging initiative of adolescent brain maturation. Diffusion MRI (dMRI) provides unique insights into white

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

April 14, 2026

arXiv:2604.09606v1
Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST characterizes them statistically as stochastic outcomes of repeated inference. We model observed safety failures using Bernoulli and binomial formulations to estimate per-inference failure probabilities, enabling quantitative comparison of operational risk across models and configurations. We apply APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024 derived safety and security prompts. While models exhibit similar performance under conventional single- or very-low-sample evaluation (N <= 3), repeated sampling reveals substantial variation in empirical failure probabilities across temperatures. These results demonstrate that shallow benchmark scores can obscure meaningful differences in reliability under sustained use.
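The Bernoulli and binomial formulation mentioned in the abstract can be illustrated with a short sketch. This is not the paper's implementation: the function names, the choice of a Wilson score interval, and the example counts (3 failures in 200 samples) are assumptions made purely for illustration. Each repeated generation is treated as an independent Bernoulli trial with an unknown failure probability p, so the number of failures in N samples is binomial.

```python
import math


def failure_rate(failures: int, trials: int) -> float:
    """Point estimate of the per-inference failure probability (failures / trials)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")
    return failures / trials


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%).

    Chosen here as an assumption; it behaves better than the normal
    approximation when failures are rare or trials are few.
    """
    p = failure_rate(failures, trials)
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prob_at_least_one_failure(p: float, uses: int) -> float:
    """Probability of one or more failures across `uses` independent inferences."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if uses < 0:
        raise ValueError("uses must be non-negative")
    return 1.0 - (1.0 - p) ** uses


if __name__ == "__main__":
    # Hypothetical counts: 3 unsafe completions out of 200 samples of one prompt.
    failures, trials = 3, 200
    p_hat = failure_rate(failures, trials)
    low, high = wilson_interval(failures, trials)
    print(f"estimated failure probability: {p_hat:.4f}")
    print(f"approx. 95% interval: [{low:.4f}, {high:.4f}]")
    print(f"P(at least one failure in 100 uses): {prob_at_least_one_failure(p_hat, 100):.3f}")
```

The sketch also shows why very-low-sample evaluation can mislead: with N <= 3, a failure probability of a few percent will usually produce zero observed failures, while the chance of at least one failure grows quickly over many uses. The independence assumption is a simplification and may not hold under every sampling configuration.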

