arXiv:2605.19999v1 Announce Type: cross
Abstract: Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., $textitcontaminated$, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be $textitcontamination-resistant$, i.e., $textitunlearnable$, but support $textitinference$. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.
Explainable AI in kidney stone detection and segmentation: a mini review
Kidney stones are one of the most common renal disorders that can produce severe complications if not diagnosed and treated early. Recently, advances in AI