arXiv:2603.23516v1 Announce Type: cross Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, […]
From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians’ Medical Expertise with Lightweight LLM
arXiv:2603.23520v1 Announce Type: cross Abstract: Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians’ knowledge systems are slow to develop and hard to […]
Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
arXiv:2603.23529v1 Announce Type: cross Abstract: Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across the Devanagari, Romi, and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. […]
Large Language Models and Scientific Discourse: Where’s the Intelligence?
arXiv:2603.23543v1 Announce Type: cross Abstract: We explore the capabilities of Large Language Models (LLMs) by comparing the way they gather data with the way humans build knowledge. Here we examine how scientific knowledge is made and compare it with LLMs. The argument is structured by reference to two figures, one representing scientific knowledge and the […]
Emergence of unique hues from sparse coding of color in natural scenes
arXiv:2603.24293v1 Announce Type: new Abstract: Our subjective experience of color is typically described by abstract properties such as hue, saturation, and brightness that do not directly correspond to sensory signals arising from cones in the retina. Along the hue dimension, certain colors — red, green, blue, and yellow — appear unique in that they are […]
From Liar Paradox to Incongruent Sets: A Normal Form for Self-Reference
arXiv:2603.24527v1 Announce Type: new Abstract: We introduce incongruent normal form (INF), a structural representation for self-referential semantic sentences. An INF replaces a self-referential sentence with a finite family of non-self-referential sentences that are individually satisfiable but not jointly satisfiable. This transformation isolates the semantic obstruction created by self-reference while preserving classical semantics locally and is […]
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
arXiv:2603.24582v1 Announce Type: new Abstract: Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally […]
Mitigating Many-Shot Jailbreaking
arXiv:2504.09604v3 Announce Type: cross Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a “fake” assistant responding inappropriately before the final request. With enough examples, the model’s in-context learning abilities override its safety training, […]
Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
arXiv:2603.23506v1 Announce Type: cross Abstract: The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing […]
Internal Safety Collapse in Frontier Large Language Models
arXiv:2603.23509v1 Announce Type: cross Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers […]
DISCO: Document Intelligence Suite for COmparative Evaluation
arXiv:2603.23511v1 Announce Type: cross Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, […]
Berta: an open-source, modular tool for AI-enabled clinical documentation
arXiv:2603.23513v1 Announce Type: cross Abstract: Commercial AI scribes cost $99–$600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within […]