OpenAI is throwing everything into building a fully automated researcher

OpenAI is refocusing its research efforts and throwing its resources into a new grand challenge. The San Francisco firm has set its sights on building

Mind-altering substances are (still) falling short in clinical trials

This week I want to look at where we are with psychedelics, the mind-altering substances that have somehow made the leap from counterculture to major

Intellectual Stewardship: Re-adapting Human Minds for Creative Knowledge Work in the Age of AI

arXiv:2603.18117v1 Announce Type: cross Abstract: Background: Amid the opportunities and risks introduced by generative AI, learning research needs to envision how human minds and responsibilities

MCP-38: A Comprehensive Threat Taxonomy for Model Context Protocol Systems (v1.0)

arXiv:2603.18063v1 Announce Type: cross Abstract: The Model Context Protocol (MCP) introduces a structurally distinct attack surface that existing threat frameworks, designed for traditional software systems

Agentic LLM Framework for Adaptive Decision Discourse

arXiv:2502.10978v2 Announce Type: replace Abstract: Effective decision-making in complex systems requires synthesizing diverse perspectives to address multifaceted challenges under uncertainty. This study introduces an agentic

The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

March 20, 2026

arXiv:2603.18294v1 Announce Type: new
Abstract: Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the “patient” or “query” populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.
Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.
Results: We identified a structural “validity gap.” While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs.
Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling–analogous to clinical trial reporting–to align evaluation with the full complexity of clinical practice.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844