Analysis of intellectual property strategies across different categories of digital therapeutics

Advances in digital technology and the coronavirus disease (COVID-19) pandemic have accelerated the digital transformation of healthcare. Digital therapeutics (DTx), which deliver evidence-based interventions through

Correction: Artificial intelligence assessment of valvular disease and ventricular function by a single echocardiography view

Post Content

Comparative performance of ChatGPT-5 and DeepSeek on the Chinese ultrasound medicine senior professional title examination

BackgroundLarge language models (LLMs) have shown growing potential for medical education and assessment, but evidence on their performance in specialty certification exams in China—particularly in

Depression detection using deep learning and large language models from multimodalities

Depression is a complex psychiatric disorder that affects neural functioning, cognition, emotion, and behavior, making objective assessment a persistent clinical challenge. Traditional diagnostic methods depend

Editorial: Ethical considerations of large language models: challenges and best practices

Post Content

AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

March 16, 2026

arXiv:2603.12564v1 Announce Type: cross
Abstract: Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844