From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines

arXiv:2512.11724v2 Announce Type: replace-cross Abstract: While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: […]

The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems

arXiv:2512.15068v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of […]

How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

arXiv:2512.15115v1 Announce Type: cross Abstract: Sequence modeling has produced diverse architectures — from classical recurrent neural networks to modern Transformers and state space models (SSMs) — yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective […]

Scaling Causal Mediation for Complex Systems: A Framework for Root Cause Analysis

arXiv:2512.14764v1 Announce Type: cross Abstract: Modern operational systems, ranging from logistics and cloud infrastructure to industrial IoT, are governed by complex, interdependent processes. Understanding how interventions propagate through such systems requires causal inference methods that go beyond direct effects to quantify mediated pathways. Traditional mediation analysis, while effective in simple settings, fails to scale to […]

IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

arXiv:2512.15635v1 Announce Type: cross Abstract: We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, […]

Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

arXiv:2512.15249v1 Announce Type: cross Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness […]

Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

arXiv:2512.10422v3 Announce Type: replace-cross Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and […]

Leveraging Foundational Models and Simple Fusion for Multi-modal Physiological Signal Analysis

arXiv:2512.15250v1 Announce Type: cross Abstract: Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data and modality-specific differences. In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to […]

VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments

arXiv:2512.15258v1 Announce Type: cross Abstract: This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, […]

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK, registration number 16808844.