arXiv:2604.05605v1 Announce Type: cross Abstract: Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organization estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain […]
From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems
arXiv:2604.05674v1 Announce Type: cross Abstract: Cyber-physical systems often contend with incomplete architectural documentation or outdated information resulting from legacy technologies, knowledge management gaps, and the complexity of integrating diverse subsystems over extended operational lifecycles. This architectural incompleteness impedes reliable security assessment, as inaccurate or missing architectural knowledge limits the identification of system dependencies, attack surfaces, […]
What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Don’t Know”
arXiv:2604.05779v1 Announce Type: cross Abstract: While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal […]
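The abstract's core idea — estimate per-instance knowledge by sampling the model several times, then scale the fine-tuning signal by that score — can be illustrated with a minimal sketch. The scorer, the exact-match criterion, and the linear weighting rule below are assumptions for illustration, not the paper's implementation:

```python
import random

def knowledge_score(model, question, reference, n_samples=8):
    """Multi-sampled inference: estimate instance-level knowledge as the
    fraction of sampled answers that match the reference answer."""
    hits = sum(1 for _ in range(n_samples) if model(question) == reference)
    return hits / n_samples

def weighted_loss(base_loss, score):
    """Scale the learning signal by the knowledge score: well-known
    instances keep a strong gradient, poorly-known ones are down-weighted
    so the model can instead learn to abstain ("I don't know")."""
    return score * base_loss

# Toy stand-in for an LLM that answers correctly ~75% of the time.
random.seed(0)
toy_model = lambda q: "Paris" if random.random() < 0.75 else "Lyon"
score = knowledge_score(toy_model, "Capital of France?", "Paris")
scaled = weighted_loss(2.0, score)
```

In a real fine-tuning loop the score would be computed once per training instance before training and used to reweight that instance's cross-entropy loss.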
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
arXiv:2604.05906v1 Announce Type: cross Abstract: Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target […]
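The selective-aggregation idea — keep only the attention heads most relevant to a target, rather than averaging all heads uniformly — can be sketched as follows. The mass-on-target relevance criterion and the top-k averaging here are illustrative assumptions; the paper's actual selection rule may differ:

```python
def head_relevance(attn_map, target_mask):
    """Score one head's cross-attention map by the fraction of its
    attention mass that falls inside the target region."""
    inside = sum(a for a, m in zip(attn_map, target_mask) if m)
    total = sum(attn_map)
    return inside / total if total else 0.0

def selective_aggregate(head_maps, target_mask, top_k=2):
    """Average only the top-k most target-relevant heads, discarding
    background or diffuse heads that would blur the aggregate map."""
    ranked = sorted(head_maps,
                    key=lambda h: head_relevance(h, target_mask),
                    reverse=True)
    chosen = ranked[:top_k]
    return [sum(vals) / len(chosen) for vals in zip(*chosen)]

# Toy example: 4 heads over 4 spatial locations; mask marks locations 0-1.
heads = [
    [0.70, 0.20, 0.05, 0.05],  # focused on the target
    [0.60, 0.30, 0.05, 0.05],  # focused on the target
    [0.10, 0.10, 0.40, 0.40],  # background head
    [0.25, 0.25, 0.25, 0.25],  # diffuse head
]
mask = [1, 1, 0, 0]
agg = selective_aggregate(heads, mask, top_k=2)
```

With top_k=2 the two target-focused heads are kept and the aggregate map stays sharply concentrated on the masked region, whereas a uniform average over all four heads would dilute it.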
LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering
arXiv:2604.06095v1 Announce Type: cross Abstract: Code decompilation analysis is a fundamental yet challenging task in malware reverse engineering, particularly due to the pervasive use of sophisticated obfuscation techniques. Although recent large language models (LLMs) have shown promise in translating low-level representations into high-level source code, most existing approaches rely on generic code pretraining and lack […]
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
arXiv:2604.06156v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format […]
Beyond Syntax: Action Semantics Learning for App Agents
arXiv:2506.17697v3 Announce Type: replace Abstract: The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions built on proprietary LLM APIs show promise, they incur heavy compute costs and external API dependency. Fine-tuning smaller […]
TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering
arXiv:2510.07432v2 Announce Type: replace Abstract: Large language models (LLMs) exhibit strong symbolic and compositional reasoning, yet they struggle with time series question answering as the data is typically transformed into an LLM-compatible modality, e.g., serialized text, plotted images, or compressed time series embeddings. Such conversions impose representation bottlenecks, often require cross-modal alignment or finetuning, and […]
Emergent Introspection in AI is Content-Agnostic
arXiv:2603.05414v2 Announce Type: replace Abstract: Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study the mechanism of this introspection. We first extensively replicate the thought-injection detection paradigm of Lindsey (2025) in large open-source models. We show that introspection in these models […]
Gradual Cognitive Externalization: From Modeling Cognition to Constituting It
arXiv:2604.04387v2 Announce Type: replace Abstract: Developers are publishing AI agent skills that replicate a colleague’s communication style, encode a supervisor’s mentoring heuristics, or preserve a person’s behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that ambient AI systems, through sustained causal coupling with users, transition from […]
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
arXiv:2410.20791v3 Announce Type: replace-cross Abstract: The rapid expansion of foundation models (FMs), such as large language models (LLMs), has given rise to FMware, software systems that integrate FM(s) as core components. While building demonstration-level FMware is relatively straightforward, transitioning to production-ready systems presents numerous challenges, including reliability, high implementation costs, scalability, and compliance with privacy […]
NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge
arXiv:2504.05995v3 Announce Type: replace-cross Abstract: The rapid progress of large language models (LLMs) raises concerns about cultural bias, fairness, and performance in diverse languages and underrepresented regions. Addressing these gaps requires large-scale resources grounded in multilingual, local, and cultural contexts. We systematize and extend the earlier NativQA framework to multimodality by adding image, audio, and […]