arXiv:2406.14294v4 Announce Type: replace-cross Abstract: Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and […]
Protecting Bystander Privacy via Selective Hearing in Audio LLMs
arXiv:2512.06380v3 Announce Type: replace-cross Abstract: Audio large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences do not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model’s ability to attend to an […]
GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations
arXiv:2503.16683v2 Announce Type: replace-cross Abstract: Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing […]
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
arXiv:2604.19018v1 Announce Type: cross Abstract: Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically […]
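The "open-loop" intervention this abstract contrasts against can be illustrated with a minimal generic sketch: a fixed steering vector added to a hidden state, with no feedback on how the perturbation propagates. This is an illustration of the general activation-steering idea, not the paper's controller; all names and the toy dimensions are hypothetical.

```python
import numpy as np

def steer(hidden, direction, alpha=4.0):
    """Open-loop steering: shift a hidden-state vector a fixed
    distance alpha along a normalized steering direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # toy hidden state
d = rng.normal(size=8)          # toy steering direction
h_steered = steer(h, d)

# The state moves exactly alpha units along the direction,
# regardless of what later layers do with it (no error feedback).
shift = float((h_steered - h) @ (d / np.linalg.norm(d)))
```

A closed-loop, model-based controller of the kind the abstract proposes would instead choose the intervention per step using a (locally linear) model of how the perturbation propagates.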
Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph
arXiv:2604.18883v1 Announce Type: cross Abstract: Current AI-assisted programming tools are predominantly linear and chat-based, which deviates from the iterative and branching nature of programming itself. Our preliminary study with developers using AI assistants suggested that they often struggle to explore alternatives, manage prompting sequences, and trace changes. Informed by these insights, we created EvoGraph, an […]
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
arXiv:2604.18955v1 Announce Type: cross Abstract: In this study, we present the first comprehensive evaluation of modern LLMs – including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT – across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) […]
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv:2604.19533v1 Announce Type: cross Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps […]
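The task as described, recovering the exact timestamps of malicious events from raw logs, admits a natural precision/recall scoring over timestamps. The sketch below is a generic illustration of that kind of metric, assuming a match tolerance in seconds; it is not the benchmark's actual scorer, and all names are hypothetical.

```python
def score_detections(predicted, ground_truth, tolerance=1.0):
    """Precision/recall over predicted event timestamps (seconds).
    Each ground-truth event may be matched by at most one prediction
    within the given tolerance."""
    matched = set()
    tp = 0
    for p in predicted:
        for i, g in enumerate(ground_truth):
            if i not in matched and abs(p - g) <= tolerance:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Two of three predictions land within 0.5 s of a true event.
p, r = score_detections([10.2, 55.0, 99.9], [10.0, 100.0], tolerance=0.5)
```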
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
arXiv:2604.18835v1 Announce Type: cross Abstract: We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, […]
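The needle-in-a-haystack construction the abstract describes, embedding one semantically perturbed sentence inside otherwise identical context, can be sketched generically. The perturbation rules below (a crude negation and a conjunction swap) are illustrative stand-ins for the paper's perturbation types; all helper names are hypothetical.

```python
def negate(sentence):
    """Crude illustrative negation: flip the first 'is' to 'is not'."""
    return sentence.replace(" is ", " is not ", 1)

def conjunction_swap(sentence):
    """Illustrative conjunction swap: first 'and' becomes 'or'."""
    return sentence.replace(" and ", " or ", 1)

def embed_needle(hay, needle, position):
    """Insert the perturbed sentence (needle) into the context (hay)."""
    doc = hay[:position] + [needle] + hay[position:]
    return " ".join(doc)

hay = ["Background sentence one.", "Background sentence two."]
needle = "The treatment is effective and safe."
doc = embed_needle(hay, negate(needle), 1)
```

A judge model would then score `doc` against the unperturbed document; sensitivity is how reliably the score drops when only the needle changes.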
Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning
arXiv:2604.19186v1 Announce Type: cross Abstract: Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this […]
Evaluation-driven Scaling for Scientific Discovery
arXiv:2604.19341v1 Announce Type: cross Abstract: Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the […]
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
arXiv:2604.18164v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and […]