arXiv:2604.16756v2 Announce Type: replace-cross Abstract: Prompt-induced cognitive biases are changes in a general-purpose AI (GPAI) system’s decisions caused solely by biased wording in the input (e.g., framing, anchors), not by task logic. In software engineering (SE) decision support, where problem statements and requirements are written in natural language, small phrasing shifts (e.g., popularity hints or outcome reveals) can […]
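The idea of bias injected purely through wording can be illustrated with a toy sketch (our own construction, not the paper's protocol): the same SE decision task is wrapped in different bias cues, so any change in a model's answer is attributable to phrasing alone. All cue names and the task text below are hypothetical.

```python
# Hypothetical sketch: bias-injected prompt variants for one SE decision task.
# Only the surrounding wording differs; the task logic is identical.
BASE_TASK = "Should we adopt library X for JSON parsing in our service?"

BIAS_CUES = {
    "neutral": "{task}",
    "popularity": "Most teams in our company already use X. {task}",
    "anchor": "A senior engineer estimated a 90% success rate. {task}",
    "negative_framing": "Adopting X risks breaking 1 in 10 builds. {task}",
}

def make_variants(task: str) -> dict:
    """Return one prompt per bias cue; only the wording differs."""
    return {name: tpl.format(task=task) for name, tpl in BIAS_CUES.items()}

variants = make_variants(BASE_TASK)
# Every variant embeds the same underlying task statement.
assert all(BASE_TASK in v for v in variants.values())
```

Comparing a model's decisions across such variants is one minimal way to measure prompt-induced bias.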
CHORUS: An Agentic Framework for Generating Realistic Deliberation Data
arXiv:2604.20651v1 Announce Type: new Abstract: Understanding the intricate dynamics of online discourse depends on large-scale deliberation data, a resource that remains scarce across interactive web platforms due to restrictive accessibility policies, ethical concerns, and inconsistent data quality. In this paper, we propose Chorus, an agentic framework that orchestrates LLM-powered actors with behaviorally consistent personas to […]
Onyx: Cost-Efficient Disk-Oblivious ANN Search
arXiv:2604.20401v1 Announce Type: cross Abstract: Approximate nearest neighbor (ANN) search in AI systems increasingly handles sensitive data on third-party infrastructure. Trusted execution environments (TEEs) offer protection, but cost-efficient deployments must rely on external SSDs, which leak user queries to the host through disk access patterns. Oblivious RAM (ORAM) can hide these access patterns but at […]
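The leakage the abstract refers to can be shown with a toy model (not the paper's design): a host observing which blocks a query touches learns the query, while the trivially oblivious baseline of scanning every block produces an identical trace for all queries, at linear cost per query. ORAM schemes aim for the latter's secrecy without its cost.

```python
# Toy illustration of disk access-pattern leakage vs. an oblivious baseline.
NUM_BLOCKS = 8

def naive_read(query_block: int, trace: list) -> int:
    """Read only the needed block; the host sees which block was touched."""
    trace.append(query_block)
    return query_block

def linear_scan_read(query_block: int, trace: list) -> int:
    """Read every block regardless of the query: the trace is identical
    for all queries, so it reveals nothing (but costs O(N) per query)."""
    result = -1
    for b in range(NUM_BLOCKS):
        trace.append(b)
        if b == query_block:
            result = b
    return result

t1, t2 = [], []
naive_read(3, t1)
naive_read(5, t2)
assert t1 != t2   # naive traces differ -> the query leaks

t3, t4 = [], []
linear_scan_read(3, t3)
linear_scan_read(5, t4)
assert t3 == t4   # oblivious traces are indistinguishable
```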
SWE-chat: Coding Agent Interactions From Real Users in the Wild
arXiv:2604.20779v1 Announce Type: new Abstract: AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset […]
Explainable Iterative Data Visualisation Refinement via an LLM Agent
arXiv:2604.15319v2 Announce Type: replace-cross Abstract: Exploratory analysis of high-dimensional data relies on embedding the data into a low-dimensional space (typically 2D or 3D), from which a visualization plot is produced to uncover meaningful structures and to communicate geometric and distributional data characteristics. However, finding a suitable algorithm configuration, particularly hyperparameter settings, to produce a visualization […]
Transparent Screening for LLM Inference and Training Impacts
arXiv:2604.19757v1 Announce Type: cross Abstract: This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environmental estimates and supports a comparative online observatory of current market models. Rather than claiming direct measurement for opaque proprietary services, […]
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
arXiv:2604.20389v1 Announce Type: cross Abstract: The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information […]
What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review
arXiv:2604.19998v1 Announce Type: new Abstract: Evaluating AI-generated reviews by verdict agreement is widely recognized as insufficient, yet current alternatives rarely audit which concerns a system identifies, how it prioritizes them, or whether those priorities align with the review rationale that shaped the final assessment. We propose concern alignment, a diagnostic framework that evaluates AI reviews […]
Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
arXiv:2604.15153v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in […]
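One way to picture merging in the latent embedding space, as opposed to dropping tokens in token space, is the following minimal sketch (our own reading of the idea, not the paper's exact method): each group of $K$ consecutive token embeddings is mean-pooled into a single latent token, so a length-$L$ sequence shrinks to roughly $L/K$.

```python
# Minimal sketch: mean-pool every k consecutive token embeddings into one,
# reducing sequence length in the latent space (hypothetical, simplified).
def k_token_merge(embeddings: list, k: int) -> list:
    """Merge each run of k consecutive embedding vectors by averaging."""
    merged = []
    for i in range(0, len(embeddings), k):
        group = embeddings[i:i + k]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group)
                       for d in range(dim)])
    return merged

# 6 tokens of dimension 2, merged with k=2 -> 3 latent tokens.
seq = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
out = k_token_merge(seq, 2)
assert len(out) == 3
assert out[0] == [2.0, 0.0]
```

A learned merging scheme would replace the uniform average with data-dependent weights, but the length-reduction mechanics are the same.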
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
arXiv:2604.20133v1 Announce Type: new Abstract: This paper proposes EvoAgent – an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user-feedback-driven closed-loop […]
AI models of unstable flow exhibit hallucination
arXiv:2604.20372v1 Announce Type: cross Abstract: We report the first systematic evidence of hallucination in AI models of fluid dynamics, demonstrated in the canonical problem of hydrodynamically unstable transport known as viscous fingering. AI-based modeling of flow with instabilities remains challenging because rapidly evolving, multiscale fingering patterns are difficult to resolve accurately. We identify solutions that […]
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
arXiv:2604.20261v1 Announce Type: new Abstract: Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches […]
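The "predefined operator library" baseline the abstract contrasts with can be sketched as follows (a hypothetical toy, with made-up operator and column names): new features are produced by exhaustively applying fixed unary and binary operators to numeric columns, with no access to task semantics.

```python
# Sketch of operator-library feature generation on a tiny tabular dataset.
import math

UNARY_OPS = {"log1p": lambda x: math.log1p(abs(x)), "square": lambda x: x * x}
BINARY_OPS = {"ratio": lambda a, b: a / b if b != 0 else 0.0}

def generate_features(table: dict) -> dict:
    """Apply every predefined operator to every (pair of) numeric column(s)."""
    cols = list(table)
    feats = {}
    for c in cols:
        for name, op in UNARY_OPS.items():
            feats[f"{name}({c})"] = [op(v) for v in table[c]]
    for a in cols:
        for b in cols:
            if a != b:
                feats[f"ratio({a},{b})"] = [BINARY_OPS["ratio"](x, y)
                                            for x, y in zip(table[a], table[b])]
    return feats

data = {"age": [20.0, 40.0], "income": [10.0, 20.0]}
feats = generate_features(data)
assert feats["square(age)"] == [400.0, 1600.0]
assert feats["ratio(income,age)"] == [0.5, 0.5]
```

Because the operator set is fixed, such a baseline cannot exploit what a column means, which is the gap the LLM-based approaches in the abstract target.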