arXiv:2605.21778v1 Announce Type: new Abstract: AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user’s false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the […]
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
arXiv:2603.27355v2 Announce Type: replace Abstract: We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted […]
Drivers of Transient Dynamics and Persistence in Dengue: Insights from Sensitivity and Stochastic Modeling
arXiv:2605.21787v1 Announce Type: new Abstract: We investigate how key epidemiological parameters shape both seasonal epidemics and the persistence of dengue transmission. Our findings confirm known mechanistic drivers of epidemic variability and introduce a ranking of parameter importance in our dengue model, which in turn informs the prioritization of public health policies. We propose a stochastic […]
Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
arXiv:2605.21810v1 Announce Type: new Abstract: Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model […]
DecepChain: Inducing Deceptive Reasoning in Large Language Models
arXiv:2510.00319v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capability through their chains of thought (CoT), which humans routinely use to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored phenomenon: whether LLMs could generate incorrect yet coherent CoTs […]
Implicit Safety Alignment from Crowd Preferences
arXiv:2605.21822v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim […]
AutoBaxBuilder: Bootstrapping Code Security Benchmarking
arXiv:2512.21132v2 Announce Type: replace-cross Abstract: As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized […]
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks
arXiv:2605.21825v1 Announce Type: new Abstract: The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain, ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a […]
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
arXiv:2602.13294v3 Announce Type: replace-cross Abstract: Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based […]
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
arXiv:2605.21832v1 Announce Type: new Abstract: Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains […]
Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks
arXiv:2604.12325v3 Announce Type: replace-cross Abstract: We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of […]
PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference
arXiv:2605.21859v1 Announce Type: new Abstract: Phylogenetic trees are hybrid objects: branch lengths vary continuously, while topologies change discretely through edge contractions and expansions. Billera-Holmes-Vogtmann (BHV) tree space provides a canonical geometry for this structure, representing each resolved topology as a Euclidean orthant and topological changes as motion across shared lower-dimensional boundaries. We introduce PhylaFlow, a […]