arXiv:2601.19956v1 Announce Type: cross Abstract: As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user’s confidential schedule to another, a privacy failure […]
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
arXiv:2509.22258v4 Announce Type: replace-cross Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet […]
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference
arXiv:2601.19961v1 Announce Type: cross Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached […]
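The truncated abstract contrasts caching instantaneous velocity with an average-velocity view. As a minimal sketch in our own notation (not necessarily the paper's), the average velocity of a flow over an interval $[t, s]$ is

    \bar{v}(x_t, t, s) = \frac{1}{s - t} \int_t^s v(x_\tau, \tau)\, d\tau, \qquad x_s = x_t + (s - t)\,\bar{v}(x_t, t, s),

so a single large step taken with the average velocity follows the true trajectory exactly, whereas reusing a cached instantaneous velocity $v(x_t, t)$ over the same interval is exact only if the trajectory is straight; the gap grows with curvature, which is the error accumulation the abstract attributes to high acceleration ratios.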
Can Continuous-Time Diffusion Models Generate and Solve Globally Constrained Discrete Problems? A Study on Sudoku
arXiv:2601.20363v1 Announce Type: cross Abstract: Can standard continuous-time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow-matching and score-based models along a Gaussian probability path […]
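The exact encoding is cut off above; as a hedged illustration of one standard continuous relaxation (one-hot digit channels in [0,1]^(9x9x9), rounded back by argmax), consider:

    import numpy as np

    def encode(grid):
        """Embed a 9x9 grid of digits 1..9 as a point in [0,1]^(9x9x9) via one-hot channels."""
        x = np.zeros((9, 9, 9), dtype=np.float64)
        for r in range(9):
            for c in range(9):
                x[r, c, grid[r][c] - 1] = 1.0
        return x

    def decode(x):
        """Round a relaxed sample back to the nearest discrete grid (argmax over channels)."""
        return np.argmax(x, axis=-1) + 1

    def is_valid_sudoku(grid):
        """Global constraints: every row, column and 3x3 box is a permutation of 1..9."""
        full = set(range(1, 10))
        rows = all(set(grid[r, :]) == full for r in range(9))
        cols = all(set(grid[:, c]) == full for c in range(9))
        boxes = all(set(grid[3*i:3*i+3, 3*j:3*j+3].ravel()) == full
                    for i in range(3) for j in range(3))
        return rows and cols and boxes

Of the 9^81 discrete fillings of the cube, only about 6.67 x 10^21 are valid completed Sudoku grids, which is what makes the target support extremely sparse and globally constrained for a flow-matching or score-based model trained on such encodings.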
On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents
arXiv:2601.20404v1 Announce Type: cross Abstract: AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect the operational efficiency of these agents. In this paper, we study the impact of AGENTS.md files on the runtime and token consumption of […]
Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation
arXiv:2601.19941v1 Announce Type: cross Abstract: In the last two years, large language models (LLMs) have shown strong capabilities in code generation, including hardware design at register-transfer level (RTL). While their use in high-level synthesis (HLS) remains comparatively less mature, the ratio of HLS- to RTL-focused studies has shifted from 1:10 to 2:10 in the past six […]
Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding
arXiv:2601.19929v1 Announce Type: cross Abstract: We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 Frontier models show 94 to 97% […]
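TREEFRAG itself is not described in the truncated abstract. As a generic illustration of hierarchical, tree-based context compression (our sketch, not the paper's method), one can keep only a directory -> file -> symbol skeleton in the prompt and expand individual nodes on demand:

    import ast
    import os
    from pathlib import Path

    def file_skeleton(path):
        """Compress one Python file to its top-level signatures plus first docstring lines."""
        tree = ast.parse(Path(path).read_text(encoding="utf-8"))
        lines = []
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                doc = ast.get_docstring(node)
                summary = f"  # {doc.splitlines()[0]}" if doc else ""
                lines.append(f"{type(node).__name__} {node.name}{summary}")
        return lines

    def repo_skeleton(root):
        """Build a directory -> file -> symbol tree; only this skeleton enters the LLM context."""
        skeleton = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                if name.endswith(".py"):
                    path = os.path.join(dirpath, name)
                    try:
                        skeleton[os.path.relpath(path, root)] = file_skeleton(path)
                    except SyntaxError:
                        continue
        return skeleton

Expanding a node's full body only when the model requests it is the usual way such schemes trade a cheap extra round trip for an order-of-magnitude smaller prompt.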
Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
arXiv:2601.19935v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent’s ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory […]
Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication
arXiv:2601.19915v1 Announce Type: cross Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and […]
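For readers unfamiliar with the logical vocabulary: a left-nested implication chain over a prefix $t_1 t_2 \dots t_n$ has the shape

    \bigl(\cdots\bigl((t_1 \to t_2) \to t_3\bigr) \to \cdots\bigr) \to t_n,

and, unlike a commutative sum of embeddings, reordering the $t_i$ yields a different formula, which is how such a chain can preserve word order structurally. Modus ponens is the rule "from $A$ and $A \to B$, conclude $B$"; how the model maps candidate next tokens onto this rule is not recoverable from the truncated abstract, so the formula above should be read as background notation rather than the paper's exact construction.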
Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation
arXiv:2601.19923v1 Announce Type: cross Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as […]
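The dual-track idea is only named before the abstract is cut off. As a generic illustration (our own, not Table-BiEval's metric) of scoring structural fidelity and content fidelity as two independent signals for JSON output:

    import json

    def structure_signature(obj):
        """Recursively strip values, keeping only keys and container shapes."""
        if isinstance(obj, dict):
            return {k: structure_signature(v) for k, v in sorted(obj.items())}
        if isinstance(obj, list):
            return [structure_signature(v) for v in obj]
        return type(obj).__name__  # leaf: record only the value's type

    def leaf_values(obj, prefix=""):
        """Flatten an object to {path: value} pairs for content comparison."""
        if isinstance(obj, dict):
            out = {}
            for k, v in obj.items():
                out.update(leaf_values(v, f"{prefix}/{k}"))
            return out
        if isinstance(obj, list):
            out = {}
            for i, v in enumerate(obj):
                out.update(leaf_values(v, f"{prefix}[{i}]"))
            return out
        return {prefix: obj}

    def dual_track_score(predicted_json, reference_json):
        """Return (structure_match, content_accuracy) as two decoupled signals."""
        pred, ref = json.loads(predicted_json), json.loads(reference_json)
        structure_match = structure_signature(pred) == structure_signature(ref)
        ref_leaves, pred_leaves = leaf_values(ref), leaf_values(pred)
        hits = sum(pred_leaves.get(path) == value for path, value in ref_leaves.items())
        return structure_match, hits / max(len(ref_leaves), 1)

Keeping the two scores separate makes it possible to tell a model that emits the right schema with wrong values apart from one that garbles the structure itself.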
Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)
arXiv:2601.20843v1 Announce Type: new Abstract: This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD-level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process […]
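Only the component names survive the truncation; a heavily hedged sketch of what a loop combining sequential reflection with candidate crossover could look like (our paraphrase, with a hypothetical llm callable, not the paper's implementation):

    def deep_research(topic, llm, n_candidates=4, n_rounds=3):
        """Hypothetical scaffold: refine a plan by reflection, then cross candidate reports."""
        plan = llm(f"Draft a research plan for: {topic}")
        for _ in range(n_rounds):  # sequential plan refinement via reflection
            critique = llm(f"Critique this plan for gaps and redundancy:\n{plan}")
            plan = llm(f"Revise the plan to address the critique.\nPlan:\n{plan}\nCritique:\n{critique}")
        candidates = [llm(f"Write a report following this plan:\n{plan}") for _ in range(n_candidates)]
        # candidates crossover: merge the strongest sections of different drafts into one report
        return llm("Combine the best-supported sections of these drafts into a single report:\n\n"
                   + "\n\n---\n\n".join(candidates))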
GTAC: A Generative Transformer for Approximate Circuits
arXiv:2601.19906v1 Announce Type: cross Abstract: Targeting error-tolerant applications, approximate circuits introduce controlled errors to significantly improve performance, power, and area (PPA) of circuits. In this work, we introduce GTAC, a novel generative Transformer-based model for producing approximate circuits. By leveraging principles of approximate computing and AI-driven EDA, our model innovatively integrates error thresholds into the […]