arXiv:2601.19922v1 Announce Type: cross Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We […]
OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
arXiv:2601.19924v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, […]
The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models
arXiv:2601.19926v1 Announce Type: cross Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus […]
SDU's DAISY: A Benchmark for Danish Culture
arXiv:2601.19930v1 Announce Type: cross Abstract: We introduce a new benchmark for Danish culture via cultural heritage, Daisy, based on the curated topics from the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy within […]
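As a rough illustration of the generation loop the abstract sketches, the snippet below fetches a canon artifact's Wikipedia summary and asks a language model for questions grounded in it. The Wikipedia endpoint usage is real, but the model name, prompt wording, and artifact title are illustrative assumptions, not the paper's actual setup, and the paper's sampling strategy is truncated here.

```python
# Minimal sketch of a DAISY-style question-generation loop.
# Assumptions: prompt wording, model choice, and the example artifact.
import requests
from openai import OpenAI

client = OpenAI()

def wikipedia_summary(title: str, lang: str = "da") -> str:
    """Fetch the lead summary of a Wikipedia page via the REST API."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def generate_questions(artifact_title: str, n: int = 5) -> list[str]:
    """Ask a language model for n questions grounded in the page text."""
    context = wikipedia_summary(artifact_title)
    prompt = (
        f"Based only on the following text about '{artifact_title}', "
        f"write {n} quiz questions with short answers.\n\n{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.splitlines()

# Example: one artifact plausibly in the Danish Culture Canon.
print(generate_questions("Den_lille_Havfrue"))
```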
Quantifying non-deterministic drift in large language models
arXiv:2601.19934v1 Announce Type: cross Abstract: Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output […]
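The measurement is simple to reproduce in spirit: issue an identical prompt repeatedly with fixed decoding parameters and score how often the outputs diverge. The sketch below assumes an OpenAI-style chat API and uses pairwise exact-match disagreement as a stand-in metric, since the paper's own drift definition is truncated in the abstract.

```python
# A minimal repeated-run drift measurement, assuming an OpenAI-style API.
# Pairwise exact-match disagreement is an illustrative stand-in metric.
from itertools import combinations
from openai import OpenAI

client = OpenAI()

def sample_outputs(prompt: str, n_runs: int = 20) -> list[str]:
    """Issue the identical prompt n_runs times with fixed decoding parameters."""
    outputs = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.0,
            seed=0,  # even a fixed seed does not guarantee determinism
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

def drift_rate(outputs: list[str]) -> float:
    """Fraction of output pairs that differ; 0.0 means fully deterministic."""
    pairs = list(combinations(outputs, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

outputs = sample_outputs("Summarise the causes of the 2008 financial crisis.")
print(f"pairwise disagreement: {drift_rate(outputs):.2%}")
```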
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
arXiv:2601.19936v1 Announce Type: cross Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the divergence from the model’s top-1 prediction and local correlation between adjacent tokens. In […]
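One plausible reading of the method is a per-token gap between the model's top-1 log-probability and the observed token's log-probability, aggregated over the most divergent K% of tokens (by analogy with Min-K% Prob). The sketch below implements that reading with HuggingFace transformers; the aggregation rule and the choice of GPT-2 as a stand-in target model are assumptions, since the abstract is truncated before the definition.

```python
# A top-1-gap score in the spirit of Gap-K%, under assumed aggregation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def gap_k_score(text: str, k: float = 0.2) -> float:
    """Mean gap between top-1 and observed log-probs over the largest-gap K%."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits           # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    top1_lp = logprobs.max(dim=-1).values
    gaps = top1_lp - token_lp                # 0 where top-1 matches the text
    n = max(1, int(k * gaps.numel()))
    return gaps.topk(n).values.mean().item() # large score: likely unseen text

print(gap_k_score("The quick brown fox jumps over the lazy dog."))
```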
Continuous-Flow Data-Rate-Aware CNN Inference on FPGA
arXiv:2601.19940v1 Announce Type: cross Abstract: Among hardware accelerators for deep-learning inference, dataflow implementations offer low latency and high throughput. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, […]
Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models
arXiv:2601.20510v1 Announce Type: cross Abstract: Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures, respectively. A corpus of 12,000 synthetic audio samples was generated using […]
Self Voice Conversion as an Attack against Neural Audio Watermarking
arXiv:2601.20432v1 Announce Type: cross Abstract: Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the […]
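The robustness protocol implied here has a standard shape: embed a payload, apply the attack, run detection, and score payload recovery. The sketch below shows only that harness shape; the embedder, detector, and voice-conversion step are explicitly stubbed placeholders, not any real system's API.

```python
# Harness shape for watermark robustness evaluation. All three model
# functions are hypothetical no-op stubs standing in for real systems.
import numpy as np

rng = np.random.default_rng(0)

def embed_watermark(audio: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Hypothetical stub: a real system hides `bits` imperceptibly."""
    return audio

def voice_convert(audio: np.ndarray) -> np.ndarray:
    """Hypothetical stub for self voice conversion (same speaker in/out)."""
    return audio + 0.01 * rng.standard_normal(audio.shape)

def detect_watermark(audio: np.ndarray, n_bits: int) -> np.ndarray:
    """Hypothetical stub decoder; a real detector recovers the payload."""
    return rng.integers(0, 2, size=n_bits)

def bit_accuracy(sent: np.ndarray, recovered: np.ndarray) -> float:
    """Fraction of payload bits recovered after the attack."""
    return float(np.mean(sent == recovered))

payload = rng.integers(0, 2, size=32)        # 32-bit watermark payload
audio = rng.standard_normal(16000)           # 1 s of placeholder audio
attacked = voice_convert(embed_watermark(audio, payload))
print(bit_accuracy(payload, detect_watermark(attacked, len(payload))))
```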
Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
arXiv:2510.14616v2 Announce Type: replace-cross Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and […]
Open-Vocabulary Functional 3D Human-Scene Interaction Generation
arXiv:2601.20835v1 Announce Type: cross Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing […]
Reinforcement Learning via Self-Distillation
arXiv:2601.20802v1 Announce Type: cross Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such […]
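The bottleneck the abstract names is easy to make concrete: a verifiable code environment runs an attempt against tests and collapses the outcome to a single pass/fail scalar, even though the traceback it produced is far richer. The harness below is an illustrative sketch under that reading, not the paper's method; the test format is an assumption.

```python
# Scalar-outcome verifiable reward for a code attempt: run the attempt
# plus its tests in a subprocess, return 0/1, and note the textual
# feedback that scalar-only RLVR discards. Illustrative harness only.
import subprocess
import tempfile

def verifiable_reward(candidate_code: str, test_code: str) -> tuple[float, str]:
    """Run candidate + tests; return (scalar reward, raw textual feedback)."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    proc = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=30
    )
    reward = 1.0 if proc.returncode == 0 else 0.0
    feedback = proc.stderr  # rich signal a scalar-only method throws away
    return reward, feedback

attempt = "def add(a, b):\n    return a - b  # buggy attempt"
tests = "assert add(2, 3) == 5"
reward, feedback = verifiable_reward(attempt, tests)
print(reward)    # 0.0
print(feedback)  # AssertionError traceback: the discarded textual feedback
```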