arXiv:2603.15921v1 Announce Type: cross
Abstract: As Large Language Models shift programming toward human-guided "vibe coding", agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults, a capability central to autonomous software engineering yet never systematically evaluated. We present name, the first empirical decomposition that jointly evaluates two coupled tasks: Fault-Triggering Test Generation (FT-Test), constructing a discriminative witness that exposes a latent bug, and Fault-targeted Program Repair (FPR), repairing the bug under varying diagnostic conditions. name pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down.
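The "discriminative witness" at the core of FT-Test can be illustrated with a minimal sketch: a test input witnesses a fault exactly when a faulty candidate solution and a reference oracle disagree on it. All function names below are illustrative assumptions, not the paper's code; the edge-case bug mirrors the abstract's setup of solutions that pass partial test suites but fail on semantic corner cases.

```python
# Minimal sketch (not the paper's pipeline) of a discriminative witness:
# an input on which a faulty candidate diverges from a reference oracle.

def is_discriminative_witness(test_input, candidate, reference):
    """True iff the input exposes the latent fault, i.e. candidate and
    reference disagree on this input."""
    return candidate(test_input) != reference(test_input)

def reference_max(xs):
    # Reference oracle: correct maximum of a non-empty list.
    return max(xs)

def candidate_max(xs):
    # Hypothetical LLM-generated candidate with a semantic edge-case bug:
    # initializing the running maximum to 0 is wrong for all-negative
    # inputs, yet passes any partial suite of non-negative lists.
    m = 0
    for x in xs:
        if x > m:
            m = x
    return m

# A non-discriminative input (both agree) vs. a genuine witness.
print(is_discriminative_witness([1, 5, 3], candidate_max, reference_max))    # False
print(is_discriminative_witness([-3, -1, -2], candidate_max, reference_max)) # True
```

Under this framing, the abstract's "syntactic validity" corresponds to producing any well-formed input, while "discriminative generation" requires finding an input where the two programs actually diverge.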
Evaluating 12 frontier LLMs, we find that fault-targeted reasoning does not scale with general coding ability. Models produce syntactically valid test inputs at near-ceiling rates yet collapse on discriminative generation, with fault hypothesis generation, not output validation, as the dominant bottleneck. Test-guided repair reveals a complementary insight: when self-generated tests successfully witness a fault, the resulting repair matches or outperforms repair guided by externally provided tests, but tests that fail to witness the fault actively degrade repair below unguided baselines. Together, these results reframe the challenge of autonomous debugging: the binding bottleneck is not code synthesis or test validity but fault-targeted reasoning, a capability that remains deficient across all frontier models.
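The repair finding suggests a simple gating policy, sketched below under assumptions of my own (the paper does not specify this mechanism): use a self-generated test to guide repair only when it actually witnesses the fault, and otherwise fall back to unguided repair, since non-witnessing tests reportedly degrade repair below unguided baselines. All names and the stub repair strategies are hypothetical.

```python
# Hypothetical witness-gated repair policy motivated by the abstract's
# finding; not the paper's method.

def witnesses_fault(buggy_fn, test_case):
    """A test (input, expected) witnesses the fault iff the buggy
    program gets that input wrong."""
    test_input, expected = test_case
    return buggy_fn(test_input) != expected

def gated_repair(buggy_fn, self_test, repair_with_test, repair_unguided):
    """Guide repair with the self-generated test only if it exposes the
    fault; otherwise discard it and repair unguided."""
    if witnesses_fault(buggy_fn, self_test):
        return repair_with_test(buggy_fn, self_test)
    return repair_unguided(buggy_fn)

# Toy demonstration with stub repair strategies.
buggy_abs = lambda x: x        # bug: identity instead of absolute value
witnessing_test = (-2, 2)      # buggy_abs(-2) == -2 != 2: witnesses the fault
non_witnessing_test = (3, 3)   # buggy_abs(3) == 3: does not witness

repair_with_test = lambda fn, t: "test-guided repair"
repair_unguided = lambda fn: "unguided repair"

print(gated_repair(buggy_abs, witnessing_test, repair_with_test, repair_unguided))
print(gated_repair(buggy_abs, non_witnessing_test, repair_with_test, repair_unguided))
```

The gate costs one extra execution of the buggy program per test, and avoids the failure mode where a valid-but-non-discriminative test steers repair toward behavior the program already exhibits.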
BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator
arXiv:2603.15692v1 Announce Type: cross
Abstract: Backdoor attacks compromise model reliability by using triggers to manipulate outputs. Trigger inversion can accurately locate these triggers via a


