arXiv:2604.15224v1 Announce Type: new
Abstract: The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet it rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability in which informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $\Delta V = -9.8$ pp (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.
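The Verdict Shift metric described above can be read as the change, in percentage points, of the unsafe-detection rate between the neutral and consequence-framed conditions over the same fixed responses. A minimal sketch, assuming binary flag/no-flag verdicts (the function name, counts, and verdict encoding here are illustrative, not the paper's implementation):

```python
def verdict_shift(control_verdicts, framed_verdicts):
    """Delta V: change (in percentage points) in the unsafe-detection
    rate when the consequence-framing sentence is added to the system
    prompt, with the evaluated content held strictly constant.

    Verdicts are booleans: True = judge flagged the response as unsafe.
    """
    rate = lambda v: 100.0 * sum(v) / len(v)  # percent of responses flagged
    return rate(framed_verdicts) - rate(control_verdicts)

# Illustrative counts only: detection falling from 32.6% to 22.8%
# yields Delta V = -9.8 pp, a ~30% relative drop, matching the
# magnitudes reported in the abstract.
control = [True] * 163 + [False] * 337  # 163/500 flagged = 32.6%
framed = [True] * 114 + [False] * 386   # 114/500 flagged = 22.8%
print(round(verdict_shift(control, framed), 1))  # -9.8
```

A negative value indicates leniency bias: the framed judge flags less unsafe content than the control judge on identical inputs.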