ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

May 4, 2026

arXiv:2605.00245v1 Announce Type: new
Abstract: Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision-making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards guiding real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern such operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe-Orient-Decide-Act (OODA) decision-making framework. This structure enables systematic testing of accuracy and refusal across military-relevant decision types. The benchmark features a structured 12-category taxonomy, 519 doctrinally grounded prompts, and rigorous evaluation procedures applied to 21 commercial LLMs. Evaluation results reveal critical gaps in safety alignment for military applications.
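To make "testing of accuracy and refusal" across decision types concrete, the following is a minimal sketch of how per-category scores for a multiple-choice benchmark might be tallied. It is not the authors' evaluation code: the record layout (`category`, `gold`, `answer`), the `REFUSE` marker, and the category labels in the usage example are all assumptions made for illustration, and the abstract does not say how refusals are detected or scored.

```python
from collections import defaultdict

# Hypothetical marker for a response in which the model declined to answer.
REFUSAL = "REFUSE"


def score_by_category(records):
    """Tally accuracy and refusal rate per taxonomy category.

    Each record is assumed to be a dict with the keys 'category',
    'gold' (the correct option), and 'answer' (the model's option,
    or REFUSAL). This layout is illustrative, not the paper's format.
    """
    counts = defaultdict(lambda: {"total": 0, "correct": 0, "refused": 0})
    for record in records:
        bucket = counts[record["category"]]
        bucket["total"] += 1
        if record["answer"] == REFUSAL:
            bucket["refused"] += 1
        elif record["answer"] == record["gold"]:
            bucket["correct"] += 1
    # Both rates are computed over all prompts in the category,
    # so a refusal counts against accuracy in this sketch.
    return {
        category: {
            "accuracy": bucket["correct"] / bucket["total"],
            "refusal_rate": bucket["refused"] / bucket["total"],
        }
        for category, bucket in counts.items()
    }


# Usage with made-up records; the category names are placeholders.
sample = [
    {"category": "observe", "gold": "B", "answer": "B"},
    {"category": "observe", "gold": "C", "answer": REFUSAL},
    {"category": "decide", "gold": "A", "answer": "A"},
    {"category": "decide", "gold": "D", "answer": "B"},
]
scores = score_by_category(sample)
```

Reporting the two rates side by side keeps a model that refuses often distinguishable from one that answers incorrectly. Whether refusals should count against accuracy is a design choice this sketch makes on its own.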

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844