Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

May 4, 2026

arXiv:2605.00347v1 Announce Type: cross
Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20–30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
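The abstract contrasts a turn-level critic with critic-free methods such as GRPO but gives no implementation details. As a rough illustration only, the sketch below shows how the two families typically assign credit: standard generalized advantage estimation (GAE) computed once per turn from a critic's value estimates, versus a group-normalized episode return shared by every turn in an episode. The function names, the `gamma` and `lam` defaults, and the zero terminal value are assumptions made for this example and are not taken from the Odysseus framework.

```python
import math
from typing import List


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,  # assumed discount factor, not from the paper
    lam: float = 0.95,    # assumed GAE lambda, not from the paper
) -> List[float]:
    """Standard GAE with one advantage per turn.

    rewards[t] is the reward received after turn t, and values[t] is a
    critic's estimate of the return from turn t onward. The value after
    the final turn is assumed to be 0 (the episode has ended).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the last turn (assumed terminal)
    running = 0.0     # discounted sum of TD errors from later turns
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at turn t
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def group_relative_advantages(
    episode_returns: List[float], eps: float = 1e-8
) -> List[float]:
    """Critic-free, GRPO-style baseline: normalize each episode's total
    return against the other episodes sampled in the same group. Every
    turn of an episode would share that episode's single advantage.
    """
    n = len(episode_returns)
    mean = sum(episode_returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in episode_returns) / n)
    return [(r - mean) / (std + eps) for r in episode_returns]
```

In a 100+ turn episode, the first function yields a distinct learning signal for each turn, while the second gives all turns of an episode the same scalar. That difference is one plausible reading of why a critic might help in long-horizon settings, though the abstract itself does not state the mechanism.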
