GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

March 20, 2026

arXiv:2601.07632v4 Announce Type: replace-cross
Abstract: Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM’s capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments show that our framework improves the aggregated Average by 22.4% over the strongest baseline on HumanML3D and by 14.4% on KIT-ML, while ablations confirm the effectiveness of the tokenizer, projection, and regularization designs.
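The two training ingredients named in the abstract, Gumbel-Softmax code selection and a soft orthonormality constraint, can be pictured with a small NumPy sketch. The abstract does not state the exact loss or sampling details, so everything here is an assumption for illustration: the Frobenius-norm penalty ||C C^T - I||^2, the function names `orthonormal_penalty` and `gumbel_softmax`, and the temperature `tau` are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def orthonormal_penalty(codebook: np.ndarray) -> float:
    """Soft orthonormality penalty ||C C^T - I||_F^2 for a (K, d) codebook.

    Assumed form of the regularizer; it is zero exactly when the K code
    vectors (rows) are mutually orthogonal and have unit norm.
    """
    gram = codebook @ codebook.T              # (K, K) pairwise inner products
    identity = np.eye(codebook.shape[0])
    return float(np.sum((gram - identity) ** 2))


def gumbel_softmax(logits: np.ndarray, tau: float, rng: np.random.Generator) -> np.ndarray:
    """Draw a relaxed one-hot sample over codes along the last axis.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax, which keeps the code selection differentiable in a real
    autodiff framework (NumPy is used here only to show the arithmetic).
    """
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    z = (logits + gumbel_noise) / tau
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical codebook: 4 codes of dimension 8 with orthonormal rows.
    q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    codebook = q[:4]
    print("penalty (orthonormal rows):", orthonormal_penalty(codebook))
    print("penalty (rows scaled by 2):", orthonormal_penalty(2.0 * codebook))

    # Relaxed code assignment for 3 motion frames over the 4 codes.
    logits = rng.normal(size=(3, 4))
    soft_codes = gumbel_softmax(logits, tau=0.5, rng=rng)
    print("soft assignments:\n", soft_codes)
```

In a training loop, a penalty of this kind would typically be added to the main objective with a small weight, which is one way to read the "soft constraints" of the two-stage schedule; the weighting and schedule themselves are not specified in the abstract.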

