MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents

arXiv:2512.11147v1 Announce Type: cross Abstract: Tool calling agents are an emerging paradigm in LLM deployment, with major platforms such as ChatGPT, Claude, and Gemini adding

Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling

arXiv:2512.10980v1 Announce Type: cross Abstract: GPU clusters have become essential for training and deploying modern AI systems, yet real deployments continue to report average utilization

KathDB: Explainable Multimodal Database Management System with Human-AI Collaboration

arXiv:2512.11067v1 Announce Type: cross Abstract: Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but

Agile Flight Emerges from Multi-Agent Competitive Racing

arXiv:2512.11781v1 Announce Type: cross Abstract: Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed

Decoupled Q-Chunking

arXiv:2512.10926v2 Announce Type: replace-cross Abstract: Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

December 15, 2025

arXiv:2512.02551v2 Announce Type: replace-cross
Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to Nvidia’s state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields on average +22.0% over torch.matmul; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic’s suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
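To make the two measurement modes and the percentage figures concrete, here is a minimal Python sketch. It is not the CUDA-L2 benchmarking code: the names `time_kernel` and `speedup_percent`, the `max_gap_s` parameter, and the definition of speedup as `baseline_time / candidate_time - 1` are all assumptions made for illustration, since the abstract does not spell out the exact formula or the distribution of the random intervals.

```python
import random
import time
from statistics import mean
from typing import Callable, Optional


def time_kernel(run: Callable[[], None], iterations: int = 50,
                max_gap_s: Optional[float] = None, seed: int = 0) -> float:
    """Return the mean wall-clock time of one call to `run`, in seconds.

    max_gap_s=None  -> "offline"-style: calls are issued back to back.
    max_gap_s=float -> "server"-style: a random idle gap precedes each call.
    Both the parameter and the uniform gap distribution are hypothetical.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(iterations):
        if max_gap_s is not None:
            # The idle gap itself is not part of the timed region.
            time.sleep(rng.uniform(0.0, max_gap_s))
        start = time.perf_counter()
        # For a real GPU kernel, `run` must block until the device has
        # finished, because CUDA kernel launches are asynchronous.
        run()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative speedup of candidate over baseline, as a percentage.

    Assumed definition: (baseline_time / candidate_time - 1) * 100.
    """
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("timings must be positive")
    return (baseline_s / candidate_s - 1.0) * 100.0
```

Under this assumed definition, a baseline that takes 1.22 time units against a candidate that takes 1.00 corresponds to a speedup of about 22%. Timing each call separately, rather than the whole loop, keeps the idle gaps in the server-style mode out of the measurement.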

