Decoupled Q-Chunking

arXiv:2512.10926v2 Announce Type: replace-cross Abstract: Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a

  • Home
  • Uncategorized
  • Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling

Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling

arXiv:2512.10980v1 Announce Type: cross
Abstract: GPU clusters have become essential for training and deploying modern AI systems, yet real deployments continue to report average utilization near 50%. This inefficiency is largely caused by fragmentation, heterogeneous workloads, and the limitations of static scheduling policies. This work presents a systematic evaluation of these issues and introduces three specialized dynamic schedulers: Hybrid Priority (HPS), Predictive Backfill (PBS), and Smart Batch (SBS). These schedulers are designed to improve utilization, fairness, and overall throughput in multi-tenant GPU clusters. We evaluate all schedulers using a controlled simulation of 1,000 AI jobs on a 64-GPU, 8-node cluster that includes a realistic mix of training, inference, and research workloads. Static baselines (FIFO, SJF, Shortest, Shortest-GPU) achieve 45 to 67% GPU utilization and 12.5 to 18.3 jobs per hour and experience severe starvation, with as many as 156 jobs waiting longer than 30 minutes. The dynamic schedulers significantly outperform these policies. HPS achieves the highest utilization (78.2%), highest throughput (25.8 jobs per hour), and the lowest fairness variance among dynamic methods (457), reducing starvation to 12 jobs. PBS improves fragmentation handling and reaches 76.1% utilization, while SBS increases efficiency for structurally similar jobs and reaches 74.6% utilization. Across all key metrics, including throughput, job wait times, fairness variance, and starvation, dynamic multi-objective schedulers consistently outperform single-objective heuristics. These results show that targeted and transparent scheduling strategies can meaningfully increase GPU efficiency in heterogeneous AI clusters and provide a practical foundation for future production scheduling frameworks.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registeration number 16808844