HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

April 27, 2026

arXiv:2508.15919v3 Announce Type: replace-cross
Abstract: Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements.
We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a scheduler that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. HFX also integrates a scaler that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. Additionally, the system supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments.
Through extensive experiments on multi-task workloads, we demonstrate consistently higher SLO attainment, lower end-to-end latency, and lower NPU usage cost by up to 4.44×, 65.82%, and 49.81%, respectively, compared to state-of-the-art systems. Our results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.
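The abstract does not specify how the scheduler estimates budgets or ranks requests. As a rough illustration only, the sketch below shows one common way to do SLO-aware prioritization: estimate each request's remaining service time, subtract it from the time left until its deadline, and serve the request with the least slack first. The `Request` fields, the linear per-token cost model, and the rate constants are all assumptions made for this example and are not taken from HFX.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative, not from HFX."""
    req_id: str
    arrival_s: float           # arrival time, in seconds
    slo_s: float               # end-to-end latency target, in seconds
    prompt_tokens: int
    max_new_tokens: int
    generated_tokens: int = 0  # greater than 0 for in-flight requests
    prefilled: bool = False    # True once the prompt has been processed


def estimate_remaining_s(req, prefill_s_per_token, decode_s_per_token):
    """Assumed linear cost model: remaining prefill time plus remaining decode time."""
    prefill = 0.0 if req.prefilled else req.prompt_tokens * prefill_s_per_token
    decode = (req.max_new_tokens - req.generated_tokens) * decode_s_per_token
    return prefill + decode


def slack_s(req, now_s, prefill_s_per_token, decode_s_per_token):
    """Time to spare before the SLO deadline; a negative value means the SLO is at risk."""
    deadline_s = req.arrival_s + req.slo_s
    remaining = estimate_remaining_s(req, prefill_s_per_token, decode_s_per_token)
    return deadline_s - now_s - remaining


def prioritize(requests, now_s, prefill_s_per_token=0.001, decode_s_per_token=0.02):
    """Order new and in-flight requests by least slack first (rates are made-up defaults)."""
    return sorted(
        requests,
        key=lambda r: slack_s(r, now_s, prefill_s_per_token, decode_s_per_token),
    )


new_req = Request("new", arrival_s=0.0, slo_s=10.0, prompt_tokens=1000, max_new_tokens=100)
inflight_req = Request(
    "inflight", arrival_s=0.5, slo_s=2.0, prompt_tokens=200,
    max_new_tokens=50, generated_tokens=10, prefilled=True,
)
order = [r.req_id for r in prioritize([new_req, inflight_req], now_s=1.0)]
print(order)
```

With these made-up numbers the in-flight request has roughly 0.7 s of slack against about 6 s for the new one, so it is ranked first. A production scheduler would presumably replace the linear cost model with measured or learned estimates and account for batching effects.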

