AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

arXiv:2510.07842v2
Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance with them remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods face a dilemma: off-policy distillation provides high-quality supervision but suffers from exposure bias (a mismatch between training and inference), while on-policy approaches keep training and inference consistent but are limited by the low quality of student-generated outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation via an adaptive switching mechanism. AdaSwitch lets the student explore its own predictions within its capability and integrates teacher guidance selectively, only when the divergence exceeds a context-aware threshold. This paradigm preserves generation consistency while ensuring high-quality supervision. Experiments on three datasets demonstrate that AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead.
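
The switching idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the divergence measure, the form of the context-aware threshold, or what happens after a switch, so everything below is an assumption made for illustration: the divergence is taken to be KL(teacher || student) on the next-token distribution, the threshold is a multiple (`ratio`) of the running mean divergence over the sequence so far, and the teacher generates all remaining tokens once a switch occurs. The names `generate_with_switch`, `toy_teacher`, and `toy_student` are hypothetical and are not taken from the paper.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_with_switch(student, teacher, prompt, max_len, ratio=2.0, seed=0):
    """Generate a sequence that starts on-policy and may hand over to the teacher.

    Assumed rule (not from the paper): switch when the current divergence
    exceeds `ratio` times the mean divergence observed so far in this sequence.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    sources = []      # which model produced each generated token
    history = []      # divergences observed while the student was generating
    use_teacher = False
    for _ in range(max_len):
        t_probs = teacher(tokens)
        if not use_teacher:
            s_probs = student(tokens)
            div = kl_divergence(t_probs, s_probs)
            if history and div > ratio * (sum(history) / len(history)):
                use_teacher = True  # student has drifted; teacher takes over
            else:
                history.append(div)
                tokens.append(sample(s_probs, rng))
                sources.append("student")
                continue
        tokens.append(sample(t_probs, rng))
        sources.append("teacher")
    return tokens, sources


# Toy models over a 3-token vocabulary (purely illustrative).
def toy_teacher(prefix):
    return [0.7, 0.2, 0.1]


def toy_student(prefix):
    # Roughly agrees with the teacher for short prefixes, then drifts.
    return [0.6, 0.3, 0.1] if len(prefix) < 3 else [0.1, 0.1, 0.8]


tokens, sources = generate_with_switch(toy_student, toy_teacher, prompt=[], max_len=6)
print(sources)
```

With these toy models the student produces the first three tokens and the teacher produces the last three, because the divergence jumps well above the running mean at the fourth step. The resulting mixed sequence would then serve as the distillation input; the loss used on it is outside the scope of this sketch.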

