
RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

arXiv:2507.22219v3 Announce Type: replace-cross
Abstract: Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish, Chinese, Korean, and Japanese) show that RLfR consistently outperforms strong MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation, and achieving superior performance under LLM-based judge evaluations.
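
As a rough illustration of the composite reward the abstract describes, the sketch below combines a scaled negative edit distance between the actor's draft and the teacher's refinement with a COMET adequacy term. The `alpha`/`beta` weights, the token-level normalization, and the `comet_score` stub are assumptions made for illustration, not the paper's exact formulation; in practice the adequacy term would call a trained COMET model (e.g. via the unbabel-comet package).

```python
# Minimal sketch of an RLfR-style composite reward (assumed weighting,
# not the paper's exact recipe).

def levenshtein(a: str, b: str) -> int:
    """Token-level edit distance between two strings."""
    xs, ys = a.split(), b.split()
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        curr = [i]
        for j, y in enumerate(ys, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def comet_score(source: str, hypothesis: str, reference: str) -> float:
    """Placeholder for a COMET quality estimate in [0, 1].
    A real run would load a COMET checkpoint instead of this stub."""
    raise NotImplementedError

def rlfr_reward(source: str, draft: str, refinement: str,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """Composite reward: scaled negative edit distance between the
    actor's draft and the teacher's minimal refinement (lexical and
    structural fidelity) plus COMET (semantic adequacy)."""
    max_len = max(len(draft.split()), len(refinement.split()), 1)
    fidelity = -levenshtein(draft, refinement) / max_len  # in [-1, 0]
    adequacy = comet_score(source, draft, refinement)     # in [0, 1]
    return alpha * fidelity + beta * adequacy
```

In the on-policy loop the abstract outlines, this reward would be computed per sampled draft against the frozen teacher's edit of that same draft, so the signal stays conditioned on the actor's current outputs rather than on a fixed reference.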
