Abstract: Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2601.19897 [cs.LG] (or arXiv:2601.19897v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2601.19897 (arXiv-issued DOI via DataCite)
Submission history: From Idan Shenfeld. [v1] Tue, 27 Jan 2026 18:59:08 UTC (1,240 KB)
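
The abstract describes the core idea only at a high level: the same model, once conditioned on a demonstration, acts as teacher for its unconditioned self, and training uses tokens the student itself generates. The following is a minimal toy sketch of that idea, not the authors' implementation. The abstract does not state the loss, so the per-token reverse KL used here is an assumption borrowed from common on-policy distillation practice, and `toy_model`, `self_distillation_loss`, and the integer-token vocabulary are all hypothetical names invented for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def toy_model(context, vocab_size=4):
    """Hypothetical stand-in for a language model.

    Returns next-token probabilities that depend on how often each
    token id appears in the context, so extra context shifts the output.
    """
    logits = [0.0] * vocab_size
    for tok in context:
        logits[tok % vocab_size] += 1.0
    return softmax(logits)


def self_distillation_loss(model, prompt, demonstration, num_tokens, rng):
    """Average per-token reverse KL along a student-sampled rollout.

    Student: model(prompt + generated tokens).
    Teacher: the same model, with the demonstration prepended in context.
    Assumption: in a real training loop the teacher distribution would
    be treated as a fixed target (no gradient); this toy has no gradients.
    """
    student_ctx = list(prompt)
    teacher_ctx = list(demonstration) + list(prompt)
    total = 0.0
    for _ in range(num_tokens):
        p_student = model(student_ctx)
        p_teacher = model(teacher_ctx)
        total += reverse_kl(p_student, p_teacher)
        # On-policy step: the next token is sampled from the student,
        # and both contexts are extended with that same token.
        tok = rng.choices(range(len(p_student)), weights=p_student)[0]
        student_ctx.append(tok)
        teacher_ctx.append(tok)
    return total / num_tokens


if __name__ == "__main__":
    loss = self_distillation_loss(toy_model, [0], [1, 1, 1], 5, random.Random(0))
    print(f"toy loss with demonstration: {loss:.4f}")
```

With an empty demonstration the teacher and student see identical contexts, so the loss in this sketch is zero; a non-empty demonstration shifts the teacher's distribution and yields a positive signal for the student to follow.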