LIFT reformulates KD as a Coarse-to-Fine framework. By parameterizing the KD objective with linear regression, it first captures ①Coarse-Easy Errors, then refines the remaining ②Fine-Hard Errors, and governs the transition from coarse to fine through ③Adaptive Weighting.
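The coarse-to-fine split can be illustrated with a small sketch. It is only one possible reading of the description above, not the paper's implementation: it assumes a scalar least-squares fit `teacher ~ a * student + b`, treats the linearly correctable mismatch as the coarse error and the fit residual as the fine error, and uses a hypothetical weighting rule based on the relative size of the two terms. The function names `linear_fit`, `coarse_fine_errors`, and `adaptive_loss` are illustrative.

```python
from statistics import fmean


def linear_fit(student, teacher):
    """Least-squares fit teacher ~ a * student + b over flat lists of floats."""
    ms, mt = fmean(student), fmean(teacher)
    var = sum((s - ms) ** 2 for s in student)
    cov = sum((s - ms) * (t - mt) for s, t in zip(student, teacher))
    a = cov / var if var > 0 else 1.0  # constant student output: only the shift matters
    b = mt - a * ms
    return a, b


def coarse_fine_errors(student, teacher):
    """Split the mean squared error into a linearly correctable part and a residual."""
    a, b = linear_fit(student, teacher)
    fitted = [a * s + b for s in student]
    # Coarse (assumed): mismatch that a global scale and shift would remove.
    coarse = fmean([(f - s) ** 2 for f, s in zip(fitted, student)])
    # Fine (assumed): what remains after the best linear correction.
    fine = fmean([(t - f) ** 2 for t, f in zip(teacher, fitted)])
    return coarse, fine


def adaptive_loss(student, teacher, eps=1e-12):
    """Hypothetical weighting: emphasize whichever term currently dominates."""
    coarse, fine = coarse_fine_errors(student, teacher)
    w_coarse = coarse / (coarse + fine + eps)
    return w_coarse * coarse + (1.0 - w_coarse) * fine


student = [0.0, 1.0, 2.0, 3.0]
teacher = [1.0, 2.5, 5.5, 7.0]
coarse, fine = coarse_fine_errors(student, teacher)
mse = fmean([(t - s) ** 2 for s, t in zip(student, teacher)])
print(coarse, fine, mse)
```

Because a least-squares fit with an intercept leaves a residual orthogonal to both the student output and the constant vector, the two terms add up to the ordinary mean squared error, so the split is a reweighting of the usual KD loss rather than a different target. In gradient-based training the fitted coefficients would presumably be held constant during backpropagation; that detail is an assumption here.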
PLACE extends LIFT from single-set coefficient estimation to multi-set coefficient estimation by grouping elements according to distillation error magnitude. Because distillation error is spatially non-uniform, estimating a single coefficient set for the entire output is often insufficient.
Single-set coefficient estimation cannot fully capture spatially diverse error patterns.
Error-based grouping partitions output elements into distinct local regimes according to their difficulty.
Multi-set coefficient estimation allows separate local fits for each group, similar to piecewise regression.
This enables locally adaptive guidance by assigning optimal coefficients to different spatial regions, allowing LIFT to more faithfully handle spatially varying distillation difficulty.
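The grouping idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: elements are ranked by absolute error and split into equally sized groups, and each group receives its own least-squares scale and shift. The equal-size split, the function names `place_fit` and `piecewise_sse`, and the `num_groups` parameter are hypothetical choices made for the example.

```python
from statistics import fmean


def linear_fit(student, teacher):
    """Least-squares fit teacher ~ a * student + b over flat lists of floats."""
    ms, mt = fmean(student), fmean(teacher)
    var = sum((s - ms) ** 2 for s in student)
    cov = sum((s - ms) * (t - mt) for s, t in zip(student, teacher))
    a = cov / var if var > 0 else 1.0
    return a, mt - a * ms


def place_fit(student, teacher, num_groups=2):
    """Group elements by error magnitude and fit one (a, b) pair per group."""
    errors = [abs(t - s) for s, t in zip(student, teacher)]
    order = sorted(range(len(errors)), key=errors.__getitem__)  # easy -> hard
    size = -(-len(order) // num_groups)  # ceiling division
    groups = []
    for g in range(num_groups):
        idx = order[g * size:(g + 1) * size]
        if not idx:
            continue
        a, b = linear_fit([student[i] for i in idx], [teacher[i] for i in idx])
        groups.append((idx, a, b))
    return groups


def piecewise_sse(student, teacher, groups):
    """Sum of squared residuals of the per-group linear fits."""
    return sum(
        (teacher[i] - (a * student[i] + b)) ** 2
        for idx, a, b in groups
        for i in idx
    )


# Two regimes: a small shift on the first half, a large rescaling on the second.
student = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
teacher = [0.1, 1.1, 2.1, 3.1, 2.0, 5.0, 8.0, 11.0]
single = place_fit(student, teacher, num_groups=1)
multi = place_fit(student, teacher, num_groups=2)
print(piecewise_sse(student, teacher, single), piecewise_sse(student, teacher, multi))
```

For any fixed partition, the per-group fits cannot have a larger total residual than a single global fit, since the global coefficients remain a valid candidate within each group. In the toy data, one coefficient set leaves a large residual, while two error-based groups recover both regimes almost exactly.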
Additional results on text-to-image diffusion (SD 2.1) and DiT can be found in the main paper.
We demonstrate that, in knowledge distillation for diffusion models, the teacher network’s highly complex denoising process, which stems from its substantially larger capacity, is difficult for the student model to mimic faithfully.
To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a “coarse” alignment and a “fine” refinement. The student is trained on the coarse alignment before proceeding to the harder fine refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance.
Comprehensive experiments demonstrate that our method, LIFT with PLACE, outperforms previous knowledge distillation methods on diffusion models based on both U-Net and DiT architectures. Furthermore, under extreme compression with a 1.3M-parameter student amounting to only 1.6% of the teacher’s parameters, conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50–200+. In contrast, our method converges stably and achieves an FID of 15.73.