LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Ulsan National Institute of Science and Technology (UNIST)

A larger teacher is not always a better teacher for lightweight diffusion models. As the teacher-student capacity gap widens, distillation becomes harder to optimize, less stable, and less effective.

Distillation Error Analysis


Method

LInear FiTting-based Knowledge Distillation (LIFT)

LIFT reformulates KD as a Coarse-to-Fine framework. By parameterizing the KD objective with linear regression, it first captures Coarse-Easy Errors and then refines the remaining Fine-Hard Errors. Adaptive Weighting controls the transition from the coarse stage to the fine stage.

Parameterizing KD with Linear Regression
Conventional KD
\[ \arg\min_{\theta^{\mathcal S}} \mathcal{D}(\epsilon^{\mathcal T}, \epsilon^{\mathcal S}) \]
LIFT Parameterization
\[ \begin{aligned} \arg\min_{\theta^{\mathcal S}} \quad & \mathcal{D}(\epsilon^{\mathcal T}, \beta_0 + \beta_1 \epsilon^{\mathcal S}) \\ \text{s.t.} \quad & \beta_0 = 0,\ \beta_1 = 1. \end{aligned} \]
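The page does not spell out how \(\beta_0\) and \(\beta_1\) are obtained. As an illustrative assumption only, if they are estimated by an ordinary least-squares fit of the teacher output on the student output over the \(N\) output elements (indexed by \(i\), with \(\bar\epsilon\) denoting the element-wise mean), the standard closed form would be:

\[ \beta_1 = \frac{\sum_{i=1}^{N} (\epsilon^{\mathcal S}_i - \bar\epsilon^{\mathcal S})(\epsilon^{\mathcal T}_i - \bar\epsilon^{\mathcal T})}{\sum_{i=1}^{N} (\epsilon^{\mathcal S}_i - \bar\epsilon^{\mathcal S})^2}, \qquad \beta_0 = \bar\epsilon^{\mathcal T} - \beta_1\, \bar\epsilon^{\mathcal S}. \]

Under this assumption, a student that matches the teacher exactly yields \(\beta_0 = 0\) and \(\beta_1 = 1\), which is the constraint in the LIFT parameterization.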
Coarse-Easy Alignment
Capture low-order teacher-student mismatch through coarse alignment.
\[ \mathcal{L}_{\text{coarse}} = \lvert \beta_0 \rvert + \lvert \beta_1 - 1 \rvert. \]
Fine-Hard Refinement
Refine the remaining hard residuals beyond coarse alignment.
\[ \mathcal{L}_{\text{fine}} = \left\lVert \epsilon^{\mathcal T} - (\beta_0 + \beta_1 \epsilon^{\mathcal S}) \right\rVert_2^2. \]
Adaptive Weighting
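Gradually shift training from coarse alignment to fine refinement. The weight \(w\) is 0 while \(\mathcal{L}_{\text{coarse}} \ge 1\) and rises toward 1 as \(\mathcal{L}_{\text{coarse}}\) approaches 0.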
\[ \begin{aligned} \mathcal{L}_{\text{LIFT}} &= \mathcal{L}_{\text{coarse}} + w \cdot \mathcal{L}_{\text{fine}}, \\ w &= 1 - \min(1, \mathcal{L}_{\text{coarse}}). \end{aligned} \]
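The three terms above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes that \(\beta_0\) and \(\beta_1\) come from an ordinary least-squares fit of the teacher output on the student output (the page does not state the estimator), it uses flat lists of floats as stand-ins for noise predictions, and the names `fit_affine` and `lift_loss` are hypothetical. In real training the same computation would run in an autograd framework so that gradients flow through the coefficients to the student.

```python
def fit_affine(student, teacher):
    """Least-squares fit teacher ~ b0 + b1 * student (assumed estimator)."""
    n = len(student)
    mean_s = sum(student) / n
    mean_t = sum(teacher) / n
    var_s = sum((s - mean_s) ** 2 for s in student)
    cov_st = sum((s - mean_s) * (t - mean_t) for s, t in zip(student, teacher))
    # Fall back to an identity slope when the student output is constant.
    b1 = cov_st / var_s if var_s > 0 else 1.0
    b0 = mean_t - b1 * mean_s
    return b0, b1


def lift_loss(student, teacher):
    """Return (total, coarse, fine, w) following the LIFT equations."""
    b0, b1 = fit_affine(student, teacher)
    # Coarse term: distance of the fitted coefficients from (0, 1).
    coarse = abs(b0) + abs(b1 - 1.0)
    # Fine term: squared L2 norm of the residual left after the affine fit.
    fine = sum((t - (b0 + b1 * s)) ** 2 for s, t in zip(student, teacher))
    # Adaptive weight: 0 while coarse >= 1, approaching 1 as coarse -> 0.
    w = 1.0 - min(1.0, coarse)
    return coarse + w * fine, coarse, fine, w


if __name__ == "__main__":
    student = [0.0, 1.0, 2.0, 3.0]
    teacher = [1.0, 3.0, 5.0, 7.0]  # teacher = 1 + 2 * student
    print(lift_loss(student, teacher))
```

In the toy example the teacher is an exact affine function of the student, so the fine residual is zero and the whole loss comes from the coarse term (\(\lvert 1 \rvert + \lvert 2 - 1 \rvert = 2\)), with \(w = 0\). The fine term only starts to contribute once the coarse loss drops below 1.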

Piecewise Local Adaptive Coefficient Estimation (PLACE)

PLACE extends LIFT from single-set coefficient estimation to multi-set coefficient estimation by grouping elements according to distillation error magnitude. Because distillation error is spatially non-uniform, estimating a single coefficient set for the entire output is often insufficient.

Single-set coefficient estimation cannot fully capture spatially diverse error patterns.

Error-based grouping partitions output elements into distinct local regimes according to their difficulty.

Multi-set coefficient estimation allows separate local fits for each group, similar to piecewise regression.

PLACE as piecewise regression

This enables locally adaptive guidance by assigning optimal coefficients to different spatial regions, allowing LIFT to more faithfully handle spatially varying distillation difficulty.
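The grouping and per-group fitting described above can be sketched as follows. This is only an illustration under stated assumptions: elements are ranked by the absolute teacher-student difference and split into equal-size groups, each group gets its own least-squares coefficient pair, and the outputs are flat lists of floats. The page does not specify the grouping rule, the number of groups, or the estimator, and the names `fit_affine`, `place_groups`, and `place_fit` are hypothetical.

```python
def fit_affine(student, teacher):
    """Least-squares fit teacher ~ b0 + b1 * student (assumed estimator)."""
    n = len(student)
    mean_s = sum(student) / n
    mean_t = sum(teacher) / n
    var_s = sum((s - mean_s) ** 2 for s in student)
    cov_st = sum((s - mean_s) * (t - mean_t) for s, t in zip(student, teacher))
    # Fall back to an identity slope when the student output is constant.
    b1 = cov_st / var_s if var_s > 0 else 1.0
    return mean_t - b1 * mean_s, b1


def place_groups(student, teacher, num_groups=2):
    """Split element indices into groups ordered by |teacher - student|."""
    order = sorted(range(len(student)),
                   key=lambda i: abs(teacher[i] - student[i]))
    size = max(1, -(-len(order) // num_groups))  # ceiling division
    return [order[k:k + size] for k in range(0, len(order), size)]


def place_fit(student, teacher, num_groups=2):
    """Return one (indices, b0, b1) tuple per error-based group."""
    fits = []
    for idx in place_groups(student, teacher, num_groups):
        s = [student[i] for i in idx]
        t = [teacher[i] for i in idx]
        fits.append((idx, *fit_affine(s, t)))
    return fits


if __name__ == "__main__":
    student = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
    # First half already matches the teacher; second half is off by 1 + 2 * s.
    teacher = [0.0, 1.0, 2.0, 3.0, 1.0, 3.0, 5.0, 7.0]
    print("single set:", fit_affine(student, teacher))
    for idx, b0, b1 in place_fit(student, teacher):
        print("group", idx, "b0 =", b0, "b1 =", b1)
```

In this toy input a single coefficient set averages over two different regimes, while the two error-based groups each recover their own affine relation. Each group's coefficients could then be plugged into the LIFT coarse and fine terms; how the per-group losses are combined is not stated on this page.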

Experiments

Qualitative Results

Unconditional Generation



Conditional Generation


Quantitative Results

Additional results on text-to-image diffusion (SD 2.1) and DiT can be found in the main paper.


Abstract

We demonstrate that, in knowledge distillation for diffusion models, the teacher network’s highly complex denoising process, which stems from its substantially larger capacity, is difficult for the student model to mimic faithfully.

To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a “coarse” alignment and a “fine” refinement. The student is trained on the coarse alignment first and then proceeds to the harder fine refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance.

Comprehensive experiments show that our method, LIFT with PLACE, outperforms previous knowledge distillation methods on diffusion models built on both U-Net and DiT architectures. Furthermore, under extreme compression with a 1.3M-parameter student amounting to only 1.6% of the teacher’s parameters, conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50–200+. In contrast, our method converges stably and achieves an FID of 15.73.