LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Distillation Error Analysis

Decomposing Distillation Error

Distillation error is not monolithic. A simple linear regression probe, \( \hat{\epsilon}^{\mathcal T} = \hat{\beta}_0 + \hat{\beta}_1 \epsilon^{\mathcal S} \), reveals two distinct components: Coarse-Easy Errors and Fine-Hard Errors.

(1) Coarse-Easy Errors. These errors arise from low-order statistical mismatch between teacher and student, such as differences in mean and variance. They can be captured by just two linear regression coefficients (the intercept \( \hat{\beta}_0 \) and the slope \( \hat{\beta}_1 \)), and correcting them helps stabilize the denoising process.

(2) Fine-Hard Errors. These errors are the remaining non-linear residuals beyond low-order moment alignment. They encode fine-grained teacher behavior and explain the remaining performance gap under aggressive compression.
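
As an illustration of the probe above, the following is a minimal NumPy sketch, assuming the coefficients are fit by ordinary least squares over all output elements. The function name `fit_linear_probe` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_linear_probe(eps_student, eps_teacher, eps=1e-12):
    """Least-squares fit of eps_teacher ~ beta0 + beta1 * eps_student.

    Assumption: one (beta0, beta1) pair is fit over all elements.
    Returns the coefficients and the residual left after the linear fit.
    """
    s = np.asarray(eps_student, dtype=np.float64).ravel()
    t = np.asarray(eps_teacher, dtype=np.float64).ravel()
    s_mean, t_mean = s.mean(), t.mean()
    var_s = np.mean((s - s_mean) ** 2)
    cov_st = np.mean((s - s_mean) * (t - t_mean))
    beta1 = cov_st / (var_s + eps)   # slope: scale mismatch
    beta0 = t_mean - beta1 * s_mean  # intercept: mean mismatch
    residual = t - (beta0 + beta1 * s)  # what the linear fit cannot explain
    return beta0, beta1, residual

# Toy data: a shifted and scaled student output plus a small non-linear term.
rng = np.random.default_rng(0)
eps_s = rng.standard_normal((3, 8, 8))
eps_t = 0.3 + 1.5 * eps_s + 0.05 * np.tanh(3.0 * eps_s)
beta0, beta1, residual = fit_linear_probe(eps_s, eps_t)
print("coarse part:", beta0, beta1)
print("fine part (residual energy):", np.sum(residual ** 2))
```

In this reading, the distance of `(beta0, beta1)` from `(0, 1)` corresponds to the Coarse-Easy component, and the residual corresponds to the Fine-Hard component.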

Conventional KD learns these two components simultaneously. As a result, optimization becomes unstable under large capacity gaps, preventing even the easy errors from being learned reliably.

Spatially Non-uniform Distillation Error

Across denoising steps. Distillation error exhibits distinct spatial patterns at each denoising step.

Across training iterations. Distillation error concentrates on different regions during training.

Distillation error is spatially non-uniform across both diffusion timesteps and training iterations. This suggests that distillation difficulty depends not only on what is being learned, but also on when and where the error arises.
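
One way to inspect this non-uniformity is to compute a per-location error map and check how much of the total error falls on a small share of locations. The sketch below is only illustrative: the channel-first layout `(C, H, W)`, the helper names, and the top-fraction statistic are assumptions made for this example, not quantities defined in the paper.

```python
import numpy as np

def error_map(eps_student, eps_teacher):
    """Per-location squared error, averaged over the channel axis.

    Assumption: inputs have shape (C, H, W); the result has shape (H, W).
    """
    diff = np.asarray(eps_teacher, dtype=np.float64) - np.asarray(eps_student, dtype=np.float64)
    return np.mean(diff ** 2, axis=0)

def top_fraction_share(err_map, fraction=0.1):
    """Share of total error carried by the highest-error `fraction` of locations."""
    flat = np.sort(np.asarray(err_map, dtype=np.float64).ravel())[::-1]
    k = max(1, int(round(fraction * flat.size)))
    total = flat.sum()
    if total == 0.0:
        return 0.0  # no error anywhere
    return float(flat[:k].sum() / total)

# Toy example: the student is accurate everywhere except one 4x4 patch.
rng = np.random.default_rng(0)
eps_t = rng.standard_normal((3, 16, 16))
eps_s = eps_t + 0.01 * rng.standard_normal((3, 16, 16))
eps_s[:, :4, :4] += 1.0
emap = error_map(eps_s, eps_t)
print(emap.shape, top_fraction_share(emap, fraction=0.1))
```

A share close to `fraction` indicates roughly uniform error, while a share close to 1 indicates error concentrated in a few regions. Repeating the measurement at different timesteps or training iterations would show where the concentration moves.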

Method

LInear FiTting-based Knowledge Distillation (LIFT)

LIFT reformulates KD as a coarse-to-fine framework. By parameterizing the KD objective with linear regression, it first captures ① Coarse-Easy Errors, then refines the remaining ② Fine-Hard Errors, and controls the transition between the two through ③ Adaptive Weighting.

Parameterizing KD with Linear Regression

Conventional KD

\[ \arg\min_{\theta^{\mathcal S}} \mathcal{D}(\epsilon^{\mathcal T}, \epsilon^{\mathcal S}) \]

→

LIFT Parameterization

\[ \begin{aligned} \arg\min_{\theta^{\mathcal S}} \quad & \mathcal{D}(\epsilon^{\mathcal T}, \beta_0 + \beta_1 \epsilon^{\mathcal S}) \\ \text{s.t.} \quad & \beta_0 = 0,\ \beta_1 = 1. \end{aligned} \]

① Coarse-Easy Alignment

Capture low-order teacher-student mismatch through coarse alignment.

\[ \mathcal{L}_{\text{coarse}} = \lvert \beta_0 \rvert + \lvert \beta_1 - 1 \rvert. \]

② Fine-Hard Refinement

Refine the remaining hard residuals beyond coarse alignment.

\[ \mathcal{L}_{\text{fine}} = \left\lVert \epsilon^{\mathcal T} - (\beta_0 + \beta_1 \epsilon^{\mathcal S}) \right\rVert_2^2. \]

③ Adaptive Weighting

Gradually shift training from coarse alignment to fine refinement.

\[ \begin{aligned} \mathcal{L}_{\text{LIFT}} &= \mathcal{L}_{\text{coarse}} + w \cdot \mathcal{L}_{\text{fine}}, \\ w &= 1 - \min(1, \mathcal{L}_{\text{coarse}}). \end{aligned} \]
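
The three terms above can be evaluated as in the following NumPy sketch. It assumes that \( \beta_0 \) and \( \beta_1 \) are obtained by a closed-form least-squares fit of the teacher output on the student output, which this page does not spell out. The function name `lift_loss` is hypothetical, and the sketch only computes loss values, so it says nothing about how gradients flow through the coefficients or through \( w \) during training.

```python
import numpy as np

def lift_loss(eps_student, eps_teacher, eps=1e-12):
    """Evaluate L_coarse, L_fine, w, and L_LIFT for one pair of outputs.

    Assumption: (beta0, beta1) come from an ordinary least-squares fit
    of eps_teacher on eps_student over all elements.
    """
    s = np.asarray(eps_student, dtype=np.float64).ravel()
    t = np.asarray(eps_teacher, dtype=np.float64).ravel()
    s_mean, t_mean = s.mean(), t.mean()
    beta1 = np.mean((s - s_mean) * (t - t_mean)) / (np.mean((s - s_mean) ** 2) + eps)
    beta0 = t_mean - beta1 * s_mean

    l_coarse = abs(beta0) + abs(beta1 - 1.0)         # |beta0| + |beta1 - 1|
    l_fine = np.sum((t - (beta0 + beta1 * s)) ** 2)  # squared L2 norm of the residual
    w = 1.0 - min(1.0, l_coarse)                     # w rises toward 1 as coarse error shrinks
    total = l_coarse + w * l_fine
    return total, l_coarse, l_fine, w

# Toy check: a student that is off by a small shift and scale.
s = np.linspace(-1.0, 1.0, 50)
total, l_coarse, l_fine, w = lift_loss(s, 0.1 + 1.2 * s)
print(total, l_coarse, l_fine, w)
```

When the coarse mismatch is large (\( \mathcal{L}_{\text{coarse}} \ge 1 \)), the weight is \( w = 0 \) and only coarse alignment is optimized. As \( \mathcal{L}_{\text{coarse}} \) approaches 0, \( w \) approaches 1 and the fine term takes full effect.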

Piecewise Local Adaptive Coefficient Estimation (PLACE)

PLACE extends LIFT from single-set coefficient estimation to multi-set coefficient estimation by grouping elements according to distillation error magnitude. Because distillation error is spatially non-uniform, estimating a single coefficient set for the entire output is often insufficient.

Single-set coefficient estimation cannot fully capture spatially diverse error patterns.

Error-based grouping partitions output elements into distinct local regimes according to their difficulty.

Multi-set coefficient estimation allows separate local fits for each group, similar to piecewise regression.

This enables locally adaptive guidance by assigning optimal coefficients to different spatial regions, allowing LIFT to more faithfully handle spatially varying distillation difficulty.
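
The grouping and per-group fitting described above can be sketched as follows. The grouping rule used here (quantile bins of the absolute teacher-student difference), the default number of groups, the identity fallback for near-empty groups, and the function name `place_coefficients` are all assumptions made for illustration. The paper's exact procedure may differ.

```python
import numpy as np

def place_coefficients(eps_student, eps_teacher, num_groups=4, eps=1e-12):
    """Fit one (beta0, beta1) pair per error-magnitude group.

    Assumption: elements are binned by quantiles of |teacher - student|.
    Returns the group index of every element and a list of coefficient pairs.
    """
    s_arr = np.asarray(eps_student, dtype=np.float64)
    s = s_arr.ravel()
    t = np.asarray(eps_teacher, dtype=np.float64).ravel()

    err = np.abs(t - s)
    # Interior quantile edges split the elements into num_groups bins.
    edges = np.quantile(err, np.linspace(0.0, 1.0, num_groups + 1)[1:-1])
    group_ids = np.searchsorted(edges, err, side="right")  # values in 0..num_groups-1

    coeffs = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.sum() < 2:
            coeffs.append((0.0, 1.0))  # too few elements: fall back to the identity fit
            continue
        sg, tg = s[mask], t[mask]
        beta1 = np.mean((sg - sg.mean()) * (tg - tg.mean())) / (np.mean((sg - sg.mean()) ** 2) + eps)
        beta0 = tg.mean() - beta1 * sg.mean()
        coeffs.append((float(beta0), float(beta1)))
    return group_ids.reshape(s_arr.shape), coeffs

# Toy example: half of the elements match the teacher, half are offset by 5.
rng = np.random.default_rng(0)
s = rng.standard_normal(64)
t = s.copy()
t[32:] += 5.0
group_ids, coeffs = place_coefficients(s, t, num_groups=2)
print(coeffs)
```

In this toy case a single global fit would blend the two regimes, whereas the per-group fits recover a separate intercept for each one. Each group's coefficients could then be plugged into the LIFT loss terms for the elements in that group.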

Experiments

Qualitative Results

Unconditional Generation

Conditional Generation

Quantitative Results

Additional results on text-to-image diffusion (SD 2.1) and DiT can be found in the main paper.

Abstract

We demonstrate that in knowledge distillation for diffusion models, the teacher network’s highly complex denoising process, which stems from its substantially larger capacity, is difficult for the student model to mimic faithfully.

To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a “coarse” alignment and a “fine” refinement. The student is first trained on coarse alignment and then proceeds to fine refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance.

Our comprehensive experimental results demonstrate that our method, LIFT with PLACE, outperforms previous knowledge distillation methods for diffusion models based on both U-Net and DiT architectures. Furthermore, under extreme compression with a 1.3M-parameter student, which has only 1.6% of the teacher’s parameters, conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50–200+. In contrast, our method converges stably and achieves an FID of 15.73.