Abstract
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure emerges slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics can benefit texture generation by serving as a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise the semantic and VAE-encoded texture latents synchronously, neglecting this ordering. Motivated by this observation, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet $256\times256$ with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while converging up to 100$\times$ faster than the original DiT without guidance. SFD also improves existing methods such as ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.
Method
We propose Semantic-First Diffusion (SFD), which employs asynchronous denoising to harmonize semantic and texture modeling, achieving faster convergence and superior performance without sacrificing reconstruction fidelity.
1. Composite Latent Construction
To comprehensively leverage high-level semantics from pretrained vision foundation models, we introduce a dedicated Semantic VAE (SemVAE) that compresses rich semantic features into compact latent representations while preserving spatial layout and minimizing information loss. The composite latent is constructed by combining the compressed high-level semantic latent $\mathbf{s}_1$ with the low-level texture latent $\mathbf{z}_1$, encoded by the SemVAE encoder $\mathcal{E}_s$ and the texture VAE encoder $\mathcal{E}_z$, respectively; we adopt SD-VAE as the texture VAE. The two latents are concatenated along the channel dimension, as sketched below.
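A minimal sketch of this construction, assuming both encoders produce latents with matching spatial resolution; the module names and shapes below are placeholders rather than the released implementation:

```python
import torch

def build_composite_latent(sem_features, image, sem_encoder, tex_encoder):
    """Build the composite latent [s_1, z_1] by channel-wise concatenation (sketch).

    sem_encoder : stands in for the SemVAE encoder E_s over pretrained-encoder features
    tex_encoder : stands in for the texture VAE encoder E_z (SD-VAE in the paper)
    """
    s1 = sem_encoder(sem_features)   # compact semantic latent, e.g. (B, C_s, H, W)
    z1 = tex_encoder(image)          # texture latent,          e.g. (B, C_z, H, W)
    assert s1.shape[-2:] == z1.shape[-2:], "spatial layouts must match before concat"
    return torch.cat([s1, z1], dim=1)  # composite latent of shape (B, C_s + C_z, H, W)
```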
2. Semantic-First Diffusion
Distinct Timesteps for Semantics and Textures. During training, SFD assigns a separate timestep to each component. For each image, we sample a semantic timestep \(t_s\) from an extended interval and derive the texture timestep \(t_z\) by subtracting a fixed offset \(\Delta t\); both are then clamped to \([0, 1]\), which ensures \(t_s \ge t_z\). This guarantees that the semantic latent is always less noisy than the texture latent, providing clearer structural guidance throughout denoising. A sampling sketch follows below.
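A minimal sketch of this sampling rule, assuming the convention implied by the velocity targets below (\(t = 1\) is clean data, \(t = 0\) is pure noise); the extended upper bound \(1 + \Delta t\) and the default offset are illustrative choices so that the texture timestep can still reach 1 after the offset:

```python
import torch

def sample_async_timesteps(batch_size, delta_t=0.1):
    """Sample per-image timesteps with semantics leading textures by delta_t (sketch).

    With t = 1 as clean data and t = 0 as pure noise, clamping after the
    subtraction preserves t_s >= t_z, so the semantic latent is never noisier
    than the texture latent. The interval bounds are assumptions for illustration.
    """
    t_s = torch.rand(batch_size) * (1.0 + delta_t)  # extended interval [0, 1 + delta_t]
    t_z = t_s - delta_t                             # textures lag by a fixed offset
    return t_s.clamp(0.0, 1.0), t_z.clamp(0.0, 1.0)
```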
The diffusion model adopts a Transformer backbone \( \mathbf{v}_\theta(\cdot) \) that takes as input the noisy composite latent \([\,\mathbf{s}_{t_s}, \mathbf{z}_{t_z}\,]\) at different noise levels, two separate timesteps \([t_s, t_z]\), and the class label \(y\). It jointly predicts the velocities \([\hat{\mathbf{v}}_s, \hat{\mathbf{v}}_z]\) for semantic and texture components:
$$[\,\hat{\mathbf{v}}_s, \hat{\mathbf{v}}_z\,] = \mathbf{v}_\theta\big([\mathbf{s}_{t_s}, \mathbf{z}_{t_z}],\,[t_s, t_z],\,y\big)$$
The training objective combines velocity prediction losses for both semantic and texture latents:
$$\mathcal{L}_{\mathrm{pred}} = \mathbb{E}_{\mathbf{s}_0, \mathbf{s}_1, \mathbf{z}_0, \mathbf{z}_1, t_s, t_z} \Big[ \big\|\hat{\mathbf{v}}_{z} - (\mathbf{z}_1 - \mathbf{z}_0)\big\|^2 + \beta \big\|\hat{\mathbf{v}}_{s} - (\mathbf{s}_1 - \mathbf{s}_0)\big\|^2 \Big]$$
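Concretely, the prediction loss can be computed as in the sketch below, which assumes the standard rectified-flow interpolation \(\mathbf{x}_t = t\,\mathbf{x}_1 + (1-t)\,\mathbf{x}_0\) with \(\mathbf{x}_0\) Gaussian noise and \(\mathbf{x}_1\) the clean latent; the call signature of `v_theta` and the variable names are placeholders:

```python
import torch
import torch.nn.functional as F

def sfd_prediction_loss(v_theta, s0, s1, z0, z1, t_s, t_z, y, beta=1.0):
    """Asynchronous velocity-prediction loss L_pred (illustrative sketch)."""
    ts = t_s.view(-1, 1, 1, 1)
    tz = t_z.view(-1, 1, 1, 1)
    s_t = ts * s1 + (1.0 - ts) * s0              # semantic latent noised to level t_s
    z_t = tz * z1 + (1.0 - tz) * z0              # texture latent noised to level t_z

    # Joint prediction of semantic and texture velocities from the composite latent.
    v_s_hat, v_z_hat = v_theta(torch.cat([s_t, z_t], dim=1), t_s, t_z, y)

    loss_z = F.mse_loss(v_z_hat, z1 - z0)        # texture velocity target
    loss_s = F.mse_loss(v_s_hat, s1 - s0)        # semantic velocity target, weighted by beta
    return loss_z + beta * loss_s
```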
In addition, we employ the representation alignment loss from REPA, which aligns the diffusion model's hidden states with representations from a pretrained vision encoder. The final objective is:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pred}} + \lambda\,\mathcal{L}_{\mathrm{REPA}}$$
Three-phase Denoising Schedule. During inference, Semantic-First Diffusion (SFD) performs denoising in three asynchronous phases:
Stage I, semantic initialization (\(t_s \in [0, \Delta t),\ t_z = 0\)): only semantic latents are denoised, establishing global structure.
Stage II, asynchronous generation (\(t_s \in [\Delta t, 1],\ t_z \in [0, 1-\Delta t)\)): semantics and textures are denoised jointly but asynchronously, with semantics advancing slightly ahead to provide clearer guidance.
Stage III, texture completion (\(t_s = 1,\ t_z \in [1-\Delta t, 1]\)): only textures continue refining fine-grained details to produce the final image.
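The whole schedule can be viewed as a single progress variable sweeping the extended interval \([0, 1+\Delta t]\), with the two timesteps as clamped copies of it. The sketch below, including the uniform step grid and the simple Euler updates, is an illustrative reading of the three stages rather than the released sampler:

```python
import torch

def sfd_timestep_schedule(num_steps=50, delta_t=0.1):
    """Asynchronous (t_s, t_z) pairs covering the three inference phases (sketch)."""
    tau = torch.linspace(0.0, 1.0 + delta_t, num_steps + 1)  # shared progress variable
    t_s = tau.clamp(0.0, 1.0)              # Stages I/II: rises to 1; Stage III: held at 1
    t_z = (tau - delta_t).clamp(0.0, 1.0)  # Stage I: held at 0; then trails t_s by delta_t
    return t_s, t_z

def sfd_sample(v_theta, s, z, y, num_steps=50, delta_t=0.1):
    """Euler integration of the predicted velocities along the asynchronous schedule."""
    t_s, t_z = sfd_timestep_schedule(num_steps, delta_t)
    batch = s.size(0)
    for i in range(num_steps):
        v_s, v_z = v_theta(torch.cat([s, z], dim=1),
                           t_s[i].expand(batch), t_z[i].expand(batch), y)
        s = s + (t_s[i + 1] - t_s[i]) * v_s  # zero increment once semantics finish (Stage III)
        z = z + (t_z[i + 1] - t_z[i]) * v_z  # zero increment before textures start (Stage I)
    return z  # decode the texture latent with the texture VAE decoder afterwards
```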
Results
We evaluate Semantic-First Diffusion (SFD) against DiT, LightningDiT, and REPA on ImageNet 256×256. Without guidance, SFD achieves consistently lower FID across all model scales while substantially accelerating convergence. For instance, LightningDiT-XL/1 with SFD attains FID 3.53 at only 400K iterations, outperforming LightningDiT-XL/1 with REPA at 4M iterations (FID 5.84) and DiT-XL/2 at 7M iterations (FID 9.62), using merely 10% and 5.7% of their training cost, respectively. Notably, SFD matches DiT-XL trained for 7M iterations and LightningDiT-XL/1 trained for 4M iterations in just 70K and 120K iterations, respectively, corresponding to 100$\times$ and 33.3$\times$ faster convergence. With guidance, SFD sets a new state of the art: FID 1.06 with LightningDiT-XL and FID 1.04 with the 1.0B-parameter LightningDiT-XXL on ImageNet 256×256.
BibTeX
@article{Pan2025SFD,
  title={Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion},
  author={Pan, Yueming and Feng, Ruoyu and Dai, Qi and Wang, Yuqi and Lin, Wenfeng and Guo, Mingyu and Luo, Chong and Zheng, Nanning},
  journal={arXiv preprint arXiv:2512.04926},
  year={2025}
}