ai · 5 min read · Apr 25, 2026

StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes

Researchers build conditional image synthesis into the VAR framework using blended cross-attention, achieving texture transfer while preserving content structure across multiple scales.

Source: arxiv/cs.AI · Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu · open original ↗

StyleVAR applies autoregressive modeling to style transfer by tokenizing images and conditioning generation on style and content signals through blended attention.

  • Images are decomposed into multi-scale tokens by a VQ-VAE, then modeled autoregressively by a transformer.
  • A blended cross-attention mechanism lets the target representation attend to its own history while style and content signals guide emphasis.
  • A scale-dependent blending coefficient balances style texture against content structure at each generation stage.
  • Two-stage training: supervised fine-tuning on triplet datasets, followed by reinforcement learning with a DreamSim-based reward.
  • Outperforms the AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.
  • Handles landscapes and architecture well; struggles with internet images and human faces due to gaps in the content diversity of the training data.
  • The GRPO reinforcement learning stage improves perceptual metrics beyond the supervised baseline.
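The blended cross-attention and scale-dependent coefficient described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes single-head dot-product attention and a linear blending schedule, and all function names (`blended_cross_attention`, `scale_alpha`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def blended_cross_attention(target, style, content, alpha):
    # Target tokens attend to style and content representations separately;
    # alpha in [0, 1] weights the style branch at the current scale.
    style_out = attention(target, style, style)
    content_out = attention(target, content, content)
    return alpha * style_out + (1 - alpha) * content_out

def scale_alpha(k, num_scales):
    # Hypothetical linear schedule: coarse scales emphasize content
    # structure, fine scales emphasize style texture.
    return k / (num_scales - 1)
```

With `alpha = 0` the output reduces to pure content-conditioned attention, and with `alpha = 1` to pure style-conditioned attention; intermediate scales interpolate between the two.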

Astrobobo tool mapping

  • Knowledge Capture: Document the three key innovations (multi-scale tokenization, blended cross-attention, GRPO reward) in a structured note with diagrams of the attention flow and blending coefficient logic.
  • Reading Queue: Queue the cited VQ-VAE, VAR, and DreamSim papers to understand the component architectures and reward model before attempting implementation.
  • Focus Brief: Create a one-page summary of the two-stage training pipeline (SFT → GRPO) with hyperparameter ranges and dataset requirements for your own experiments.

Frequently asked

  • How does StyleVAR differ from AdaIN? StyleVAR uses autoregressive discrete modeling in a learned latent space with blended cross-attention, whereas AdaIN operates in feature space via instance normalization. StyleVAR's multi-scale tokenization and reinforcement learning alignment enable better texture transfer while preserving content structure, and it outperforms AdaIN on perceptual metrics such as LPIPS and DreamSim.
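The reinforcement learning alignment can be sketched via GRPO's core step: computing group-relative advantages over a batch of sampled outputs scored by a reward model (DreamSim, in the paper). This is a hedged sketch of the advantage normalization only, not the full policy update.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO normalizes each sample's reward against the group of outputs
    # drawn for the same input: advantage = (r - mean) / std.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Higher-reward samples in the group receive positive advantages and are reinforced; lower-reward samples are suppressed, without requiring a separate learned value function.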
cite
APA
Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu. (2026, April 25). StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568
MLA
Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu. "StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes." Astrobobo Content Engine, 25 Apr 2026, https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.21052.
BibTeX
@misc{astrobobo_stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568_2026,
  author       = {Liqi Jing and Dingming Zhang and Peinian Li and Lichen Zhu},
  title        = {StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568},
  note         = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.21052},
}
