StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes
Researchers build conditional image synthesis into the VAR framework using blended cross-attention, achieving texture transfer while preserving content structure across multiple scales.
StyleVAR applies autoregressive modeling to style transfer by tokenizing images and conditioning generation on style and content signals through blended attention.
- Images decomposed into multi-scale tokens via VQ-VAE, then modeled autoregressively by a transformer.
- Blended cross-attention mechanism lets the target representation attend to its own history while style and content signals guide emphasis.
- Scale-dependent blending coefficient balances style texture and content structure at each generation stage.
- Two-stage training: supervised fine-tuning on triplet datasets, then reinforcement learning with a DreamSim reward.
- Outperforms the AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.
- Handles landscapes and architecture well; struggles with internet images and human faces due to content diversity gaps.
- GRPO reinforcement stage improves perceptual metrics beyond the supervised baseline.
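The blended cross-attention with a scale-dependent coefficient can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear blending formula, and the `alpha_for_scale` schedule are assumptions; the paper only states that the coefficient balances style texture (coarse scales) against content structure (fine scales).

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over token sequences (rows = tokens)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def blended_cross_attention(target, style, content, alpha):
    """Hypothetical blend: target attends to its own history plus
    style/content streams, weighted by a scale-dependent alpha.

    alpha near 1 emphasizes style texture; near 0, content structure.
    """
    self_out = attention(target, target, target)       # own autoregressive history
    style_out = attention(target, style, style)        # style guidance
    content_out = attention(target, content, content)  # content guidance
    return self_out + alpha * style_out + (1.0 - alpha) * content_out

def alpha_for_scale(scale_idx, num_scales):
    """Assumed schedule: coarser scales (low index) get larger alpha."""
    return 1.0 - scale_idx / max(num_scales - 1, 1)
```

At generation time, each of the multi-scale token stages would call the blend with its own coefficient, so coarse stages lean on style statistics and fine stages on content layout.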
Astrobobo tool mapping
- Knowledge Capture: Document the three key innovations (multi-scale tokenization, blended cross-attention, GRPO reward) in a structured note with diagrams of the attention flow and blending coefficient logic.
- Reading Queue: Queue the cited VQ-VAE, VAR, and DreamSim papers to understand the component architectures and reward model before attempting implementation.
- Focus Brief: Create a one-page summary of the two-stage training pipeline (SFT → GRPO) with hyperparameter ranges and dataset requirements for your own experiments.
Frequently asked
- How does StyleVAR differ from AdaIN? StyleVAR uses autoregressive discrete modeling in a learned latent space with blended cross-attention, whereas AdaIN operates in feature space via instance normalization. StyleVAR's multi-scale tokenization and reinforcement learning alignment enable better texture transfer while preserving content structure, and it outperforms AdaIN on perceptual metrics like LPIPS and DreamSim.
Cite
Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu. (2026, April 25). StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568
Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu. "StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes." Astrobobo Content Engine, 25 Apr 2026, https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.21052.
@misc{astrobobo_stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568_2026,
author = {Jing, Liqi and Zhang, Dingming and Li, Peinian and Zhu, Lichen},
title = {StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/stylevar-autoregressive-style-transfer-via-discrete-latent-codes-73e568},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.21052},
}