ai · 8 min read · Apr 17, 2026

Three-Phase Transformer: Structural Prior for Decoder Efficiency

A residual-stream architecture using cyclic channel partitioning and phase-aligned rotations achieves a 7.2% perplexity reduction with minimal parameter overhead.

Source: arxiv/cs.LG · Mohammad R. Abu Ayyash

Three-Phase Transformer partitions hidden vectors into cyclic channels processed with phase-respecting operations, improving perplexity at negligible parameter cost.

  • Hidden vector split into N equally-sized cyclic channels, each with its own per-channel RMSNorm (minimal sketches follow this list).
  • 2D Givens rotations between the attention and FFN sublayers rotate channel i by θ + i·(2π/N): a learned base angle plus a fixed phase offset.
  • Fixed Gabriel's horn profile injected into a DC subspace, orthogonal to RoPE's relative positioning.
  • At 123M parameters, achieves a 7.2% perplexity reduction over the RoPE baseline with only 1,536 extra parameters.
  • Converges 1.93x faster in steps and 1.64x faster in wall-clock time on WikiText-103.
  • N=3 phase design borrowed from balanced three-phase AC electrical systems.
  • Channel partitioning acts as a self-stabilizing equilibrium without explicit geometric enforcement.
  • Rotation angle drift follows a U-shaped depth profile across the 12 layers.
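
To make the first bullet concrete, here is a minimal PyTorch sketch of cyclic channel partitioning with per-channel RMSNorm. The class name, the eps value, and the per-channel learned gain are assumptions of mine; the summary does not specify the exact normalization details.

```python
import torch
import torch.nn as nn

class PerChannelRMSNorm(nn.Module):
    """Sketch: split the hidden vector into N cyclic channels and
    RMS-normalize each channel independently. The learned per-channel
    gain and eps value are assumed, not confirmed details."""

    def __init__(self, d_model: int, n_phases: int = 3, eps: float = 1e-6):
        super().__init__()
        assert d_model % n_phases == 0, "d_model must split evenly into phases"
        self.n_phases = n_phases
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(n_phases, d_model // n_phases))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (..., d_model) -> (..., N, d_model // N): one row per cyclic channel
        chans = x.view(*x.shape[:-1], self.n_phases, -1)
        # RMS statistic is computed within each channel, not over the full vector
        inv_rms = chans.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (chans * inv_rms * self.gain).reshape_as(x)
```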
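A companion sketch of the phase-aligned rotation between attention and FFN: channel i is rotated by a learned base angle plus the fixed offset i·(2π/N), applied to 2D planes within each channel. Pairing consecutive even/odd dimensions RoPE-style is my assumption; the summary specifies only 2D Givens rotations and the angle schedule.

```python
class PhaseRotation(nn.Module):
    """Sketch of the per-block 2D Givens rotation. One learned scalar angle
    per rotation site keeps parameter overhead tiny; the even/odd pairing
    within each channel is an assumed detail."""

    def __init__(self, d_model: int, n_phases: int = 3):
        super().__init__()
        assert d_model % (2 * n_phases) == 0
        self.n_phases = n_phases
        self.theta = nn.Parameter(torch.zeros(1))  # learned base angle

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chans = x.view(*x.shape[:-1], self.n_phases, -1)
        # Channel i gets angle theta + i * (2*pi/N), as in balanced 3-phase AC
        offsets = torch.arange(self.n_phases, device=x.device) \
            * (2 * torch.pi / self.n_phases)
        angles = self.theta + offsets
        cos, sin = angles.cos().unsqueeze(-1), angles.sin().unsqueeze(-1)
        a, b = chans[..., 0::2], chans[..., 1::2]  # 2D planes within a channel
        rotated = torch.stack((a * cos - b * sin, a * sin + b * cos), dim=-1)
        return rotated.flatten(-2).reshape_as(x)
```

With N=3 the three channels sit 120° apart, which is the direct analogue of the balanced three-phase AC design the next bullet mentions.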

Astrobobo tool mapping

  • Reading Queue: Add this paper to a queue labeled 'Transformer Internals' alongside related work on residual-stream geometry and conservation laws in neural networks.
  • Knowledge Capture: Document the four load-bearing mechanisms (channel partitioning, per-block rotation, per-phase normalization, horn DC injection) as a reusable design pattern for future architecture experiments.
  • Focus Brief: Summarize the N-sweep results and the statistical-indistinguishability claim; flag this as a limitation when evaluating whether to implement 3PT in production models.

Frequently asked

  • What is the Three-Phase Transformer, and how does it differ from a standard Transformer? The Three-Phase Transformer (3PT) partitions the hidden vector into N equally-sized cyclic channels, each processed with phase-respecting operations, including per-channel normalization and 2D Givens rotations. Unlike standard Transformers, it injects a fixed geometric structure (Gabriel's horn profile) into a DC subspace orthogonal to RoPE. This structural prior improves convergence and perplexity with minimal parameter overhead, rather than adding external modules. A hypothetical sketch of the horn injection follows.
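
For readers wondering what the DC injection might look like in practice, here is a hypothetical sketch. The summary only says a fixed Gabriel's horn profile occupies a DC subspace; sampling the horn's 1/x profile over the subspace dimensions, unit-normalizing it, and adding it to a reserved slice of the hidden state is my guess at one concrete realization, and the dimensions shown are illustrative.

```python
import torch

def gabriels_horn_profile(dc_dim: int) -> torch.Tensor:
    """Hypothetical realization of the fixed horn-shaped DC vector:
    Gabriel's horn y = 1/x sampled at x = 1..dc_dim, unit-normalized.
    Fixed (non-learned), so it adds no trainable parameters."""
    x = torch.arange(1, dc_dim + 1, dtype=torch.float32)
    profile = 1.0 / x
    return profile / profile.norm()

# Example: inject the profile into the first dc_dim hidden dimensions,
# a subspace assumed to be left untouched by RoPE's relative rotations.
dc_dim = 16                        # illustrative size, not from the paper
horn = gabriels_horn_profile(dc_dim)
hidden = torch.zeros(4, 128, 768)  # (batch, seq, d_model), illustrative shapes
hidden[..., :dc_dim] += horn       # fixed additive structural prior
```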
Cite this article
APA
Mohammad R. Abu Ayyash. (2026, April 17). Three-Phase Transformer: Structural Prior for Decoder Efficiency. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a
MLA
Mohammad R. Abu Ayyash. "Three-Phase Transformer: Structural Prior for Decoder Efficiency." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14430.
BibTeX
@misc{astrobobo_three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a_2026,
  author       = {Mohammad R. Abu Ayyash},
  title        = {Three-Phase Transformer: Structural Prior for Decoder Efficiency},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a},
  note         = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14430},
}
