Three-Phase Transformer: Structural Prior for Decoder Efficiency
A residual-stream architecture using cyclic channel partitioning and phase-aligned rotations achieves 7% perplexity gains with minimal parameter overhead.
The Three-Phase Transformer (3PT) partitions hidden vectors into cyclic channels processed with phase-respecting operations, yielding perplexity improvements at negligible parameter cost.
- Hidden vector split into N equally sized cyclic channels, each with per-channel RMSNorm.
- 2D Givens rotations between attention and FFN layers rotate channel i by theta + i*(2*pi/N).
- Fixed Gabriel's horn profile injected as a DC subspace, orthogonal to RoPE's relative positioning.
- At 123M parameters, achieves a 7.2% perplexity reduction over a RoPE baseline with only 1,536 extra parameters.
- Converges in 1.93x fewer steps and 1.64x less wall-clock time on WikiText-103.
- N=3 phase design borrowed from balanced three-phase AC electrical systems.
- Channel partitioning acts as a self-stabilizing equilibrium without explicit geometric enforcement.
- Rotation angle drift follows a U-shaped depth profile across the 12 layers.
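The first two mechanisms above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: the function names (`three_phase_block`, `givens_rotate`), the pairing of consecutive coordinates for rotation, and the base angle `theta` are all assumptions for demonstration.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Per-channel RMSNorm: rescale a channel to (approximately) unit RMS.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def givens_rotate(x, theta):
    # Apply a 2D Givens rotation to consecutive coordinate pairs
    # within one channel (hypothetical pairing scheme).
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    pairs = x.reshape(-1, 2)          # (channel_dim / 2, 2)
    return (pairs @ rot.T).reshape(x.shape)

def three_phase_block(h, N=3, theta=0.1):
    # Split the hidden vector into N equal cyclic channels, normalize
    # each, then rotate channel i by the phase-aligned angle
    # theta + i * (2*pi / N).
    d = h.shape[-1]
    assert d % (2 * N) == 0, "hidden dim must split into N even-sized channels"
    out = []
    for i, ch in enumerate(np.split(h, N, axis=-1)):
        ch = rms_norm(ch)
        out.append(givens_rotate(ch, theta + i * 2 * np.pi / N))
    return np.concatenate(out, axis=-1)

h = np.random.default_rng(0).standard_normal(12)
y = three_phase_block(h, N=3)
```

Because a Givens rotation is orthogonal, each normalized channel keeps unit RMS after rotation, which is consistent with the "self-stabilizing equilibrium" point above.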
Astrobobo tool mapping
- Reading Queue: Add this paper to a queue labeled 'Transformer Internals' alongside related work on residual-stream geometry and conservation laws in neural networks.
- Knowledge Capture: Document the four load-bearing mechanisms (channel partitioning, per-block rotation, per-phase normalization, horn DC injection) as a reusable design pattern for future architecture experiments.
- Focus Brief: Summarize the N-sweep results and the statistical-indistinguishability claim; flag this as a limitation when evaluating whether to implement 3PT in production models.
Frequently asked
- What is the Three-Phase Transformer, and how does it differ from a standard Transformer? The Three-Phase Transformer (3PT) partitions the hidden vector into N equally sized cyclic channels, each processed with phase-respecting operations including per-channel normalization and 2D Givens rotations. Unlike standard Transformers, it injects a fixed geometric structure (Gabriel's horn profile) into a DC subspace orthogonal to RoPE. This structural prior improves convergence and perplexity with minimal parameter overhead, rather than adding external modules.
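The horn injection mentioned above can also be sketched. Gabriel's horn is the surface of revolution of 1/x for x >= 1, so a minimal reading is a fixed per-position scalar 1/x added to a reserved coordinate that RoPE does not rotate. The placement of the DC coordinate (`dc_index`) and the `scale` factor are assumptions; the paper's exact subspace construction may differ.

```python
import numpy as np

def horn_profile(seq_len, scale=1.0):
    # Fixed Gabriel's horn profile: value scale / x at position x = 1..seq_len.
    positions = np.arange(1, seq_len + 1, dtype=np.float64)
    return scale / positions

def inject_dc(h, profile, dc_index=0):
    # Add the horn profile to one reserved coordinate per position,
    # leaving the RoPE-rotated coordinates untouched.
    h = h.copy()
    h[:, dc_index] += profile
    return h

seq_len, d = 8, 16
h = np.zeros((seq_len, d))
h = inject_dc(h, horn_profile(seq_len))
# DC coordinate decays as 1/x: 1.0, 0.5, 0.333, ...
```

Because the profile is fixed and confined to its own coordinate, it contributes an absolute positional signal that is orthogonal to RoPE's relative rotations, matching the key-points description.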
Cite
Mohammad R. Abu Ayyash. (2026, April 17). Three-Phase Transformer: Structural Prior for Decoder Efficiency. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a
Mohammad R. Abu Ayyash. "Three-Phase Transformer: Structural Prior for Decoder Efficiency." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14430.
@misc{astrobobo_three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a_2026,
author = {Mohammad R. Abu Ayyash},
title = {Three-Phase Transformer: Structural Prior for Decoder Efficiency},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/three-phase-transformer-structural-prior-for-decoder-efficiency-a3217a},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14430},
}