Rejection-Gated Policy Optimization replaces importance weighting with learned gates
Rejection-Gated Policy Optimization (RGPO) is a reinforcement learning method that selects trustworthy samples through differentiable acceptance gates instead of reweighting every sample, reducing gradient variance and improving RLHF alignment of language models.
- Replaces importance sampling ratios with learned acceptance gates that filter samples during gradient computation.
- Provides a unified framework showing TRPO, PPO, and REINFORCE as special cases of gate function choices.
- Bounds gradient variance even when importance ratios are heavy-tailed, where standard importance sampling fails.
- Achieves higher reward and lower KL divergence than PPO-RLHF in Qwen2.5 fine-tuning experiments.
- Uses dual-ratio gate anchoring to both the previous policy and the reference model for preference alignment.
- Maintains PPO's computational cost without requiring second-order optimization.
- Incurs only bounded, controllable bias while providing an approximate monotonic improvement guarantee.
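The unified view above can be sketched as different gate choices plugged into one gated gradient weight. The gate parameterizations below are illustrative assumptions (the paper's exact forms are not reproduced here): a constant gate recovers on-policy REINFORCE, a hard indicator gate mimics trust-region clipping, and a smooth sigmoid-product gate is a differentiable stand-in for a learned gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_reinforce(ratio):
    # Trivial gate: accept every sample (on-policy REINFORCE).
    return np.ones_like(ratio)

def gate_hard(ratio, eps=0.2):
    # Hard trust-region gate: accept only ratios near 1 (TRPO/PPO-like).
    return ((ratio > 1 - eps) & (ratio < 1 + eps)).astype(float)

def gate_smooth(ratio, k=20.0, width=0.2):
    # Hypothetical differentiable gate: a product of sigmoids on the
    # log-ratio, near 1 inside the trust band and decaying outside;
    # sharpness k and width could in principle be learned.
    z = np.log(ratio)
    return sigmoid(k * (z + width)) * sigmoid(k * (width - z))

def gated_gradient_weight(ratio, adv, gate_fn):
    # Per-sample contribution to the surrogate gradient: the gate
    # scales (or effectively rejects) the importance-weighted advantage,
    # keeping the contribution bounded even for heavy-tailed ratios.
    return gate_fn(ratio) * ratio * adv
```

Note how a huge ratio contributes almost nothing under the smooth gate, whereas plain importance weighting would let it dominate the gradient.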
Astrobobo tool mapping
- Knowledge Capture: Record the core insight: gate functions as learned sample filters. Capture the unified view of TRPO/PPO/REINFORCE as instances of gate design.
- Reading Queue: Queue related papers on variance reduction in RL (e.g., control variates, clipped objectives) to understand how RGPO fits the landscape.
- Focus Brief: Summarize the dual-ratio gate mechanism and its empirical gains (+14.8% reward, -16% KL) for team discussion on RLHF improvements.
Frequently asked
- How does RGPO's gating differ from PPO's ratio clipping? PPO clips importance ratios to a fixed range (e.g., [0.8, 1.2]) uniformly across all samples. RGPO learns a differentiable gate function that varies per sample based on its importance ratio, allowing the optimizer to adaptively decide which samples to trust. The gate participates in gradient computation, whereas PPO's clipping is a static heuristic applied before gradients.
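The dual-ratio anchoring mentioned above can be sketched as a product of two gates, one on the ratio to the previous policy and one on the ratio to the reference model. The sigmoid-product gate and its parameters here are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_gate(ratio, k=20.0, width=0.2):
    # Hypothetical sigmoid-product gate on the log-ratio: close to 1
    # when ratio is near 1, decaying smoothly outside the band.
    z = np.log(ratio)
    return sigmoid(k * (z + width)) * sigmoid(k * (width - z))

def dual_ratio_gate(ratio_prev, ratio_ref):
    # A sample is trusted only if it stays close to BOTH the previous
    # policy (optimization stability) and the reference model
    # (KL anchoring for alignment); the product acts as a conjunction.
    return smooth_gate(ratio_prev) * smooth_gate(ratio_ref)
```

Taking the product means drifting far from either anchor drives the gate toward zero, which is one simple way to keep reward-seeking updates tethered to the reference model.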
Cite
Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li. (2026, April 17). Rejection-Gated Policy Optimization replaces importance weighting with learned gates. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/rejection-gated-policy-optimization-replaces-importance-weighting-with-learned-g-66a4b2
Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li. "Rejection-Gated Policy Optimization replaces importance weighting with learned gates." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/rejection-gated-policy-optimization-replaces-importance-weighting-with-learned-g-66a4b2. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14895.
@misc{astrobobo_rejection-gated-policy-optimization-replaces-importance-weighting-with-learned-g-66a4b2_2026,
author = {Ziwu Sun and Zhen Gao and Jiyong Zhang and Jiaheng Li},
title = {Rejection-Gated Policy Optimization replaces importance weighting with learned gates},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/rejection-gated-policy-optimization-replaces-importance-weighting-with-learned-g-66a4b2},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14895},
}