RL vs. SFT for Mathematical Reasoning in LLMs
GMPO reaches 74.2% on GSM8K versus 76.7% for SFT, showing that RL can come within 2.5 points of SFT without step-by-step supervision.
Compute-controlled comparison of PPO, GRPO, GMPO, RLOO against SFT on Qwen3-8B.
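To illustrate what learning without step-by-step supervision can look like, here is a minimal sketch of how two of the compared methods, GRPO and RLOO, are commonly described as turning one scalar outcome reward per sampled answer into a per-sample advantage. It is not the code used in this comparison: the function names, the 0/1 reward for a correct final answer, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group.

    Assumption: population std and a small eps for numerical stability;
    real implementations may differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean reward
    of the other samples drawn for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical outcome rewards for four sampled answers to one problem:
# 1.0 if the final answer is correct, 0.0 otherwise. No per-step labels are used.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

In both cases only the correctness of the final answer enters the update, which is the sense in which these methods avoid the annotated reasoning traces that SFT relies on.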