2025 Featured

RL vs. SFT for Mathematical Reasoning in LLMs

GMPO reaches 74.2% on GSM8K versus 76.7% for SFT, showing that RL can come within 2.5 points of SFT without step-by-step supervision.

Compute-controlled comparison of PPO, GRPO, GMPO, and RLOO against SFT on Qwen3-8B.
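To illustrate what training without step-by-step supervision can look like, here is a minimal sketch of an outcome-only reward together with the leave-one-out baseline that RLOO is named for. The function names (`final_answer`, `outcome_reward`, `rloo_advantages`) and the "last number in the text" extraction rule are assumptions for illustration, not the project's actual code; the summary above does not specify its reward or advantage implementation.

```python
import re

def final_answer(text):
    """Return the last number found in the text as a float, or None."""
    # Assumption: the final answer is the last numeric token, which also
    # covers GSM8K-style references that end in "#### <number>".
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def outcome_reward(completion, reference):
    """1.0 if the final numeric answers agree, else 0.0.

    Only the end result is scored; intermediate steps are never inspected.
    """
    pred = final_answer(completion)
    gold = final_answer(reference)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0

def rloo_advantages(rewards):
    """Leave-one-out baseline: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

if __name__ == "__main__":
    completions = [
        "16 - 3 - 4 = 9 eggs, 9 * 2 = 18. The answer is 18.",
        "She sells 16 eggs for 32 dollars.",
    ]
    rewards = [outcome_reward(c, "#### 18") for c in completions]
    print(rewards, rloo_advantages(rewards))
```

Under these assumptions the training signal depends only on whether the final number is right, so no annotated reasoning chain is needed. This is the contrast with SFT, which imitates the worked solutions token by token.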

Technologies & Topics

RL, LLM, Math Reasoning, GMPO