Similar Items: AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
- RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
- Verifier-Backed Hard Problem Generation for Mathematical Reasoning
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
- Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
- Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime