Similar Items: Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
- AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
- RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
- Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
- Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
- Do Sparse Autoencoders Capture Concept Manifolds?