Similar Items: RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
- Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
- Exploration Hacking: Can LLMs Learn to Resist RL Training?
- Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters
- Verifier-Backed Hard Problem Generation for Mathematical Reasoning
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring