Similar Items:
- Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
- Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
- Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
- Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
- Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning