Similar Items: Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
- Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning