Similar Items:
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
- Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
- Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems