Similar Items: KL for a KL: On-Policy Distillation with Control Variate Baseline
- Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
- Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs
- Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
- Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
- Exponential families from a single KL identity