Similar Items:
- Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
- OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
- Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
- Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
- To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling