Similar Items: Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
- Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
- Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs