Similar Items: When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
- Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
- Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs