Similar Items: Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
- Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
- Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
- MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge