Channels - Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs :: FRELIP Discovery

Similar Items: Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Quick Look
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Quick Look
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
Quick Look
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Quick Look
Misaligned by Reward: Socially Undesirable Preferences in LLMs
Quick Look
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Quick Look
MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge
Quick Look
SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification
Quick Look
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Quick Look
mdok-style at SemEval-2026 Task 10: Finetuning LLMs for Conspiracy Detection
Quick Look
A multilingual hallucination benchmark: MultiWikiQHalluA
Quick Look
mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection
Quick Look
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
Quick Look
Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs
Quick Look
ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
Quick Look
CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
Quick Look
Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings
Quick Look
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Quick Look
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
Quick Look
Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus
Quick Look
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Quick Look
Beyond Semantics: Measuring Fine-Grained Emotion Preservation in Small Language Model-Based Machine Translation
Quick Look
Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction
Quick Look
Unintended Negative Impacts of Promotional Language in Patent Evaluation
Quick Look
TriBench-Ko: Evaluating LLM Risks in Judicial Workflows