Similar Items: Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
- Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
- MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
- The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
- OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories