Similar Items:
- TriBench-Ko: Evaluating LLM Risks in Judicial Workflows
- FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
- MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
- ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
- CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
- Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
- MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge