Similar Items:
- Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
- FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
- Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
- Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors
- Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
- Why Expert Alignment Is Hard: Evidence from Subjective Evaluation