Similar Items: Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
- Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
- How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
- Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
- Bolek: A Multimodal Language Model for Molecular Reasoning
- Verifier-Backed Hard Problem Generation for Mathematical Reasoning
- Generating Statistical Charts with Validation-Driven LLM Workflows