Similar Items: Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
- From Intent to Execution: Composing Agentic Workflows with Agent Recommendation
- What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
- NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
- Towards Open World Sound Event Detection
- Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents
- AI Co-Mathematician: Accelerating Mathematicians with Agentic AI