Similar Items: What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
- Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
- From Intent to Execution: Composing Agentic Workflows with Agent Recommendation
- NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
- Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents
- AI Co-Mathematician: Accelerating Mathematicians with Agentic AI