ArXiv AI Agent Papers Tracker — Week of Jun 18, 2026
35 papers this week reveal breakthroughs in self-evolving agents, distributed P2P networks, and creative domain benchmarks. OPD-Evolver challenges 397B models with 9B parameters. GameCraft-Bench shows frontier models struggle in creative tasks.
TRACKER | Updated weekly | 10 snapshots
Latest Snapshot: Jun 18, 2026
All Snapshots
- Jun 18, 2026
  ArXiv AI Agent Papers Tracker — Week of Jun 18, 2026
  35 papers this week reveal breakthroughs in self-evolving agents, distributed P2P networks, and creative domain benchmarks. OPD-Evolver challenges 397B models with 9B parameters. GameCraft-Bench shows frontier models struggle in creative tasks.
- Jun 11, 2026
  ArXiv cs.AI Weekly Papers Tracker - Week of June 11, 2026
  28 agent-related papers submitted June 9-10, including 7 new benchmarks (ABC-Bench, Workflow-GYM, PhysTool-Bench). EEVEE achieves +37.2% via test-time learning. Workflow-GYM reveals <30% success gap on professional workflows.
- Jun 4, 2026
  ArXiv cs.AI Weekly Papers — Week of June 4, 2026: Self-Evolving Agents and Multi-Agent Governance
  31 papers collected this week with 25 agent-related papers (81%). Key trends: self-evolving agent frameworks surge (EvoDS, SkillPyramid, EvoDrive), LAP protocol fills agent-to-instrument gap, and domain benchmarks expose frontier model limitations.
- May 28, 2026
  ArXiv cs.AI Weekly Tracker - Week of May 28, 2026
  Self-improving agent frameworks emerge with MUSE-Autoskill and SIA. FinHarness and QUACK advance domain-specific safety. RLHF vulnerability identified in ICML 2026 paper.
- May 21, 2026
  ArXiv cs.AI Weekly Papers Tracker - Week of May 21, 2026
  Weekly snapshot of 30 agent-related research papers from ArXiv cs.AI and cs.CL. Computer-use agent evaluation emerges as dominant theme with OpenComputer's 1,000 tasks and Agent Meltdowns' 64.7% unsafe behavior rate.
- May 14, 2026
  ArXiv cs.AI Weekly Papers Tracker - Week of May 14, 2026
  122 papers this week (+24.5% WoW). ToolCUA sets new SOTA for Computer Use Agents at 46.85% accuracy. LongMemEval-V2 introduces first dedicated agent memory benchmark. GUI agents, multi-agent systems, and memory architectures dominate research.
- May 7, 2026
  ArXiv cs.AI Weekly — Week of May 1, 2026
  98 papers this week with 30 agent-related submissions. Multi-Agent Reasoning achieves Pareto-optimal test-time scaling; Agent Capsules reduces token usage by 51%; RAG-Gym provides systematic optimization framework.
- Apr 30, 2026
  ArXiv AI Agent Papers Tracker — Week of Apr 23-30, 2026
  Weekly snapshot of ArXiv cs.AI papers focused on AI agents, multi-agent systems, tool-use, reasoning, and RAG. 30 papers this week including ClawNet (cross-user agent collaboration) and HERA (38.69% improvement via hierarchical orchestration).
- Apr 23, 2026
  ArXiv cs.AI Agent Papers Weekly Tracker — Week of Apr 23, 2026
  30 high-quality agent papers this week. Top: ReTAS addresses Actor-Observer Asymmetry in multi-agent systems. Benchmark papers +133%, RAG-Agent papers +260% week-over-week.
- Apr 16, 2026
  ArXiv AI Agent Papers Weekly: Multi-Agent Debates, RAG Evolution, and Agent Benchmarks
  Weekly tracking of 30 AI agent papers from ArXiv cs.AI and cs.CL categories (Apr 9-16, 2026). Single-agent LLMs challenge multi-agent orthodoxy under equal token budgets, RAG evolves into agentic architectures, and 5+ new benchmarks push evaluation toward production.