ArXiv AI Agent Papers Tracker — Week of Jun 18, 2026
35 papers this week reveal breakthroughs in self-evolving agents, distributed P2P networks, and creative domain benchmarks. OPD-Evolver challenges 397B models with 9B parameters. GameCraft-Bench shows frontier models struggle in creative tasks.
Data Overview
- Snapshot Week: 2026-06-11 to 2026-06-18
- Tracker: ArXiv AI Agent Papers Tracker (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
- Update Frequency: Weekly
- Primary Sources: ArXiv cs.AI RSS, ArXiv cs.CL RSS, HuggingFace Daily Papers
Key Facts
- Who: 35 papers total, 28 agent-related (80%), 6 multi-agent systems, 3 self-evolving agents
- What: 7 new benchmarks introduced; average trend score for agent papers reaches 8.1 (up from 7.4 last week)
- When: Week of June 18, 2026
- Impact: OPD-Evolver, GameCraft-Bench, and Distributed Agent Networks emerge as top-scoring papers (trend score 10/10)
Methodology
This tracker monitors the ArXiv cs.AI and cs.CL RSS feeds weekly and filters for agent-related research. Each paper receives a composite trend score (1-10) based on novelty, citation potential, benchmark contributions, and community engagement (HuggingFace likes). Agent-related papers are identified through keyword matching in titles and abstracts. Data is collected via the Jina Reader API; direct ArXiv API access remains blocked.
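The filtering and scoring steps described above could be sketched roughly as follows. This is a minimal illustration, not the tracker's actual pipeline: the keyword list, the `Paper` type, the 1-10 scale for each sub-score, and the equal weighting are all assumptions, since the methodology does not publish them.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the tracker does not publish its actual terms.
AGENT_KEYWORDS = ("agent", "agentic", "multi-agent")


@dataclass
class Paper:
    title: str
    abstract: str


def is_agent_related(paper: Paper, keywords=AGENT_KEYWORDS) -> bool:
    """Case-insensitive keyword match over title and abstract."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(keyword in text for keyword in keywords)


def trend_score(novelty: float, citation_potential: float,
                benchmark_contribution: float, engagement: float) -> int:
    """Combine four sub-scores, each assumed to be on a 1-10 scale.

    Equal weighting is an assumption made for illustration only.
    """
    mean = (novelty + citation_potential + benchmark_contribution + engagement) / 4
    # Round to the nearest integer and clamp to the tracker's 1-10 range.
    return max(1, min(10, round(mean)))
```

Plain substring matching is the simplest reading of "keyword matching", but it also matches words such as "reagent"; a word-boundary regular expression would be stricter.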
This Week’s Metrics
| Metric | This Week | Last Week | Δ |
|---|---|---|---|
| Total papers | 35 | 31 | +4 |
| Agent-related | 28 | 28 | 0 |
| Agent percentage | 80% | 90% | -10pp |
| New benchmarks | 7 | 7 | 0 |
| Avg trend score (agent) | 8.1 | 7.4 | +0.7 |
| Multi-agent papers | 6 | 4 | +2 |
| Self-evolving agents | 3 | 2 | +1 |
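The agent percentage and its change in percentage points follow directly from the paper counts in the table. A small sketch of that arithmetic, with a hypothetical helper name `agent_share`:

```python
def agent_share(agent_papers: int, total_papers: int) -> float:
    """Share of agent-related papers, as a percentage of all tracked papers."""
    return 100.0 * agent_papers / total_papers


this_week = agent_share(28, 35)    # 28 of 35 papers
last_week = agent_share(28, 31)    # 28 of 31 papers
delta_pp = this_week - last_week   # change in percentage points

print(f"{this_week:.1f}% vs {last_week:.1f}% ({delta_pp:+.1f}pp)")
```

The table reports these values rounded to whole numbers, which is why last week appears as 90% and the change as -10pp.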
Top Papers This Week
| Title | ArXiv ID | Trend Score | Key Topics |
|---|---|---|---|
| OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation | 2606.17628 | 10 | agent evolution, self-evolving agents, memory hierarchy |
| Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes | 2606.17368 | 10 | distributed agents, P2P networks, multi-agent systems |
| GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? | 2606.17861 | 10 | game generation agents, coding benchmarks, creative agents |
| Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search | 2606.17209 | 9 | agentic search, multi-hop reasoning, query diversification |
| When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval | 2606.17220 | 9 | self-evolving agents, legal AI, rule evolution |
| From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning | 2606.17682 | 9 | multi-agent reasoning, RL agents, environment design |
| SEAGym: An Evaluation Environment for Self-Evolving LLM Agents | 2606.17546 | 9 | self-evolving agents, agent evaluation, evolution tracking |
| EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent | 2606.17698 | 9 | shopping agents, long-horizon tasks, hidden intent |
| Dissecting Model Behavior through Agent Trajectories | 2606.17454 | 9 | trajectory analysis, agent behavior, harness design |
Notable Benchmarks This Week
| Benchmark | ArXiv ID | Domain | Key Insight |
|---|---|---|---|
| GameCraft-Bench | 2606.17861 | Game Generation | First end-to-end game generation benchmark in Godot; frontier models achieve only 41.46% success |
| EComAgentBench | 2606.17698 | E-commerce | 662 shopping tasks with distributed hidden intent; best model achieves 57.1% accuracy |
| SEAGym | 2606.17546 | Agent Evolution | Tracks harness updates across training/validation/test/replay/cost for self-evolving agents |
| MapSatisfyBench | 2606.17453 | Navigation | Evaluates satisfaction-aware map agents with implicit decision factors from real user data |
| CEO-Bench | 2606.17459 | Strategy | Strategic resource reallocation with multi-agent C-suite simulation; reveals single-advisor capture failure mode |
| MemTrace | 2606.17328 | Memory | Long-term memory benchmark revealing that an evidence-use bottleneck dominates failures |
| LongWebBench | 2606.17727 | Web Generation | 490 structural + 507 functional tasks for long-horizon webpage generation |
Trending Topics
| Topic | Paper Count | Avg Trend Score | Notable Papers |
|---|---|---|---|
| Self-evolving agents | 3 | 9.3 | OPD-Evolver, When Rules Learn, SEAGym |
| Distributed agents | 1 | 10.0 | Distributed General-Purpose Agent Networks |
| Multi-agent systems | 6 | 8.2 | CEO-Bench, Trainee to Trainer, Parasocial Scripts |
| Agent benchmarks | 7 | 7.9 | GameCraft-Bench, EComAgentBench, SEAGym |
| Agent memory | 4 | 7.5 | MemSlides, FinAcumen, MemTrace |
| Agentic search | 1 | 9.0 | DivInit (Beyond Parallel Sampling) |
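The per-topic averages can be checked against the Top Papers table for the topics whose notable papers account for the full paper count. The sketch below assumes the average is a plain arithmetic mean rounded to one decimal; the short paper names used as keys are labels chosen for this example.

```python
# Trend scores copied from the Top Papers table.
scores = {
    "OPD-Evolver": 10,
    "When Rules Learn": 9,
    "SEAGym": 9,
    "Distributed General-Purpose Agent Networks": 10,
}

# Topic membership as listed in the Trending Topics table.
topics = {
    "Self-evolving agents": ["OPD-Evolver", "When Rules Learn", "SEAGym"],
    "Distributed agents": ["Distributed General-Purpose Agent Networks"],
}


def topic_average(papers: list[str]) -> float:
    """Arithmetic mean of the papers' trend scores, rounded to one decimal."""
    return round(sum(scores[p] for p in papers) / len(papers), 1)


averages = {topic: topic_average(papers) for topic, papers in topics.items()}
```

Under that assumption, the results agree with the 9.3 and 10.0 reported for these two topics.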
🔺 Scout Intel: What Others Missed
Confidence: high | Novelty Score: 62/100
While individual papers receive attention on HuggingFace, the collective signal across this week’s 35 papers reveals three structural shifts that most coverage misses:
1. Self-evolving agents are closing the parameter gap. OPD-Evolver’s 9B-parameter model surpasses ReasoningBank by 11.5% and Skill0 by 5.8%, putting it in competition with 397B frontier models. This is more than an incremental improvement: it suggests that structured memory hierarchies (four-level in OPD-Evolver) can partly substitute for raw scale, and that architecture can matter more than parameter count on agent evolution tasks.
2. Creative domain benchmarks expose frontier model limitations. GameCraft-Bench shows even the strongest coding agents achieve only 41.46% success on end-to-end game generation. EComAgentBench’s best model hits 57.1% on shopping tasks with scattered requirements. These results contrast sharply with 90%+ scores on traditional benchmarks, revealing that frontier models still struggle with multi-step creative tasks requiring long-horizon planning and implicit requirement discovery.
3. Distributed P2P agent networks emerge as an architectural alternative. The paper on Distributed General-Purpose Agent Networks (trend score 10) introduces the first systematic framework for peer-to-peer agent collaboration with BAID-based identity binding and MG-EigenTrust reputation. This shifts the paradigm from centrally orchestrated frameworks (LangChain, CrewAI) to decentralized agent networks, a direction that the major orchestration frameworks do not yet address.
Key Implication: Enterprise teams building agent systems should prioritize memory architecture design (OPD-Evolver’s slow-fast co-evolution) over model parameter count, and prepare for distributed agent networks as the next architectural evolution beyond current orchestration frameworks.
Trends & Observations
- Self-evolving frameworks surge: Three papers this week focus on self-evolving agents with explicit memory hierarchies, up from two last week. The +11.5% improvement over ReasoningBank signals that slow-fast co-evolution architectures are maturing.
- Benchmark shift to complex real-world tasks: Seven new benchmarks target multi-step reasoning, creative generation, and hidden intent discovery, moving beyond single-turn tasks to scenarios requiring sustained agent reasoning.
- Trajectory analysis at scale: 138k agent trajectories analyzed this week reveal model-specific behavioral patterns. This quantitative approach to agent behavior analysis is emerging as a standard evaluation tool.
- Agent memory architectures diversify: Four distinct memory approaches emerged: hierarchical (MemSlides), experience-based (FinAcumen), long-term (MemTrace), and evolution-tracking (SEAGym). There is no consensus architecture yet; the field is exploring multiple design points.
- Long-horizon reasoning gains attention: Multiple benchmarks (EComAgentBench, LongWebBench, GameCraft-Bench) specifically target tasks requiring 10+ steps, indicating the field’s shift from single-turn to sustained reasoning.
Week-over-Week Summary
| Metric | This Week | Last Week | Δ |
|---|---|---|---|
| Papers tracked | 35 | 31 | +4 |
| Agent-related papers | 28 | 28 | 0 |
| Agent percentage | 80% | 90% | -10pp |
| Avg trend score (agent) | 8.1 | 7.4 | +0.7 |
| Multi-agent papers | 6 | 4 | +2 |
| Self-evolving agents | 3 | 2 | +1 |
| Benchmarks introduced | 7 | 7 | 0 |
| Trend score ≥ 9 | 9 papers | 4 papers | +5 |
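Notable change: The average trend score for agent papers rose 0.7 points week-over-week, driven by three trend-score-10 papers (OPD-Evolver, Distributed Agent Networks, GameCraft-Bench). Because the trend score is a composite of novelty, citation potential, benchmark contributions, and community engagement, the rise points to a higher concentration of high-scoring agent papers rather than a direct measure of research quality.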
Full Paper List
| Title | Authors | Category | Published | Score | ArXiv | HF |
|---|---|---|---|---|---|---|
| OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation | NUS Research Team | cs.AI | 2026-06-17 | 10 | 2606.17628 | Link |
| Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes | Multiple authors | cs.AI | 2026-06-17 | 10 | 2606.17368 | — |
| GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? | CUHKSZ | cs.AI | 2026-06-17 | 10 | 2606.17861 | Link |
| Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search | CMU Research Team | cs.AI | 2026-06-17 | 9 | 2606.17209 | — |
| When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval | Multiple authors | cs.AI | 2026-06-17 | 9 | 2606.17220 | — |
| From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning | Multiple authors | cs.AI | 2026-06-17 | 9 | 2606.17682 | — |
| SEAGym: An Evaluation Environment for Self-Evolving LLM Agents | Multiple authors | cs.AI | 2026-06-17 | 9 | 2606.17546 | — |
| EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent | Multiple authors | cs.AI | 2026-06-17 | 9 | 2606.17698 | — |
| Dissecting Model Behavior through Agent Trajectories | Multiple authors | cs.AI | 2026-06-17 | 9 | 2606.17454 | — |
| Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17519 | — |
| Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17459 | — |
| Environment-Grounded Automated Prompt Optimization for LLM Game Agents | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17838 | — |
| MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation | Ye Jin, Yangyang Xu, Jun Zhu, Yibo Yang | cs.CL | 2026-06-17 | 8 | 2606.17162 | — |
| MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17453 | — |
| Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17591 | — |
| StepGuard: Guarding Web Navigation via Single-Step Calibration | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17871 | — |
| FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17642 | — |
| Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns | Multiple authors | cs.AI | 2026-06-17 | 8 | 2606.17645 | — |
| Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17577 | — |
| DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17821 | — |
| LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17507 | — |
| AIPatient Arena: EHR-grounded evaluation of LLMs in clinical consultation workflows | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17474 | — |
| From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities | Mohammadsadegh Abolhasani et al. | cs.CL | 2026-06-17 | 7 | 2606.17174 | — |
| LecturaAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning | Multiple authors | cs.CL | 2026-06-15 | 7 | 2606.16428 | Link |
| DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17574 | — |
| FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17856 | — |
| MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation | Multiple authors | cs.CL | 2026-06-17 | 7 | 2606.17449 | — |
| Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17637 | — |
| LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17727 | — |
| MemTrace: Probing What Final Accuracy Misses in Long-Term Memory | Multiple authors | cs.AI | 2026-06-17 | 7 | 2606.17328 | — |
| PromptMN: Pseudo Prompting Language | Enkhzol Dovdon | cs.CL | 2026-06-17 | 6 | 2606.17164 | — |
| LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling | 19 authors | cs.AI | 2026-06-17 | 6 | 2606.18023 | Link |
| Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients | NVIDIA | cs.AI | 2026-06-17 | 6 | 2606.18216 | Link |
| ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining | CUHK | cs.AI | 2026-06-17 | 6 | 2606.17200 | Link |
Sources
- ArXiv cs.AI RSS Feed — ArXiv, 2026-06-18
- ArXiv cs.CL RSS Feed — ArXiv, 2026-06-18
- HuggingFace Daily Papers — HuggingFace, 2026-06-17