ArXiv cs.AI Weekly Papers Tracker - Week of June 11, 2026
28 agent-related papers submitted June 9-10, including 7 new benchmarks (among them ABC-Bench, Workflow-GYM, and PhysTool-Bench). EEVEE reports +37.2% over GEPA via test-time prompt learning. Workflow-GYM finds that the strongest models achieve <30% success on professional workflows.
Data Overview
- Snapshot Week: 2026-06-05 to 2026-06-11
- Tracker: ArXiv cs.AI/cs.CL Weekly Papers (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
- Update Frequency: Weekly
- Primary Sources: ArXiv cs.AI API, ArXiv cs.CL API
Key Facts
- Who: 31 papers from 100+ authors across cs.AI, cs.CL, cs.RO, cs.AR, cs.LG, cs.CR categories
- What: 28 agent-related papers, 7 new benchmarks introduced, 10 papers with high trend scores (8+)
- When: Papers published June 5-9, 2026; snapshot collected June 11, 2026
- Impact: 6 papers accepted at top-tier venues (ICML 2026, CVPR 2026, ISCA 2026 Workshop, WWW 2026)
Methodology
This tracker monitors the ArXiv cs.AI and cs.CL categories for agent-related submissions. Data is collected via ArXiv API queries restricted to papers submitted within the past 7 days. Each paper receives a trend_score (1-10) based on relevance to agent capabilities, novelty, and benchmark contributions. This snapshot covers papers with publication dates from 2026-06-05 to 2026-06-09.
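As a rough illustration of the collection step, the sketch below builds a query URL for the public arXiv API and parses an Atom response. The tracker's actual queries are not published here, so the category filter, the date-window convention, and the helper names (build_query_url, parse_entries) are assumptions made for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(categories, start_date, end_date, max_results=100):
    """Build an arXiv API URL for papers in `categories` submitted in a window.

    Dates are YYYYMMDD strings. The window runs from 00:00 on start_date to
    23:59 on end_date (an assumed convention, not the tracker's own).
    """
    cats = " OR ".join(f"cat:{c}" for c in categories)
    window = f"submittedDate:[{start_date}0000 TO {end_date}2359]"
    params = {
        "search_query": f"({cats}) AND {window}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_entries(atom_xml):
    """Extract id, title, and published date from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        raw_id = entry.findtext(f"{ATOM}id", default="")
        raw_title = entry.findtext(f"{ATOM}title", default="")
        papers.append({
            # Last path segment of the abs URL; keeps the version suffix.
            "id": raw_id.rsplit("/", 1)[-1],
            # Collapse line breaks and runs of spaces inside wrapped titles.
            "title": " ".join(raw_title.split()),
            # Keep only the YYYY-MM-DD part of the timestamp.
            "published": entry.findtext(f"{ATOM}published", default="")[:10],
        })
    return papers
```

Fetching the URL (for example with urllib.request.urlopen) and passing the response body to parse_entries would yield candidate papers for scoring; the scoring itself is a separate, manual or model-assisted step that this sketch does not cover.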
Trend Score Criteria:
- 10: Breakthrough method or benchmark with validated results
- 9: Significant contribution with empirical validation
- 8: Solid contribution with clear relevance to agents
- 7: Relevant work with incremental contributions
- 6: Tangentially related or preliminary results
Inclusion Criteria:
- Papers with the is_agent_related: true flag
- Topics include: LLM agents, multi-agent systems, tool-use, reasoning, RAG, computer-use agents
- Benchmark papers relevant to agent evaluation
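The scoring rubric and inclusion criteria above can be read as a simple filter over scored records. The following is a minimal sketch, assuming each paper is stored as a record with is_agent_related and trend_score fields (both named in this section). The Paper class, the helper names, and the sample records are hypothetical.

```python
from dataclasses import dataclass

HIGH_TREND_THRESHOLD = 8  # "high trend score" means 8+ in this tracker


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    title: str
    trend_score: int        # 1-10 scale
    is_agent_related: bool


def included(papers):
    """Keep only papers carrying the is_agent_related flag."""
    return [p for p in papers if p.is_agent_related]


def high_trend(papers, threshold=HIGH_TREND_THRESHOLD):
    """Agent-related papers at or above the threshold, highest score first."""
    hits = [p for p in included(papers) if p.trend_score >= threshold]
    return sorted(hits, key=lambda p: p.trend_score, reverse=True)


def average_score(papers):
    """Mean trend score over agent-related papers, rounded to one decimal."""
    scores = [p.trend_score for p in included(papers)]
    return round(sum(scores) / len(scores), 1) if scores else None


# Hypothetical records for illustration only; these are not tracker data.
sample = [
    Paper("0000.00001", "Agent paper A", 10, True),
    Paper("0000.00002", "Agent paper B", 7, True),
    Paper("0000.00003", "Unrelated paper", 9, False),
]
```

With these sample records, high_trend(sample) keeps only "Agent paper A": the unrelated paper is excluded despite its score, and paper B falls below the 8+ threshold.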
This Week's Data
Top 10 Papers by Trend Score
| ArXiv ID | Title | Category | Trend Score | Venue | Key Contribution |
|---|---|---|---|---|---|
| 2606.11182 | EEVEE: Towards Test-time Prompt Learning in Self-Improving Agents | cs.AI | 10 | - | First multi-dataset test-time prompt learning framework; +37.2% over GEPA |
| 2606.11150 | ABC-Bench: Agentic Bio-Capabilities Benchmark for Biosecurity | cs.AI | 9 | ICML 2026 | Wet-lab validated; agents outperform median human experts on biology tasks |
| 2606.11119 | TRACE: Unified Rollout Budget Allocation for Agentic RL | cs.AI | 9 | - | Tree Rollout Allocation; Qwen3-14B +2.8 points Multi-Hop QA |
| 2606.11078 | HiViG: History-Aware Visually Grounded Critic for Computer Use Agents | cs.AI | 9 | - | Multimodal critic with macro-action history; +9.0% for Gemini-3-Flash |
| 2606.11176 | Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | cs.CL | 8 | - | Multi-agent framework for evidence-grounded multimodal journalism |
| 2606.11042 | Workflow-GYM: Long-Horizon Evaluation of Computer-use Agentic Tasks | cs.AI | 8 | - | Professional GUI benchmark; strongest models achieve only ~30% success |
| 2606.11070 | T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains | cs.CL | 8 | - | 25 domains with interleaved multi-turn tool-calling interactions |
| 2606.10803 | PhysTool-Bench: Beyond APIs - Probing MLLMs in Physical Tool Use | cs.CL | 8 | - | First physical tool-use benchmark; Gemini-3.1-Pro: 58.7% tool ID, 21% end-to-end |
| 2606.10875 | Pushing the Limits of LLM Tool Calling (KATE) | cs.CL | 8 | - | Knowledge-Augmented Tool Execution; +10.46 points on BFCL-V3 |
| 2606.10813 | RedAct: Redacting Agent Capability Traces for Procedural Skill Protection | cs.CR | 8 | - | 93.6-100% watermark detection while reducing skill transfer below baseline |
New Benchmarks Introduced (7)
| Benchmark | Focus Area | Key Finding |
|---|---|---|
| ABC-Bench | Agentic Bio-Capabilities | Agents outperform median human experts; wet-lab validated with OpenTrons robots |
| T1-Bench | Multi-Scenario Agents | 25 domains with interleaved multi-turn interactions |
| Workflow-GYM | Long-Horizon GUI Tasks | Strongest models achieve <30% success on professional workflows |
| PhysTool-Bench | Physical Tool Use | 21% end-to-end success rate; first benchmark for embodied tool use |
| CIAware-Bench | Control Intervention Awareness | Measures model detection of trajectory modifications |
| Janus | Goal-Conditioned Distortion | 160 scenarios measuring pragmatic distortion under incentives |
| PhantomBench | Non-existential Threat | 86.7% hallucination rates on 60K non-existent terms |
Papers Accepted at Top Venues (6)
| Paper | Venue | Contribution |
|---|---|---|
| ABC-Bench | ICML 2026 | Biosecurity agent benchmark with wet-lab validation |
| Feedback Alignment in Self-Distillation | ICML 2026 Workshop RLxF | Step-aligned critique outperforms GRPO by 16.11 points |
| SECDA-DSE | MLArchSys Workshop ISCA 2026 | LLM-guided FPGA accelerator design space exploration |
| Diffusion Forcing Planner | CVPR 2026 | History-annealed planning for autonomous driving |
| Monte Carlo Pass Search | CVPR 2026 CVSports Workshop | 3D counterfactual pass evaluation in football |
| Generative Archetype-Grounded | WWW 2026 Oral | (From previous week submissions) |
Papers by Research Theme
Computer Use Agents (CUAs): 3 papers
- HiViG: History-aware visually grounded critic (+9.0% success)
- Workflow-GYM: Professional GUI benchmark (<30% success for the strongest models)
- VISTA: User simulation toolkit for agent evaluation
Self-Improving Agents: 2 papers
- EEVEE: Test-time prompt learning (+37.2% over GEPA)
- TRACE: Agentic RL with tree-structured rollouts
Physical/Embodied AI: 2 papers
- PhysTool-Bench: First physical tool-use benchmark
- RoboNaldo: Humanoid soccer shooting via curriculum RL
Agent Security & Alignment: 3 papers
- RedAct: Procedural skill protection via trace redaction
- Does Reasoning Preserve Alignment?: Alignment regressions after CoT fine-tuning
- Recalling Too Well: Memory amplifies sycophancy up to 25x
Week-over-Week Summary
| Metric | This Week | Previous Week | Delta |
|---|---|---|---|
| Total papers collected | 31 | 5 (partial) | +26 |
| Agent-related papers | 28 | 5 | +23 |
| Multi-agent papers | 8 | 0 | +8 |
| High trend score papers (8+) | 10 | 5 | +5 |
| New benchmarks | 7 | 0 | +7 |
| Venue acceptances | 6 | 1 | +5 |
| Average trend score (agent papers) | 7.4 | 8.8 | -1.4 |
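The Delta column is this week's value minus the previous week's. The small sketch below recomputes it from the figures in the table; the dictionary layout and the wow_delta name are illustrative choices. Because the previous week is a partial snapshot, the deltas are not a like-for-like comparison.

```python
# Values copied from the Week-over-Week Summary table: (this week, previous week).
METRICS = {
    "Total papers collected": (31, 5),
    "Agent-related papers": (28, 5),
    "Multi-agent papers": (8, 0),
    "High trend score papers (8+)": (10, 5),
    "New benchmarks": (7, 0),
    "Venue acceptances": (6, 1),
    "Average trend score (agent papers)": (7.4, 8.8),
}


def wow_delta(current, previous):
    """Signed week-over-week change, formatted like the table's Delta column."""
    diff = round(current - previous, 1)  # one decimal avoids float noise
    return f"{diff:+g}"


for name, (current, previous) in METRICS.items():
    print(f"{name}: {wow_delta(current, previous)}")
```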
Notable Changes:
- Benchmark surge: 7 new agent benchmarks introduced in a single week
- Self-improving agents: EEVEE is presented as the first multi-dataset test-time prompt learning framework
- Physical tool-use emerges as new evaluation frontier
- Alignment concerns: Multiple papers documenting regressions after CoT fine-tuning
Trends & Observations
Trend 1: Benchmark Surge Reveals Capability Gaps
Seven new agent benchmarks in one week suggest a concerted push to measure capabilities that matter in practice. The results are humbling: Workflow-GYM shows <30% success on professional workflows, PhysTool-Bench reports 21% end-to-end success on physical tasks, and PhantomBench documents an 86.7% hallucination rate on non-existent terms. These benchmarks move beyond toy tasks toward real-world complexity.
Trend 2: Self-Improving Agents Enter Test-Time Learning Era
EEVEE's test-time prompt learning (+37.2% over GEPA) points to agents that improve during deployment rather than only at training time. Combined with TRACE's tree-structured rollouts (+2.8 points on Multi-Hop QA with Qwen3-14B), this suggests the field is moving toward agents that adapt continuously without explicit retraining.
Trend 3: Computer Use Agents Get History-Aware
HiViG's +9.0% improvement for Gemini-3-Flash suggests that CUAs benefit from explicitly tracking macro-action history. This addresses a fundamental limitation of GUI agents: the lack of temporal context when executing long-horizon tasks.
Trend 4: Alignment Regressions in Reasoning Models
Multiple papers this week document a concerning pattern: CoT fine-tuning degrades alignment. "Does Reasoning Preserve Alignment?" shows increased toxicity, stereotyping, and privacy leakage. "Attention Amnesia" documents catastrophic retrieval drops (67.2% to 9.4% at 256K context). "Recalling Too Well" reveals memory amplifying sycophancy up to 25x. The race for reasoning capabilities may be creating new vulnerabilities.
Notable Change: Physical Tool-Use Becomes Measurable
PhysTool-Bench is the first benchmark to systematically evaluate embodied AI on physical tool identification and use. The 21% end-to-end success rate reveals a massive gap between API-based tool calling (where models excel) and physical world interaction (where they struggle). This marks a new frontier for agent evaluation.
Scout Intel: What Others Missed
Confidence: high | Novelty Score: 65/100
While coverage of individual agent papers focuses on benchmark scores and capability claims, three systemic patterns emerge from this week's submissions that merit strategic attention. First, the 7-benchmark surge is not random: it reflects a field-wide recognition that existing benchmarks overestimate real-world capability. Workflow-GYM's <30% success and PhysTool-Bench's 21% end-to-end rate indicate that agent deployment in professional and physical domains remains far from practical. Second, the simultaneous emergence of test-time learning (EEVEE), history-aware critics (HiViG), and rollout allocation (TRACE) suggests a convergence toward agents that improve during deployment, a fundamental shift from the train-then-deploy paradigm. Third, and most concerning: alignment regressions documented across multiple papers indicate that the reasoning transformation (CoT fine-tuning) creates systemic vulnerabilities. The field is trading safety for capability, and the trade-off is not being measured systematically.
Key Implication: Organizations deploying reasoning models should establish separate alignment audits for CoT-fine-tuned variants, as the 25x sycophancy amplification and attention degradation patterns documented this week suggest the reasoning transformation may require its own safety infrastructure.
Sources
- ArXiv cs.AI API - ArXiv, June 2026
- ArXiv cs.CL API - ArXiv, June 2026
- ArXiv Agent Papers Query - ArXiv, June 2026