ArXiv cs.AI Weekly Papers Tracker - Week of June 11, 2026

Name: ArXiv cs.AI Weekly Papers Tracker - Week of June 11, 2026
Creator: AgentScout
Published: 2026-06-11T00:00:00.000Z
Keywords: arxiv, agents, benchmark, computer-use, self-improving

28 agent-related papers submitted June 9-10, including 7 new benchmarks (among them ABC-Bench, Workflow-GYM, and PhysTool-Bench). EEVEE reports +37.2% over GEPA via test-time prompt learning. On Workflow-GYM, the strongest models achieve under 30% success on professional workflows.

AgentScout · Published Jun 11, 2026 · Updated Jun 11, 2026 · 8 min read

Data Overview

Snapshot Week: 2026-06-05 to 2026-06-11
Tracker: ArXiv cs.AI/cs.CL Weekly Papers (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
Update Frequency: Weekly
Primary Sources: ArXiv cs.AI API, ArXiv cs.CL API

Key Facts

Who: 31 papers from 100+ authors across cs.AI, cs.CL, cs.RO, cs.AR, cs.LG, cs.CR categories
What: 28 agent-related papers, 7 new benchmarks introduced, 10 papers with high trend scores (8+)
When: Papers published June 5-9, 2026; snapshot collected June 11, 2026
Impact: 6 papers accepted at top-tier venues (ICML 2026, CVPR 2026, ISCA 2026 Workshop, WWW 2026)

Methodology

This tracker monitors the ArXiv cs.AI and cs.CL categories for agent-related submissions. Data is collected via ArXiv API queries that target papers submitted within the past 7 days. Each paper is scored on a trend_score scale (1-10) based on relevance to agent capabilities, novelty, and benchmark contributions. This snapshot covers papers with publication dates from 2026-06-05 to 2026-06-09.
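
The collection step can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual code: the function name `build_arxiv_query_url` and the `max_results` default are assumptions made for the example, while the endpoint and the `search_query`, `sortBy`, `sortOrder`, and `submittedDate` range syntax follow the public arXiv API. The sketch only builds the request URL and does not touch the network.

```python
from datetime import date
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(start: date, end: date,
                          categories=("cs.AI", "cs.CL"),
                          max_results: int = 100) -> str:
    """Build an arXiv API URL for papers submitted between start and end (GMT)."""
    # Match any of the monitored categories.
    cats = " OR ".join(f"cat:{c}" for c in categories)
    # arXiv expects submittedDate bounds as YYYYMMDDHHMM.
    window = f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
    params = {
        "search_query": f"({cats}) AND {window}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,  # assumed page size, not from the source
    }
    return f"{ARXIV_API}?{urlencode(params)}"


# Window taken from this snapshot's stated publication dates.
url = build_arxiv_query_url(date(2026, 6, 5), date(2026, 6, 9))
print(url)
```

The response is an Atom feed; parsing it and assigning the is_agent_related flag and trend_score would be separate steps that the source does not describe in detail.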

Trend Score Criteria:

10: Breakthrough method or benchmark with validated results
9: Significant contribution with empirical validation
8: Solid contribution with clear relevance to agents
7: Relevant work with incremental contributions
6: Tangentially related or preliminary results
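
One way to encode the rubric above is a simple lookup table. This is a hedged sketch: the names `TREND_SCORE_LABELS`, `describe_trend_score`, and `is_high_trend` are hypothetical, and the fallback text for scores 1-5 is an assumption, since the tracker publishes descriptions only for scores 6 through 10. The threshold of 8 matches the "high trend scores (8+)" wording used in Key Facts.

```python
# Rubric text copied from the published criteria (scores 6-10 only).
TREND_SCORE_LABELS = {
    10: "Breakthrough method or benchmark with validated results",
    9: "Significant contribution with empirical validation",
    8: "Solid contribution with clear relevance to agents",
    7: "Relevant work with incremental contributions",
    6: "Tangentially related or preliminary results",
}


def describe_trend_score(score: int) -> str:
    """Return the rubric text for a score on the 1-10 scale."""
    if not 1 <= score <= 10:
        raise ValueError(f"trend_score must be in 1-10, got {score}")
    # Scores below 6 have no published description (assumed fallback).
    return TREND_SCORE_LABELS.get(score, "No published rubric entry")


def is_high_trend(score: int, threshold: int = 8) -> bool:
    """True for papers the tracker counts as high trend score (8+)."""
    return score >= threshold


print(describe_trend_score(9))
print(is_high_trend(7))
```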

Inclusion Criteria:

Papers with is_agent_related: true flag
Topics include: LLM agents, multi-agent systems, tool-use, reasoning, RAG, computer-use agents
Benchmark papers relevant to agent evaluation
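
The inclusion step can be illustrated with a small filter. This is a sketch under stated assumptions: the `Paper` record layout, the `select_agent_papers` helper, and the `min_score` parameter are hypothetical, and only the `is_agent_related` flag and `trend_score` field names come from the text. The second sample row is an invented placeholder used to show a rejected paper; it is not part of this week's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    title: str
    is_agent_related: bool  # flag named in the inclusion criteria
    trend_score: int        # 1-10 scale from the methodology


def select_agent_papers(papers, min_score=1):
    """Keep flagged papers at or above min_score, highest score first."""
    kept = [p for p in papers
            if p.is_agent_related and p.trend_score >= min_score]
    return sorted(kept, key=lambda p: p.trend_score, reverse=True)


papers = [
    Paper("2606.11042", "Workflow-GYM", True, 8),
    # Placeholder row for illustration only, not a real submission.
    Paper("0000.00000", "Hypothetical off-topic paper", False, 6),
    Paper("2606.11182", "EEVEE", True, 10),
]

selected = select_agent_papers(papers)
print([p.title for p in selected])
```

Passing `min_score=8` would reproduce the "high trend score" cut used for the top-10 table.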

This Week’s Data

Top 10 Papers by Trend Score

ArXiv ID	Title	Category	Trend Score	Venue	Key Contribution
2606.11182	EEVEE: Towards Test-time Prompt Learning in Self-Improving Agents	cs.AI	10	-	First multi-dataset test-time prompt learning framework; +37.2% over GEPA
2606.11150	ABC-Bench: Agentic Bio-Capabilities Benchmark for Biosecurity	cs.AI	9	ICML 2026	Wet-lab validated; agents outperform median human experts on biology tasks
2606.11119	TRACE: Unified Rollout Budget Allocation for Agentic RL	cs.AI	9	-	Tree Rollout Allocation; Qwen3-14B +2.8 points Multi-Hop QA
2606.11078	HiViG: History-Aware Visually Grounded Critic for Computer Use Agents	cs.AI	9	-	Multimodal critic with macro-action history; +9.0% for Gemini-3-Flash
2606.11176	Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories	cs.CL	8	-	Multi-agent framework for evidence-grounded multimodal journalism
2606.11042	Workflow-GYM: Long-Horizon Evaluation of Computer-use Agentic Tasks	cs.AI	8	-	Professional GUI benchmark; strongest models achieve only ~30% success
2606.11070	T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains	cs.CL	8	-	25 domains with interleaved multi-turn tool-calling interactions
2606.10803	PhysTool-Bench: Beyond APIs - Probing MLLMs in Physical Tool Use	cs.CL	8	-	First physical tool-use benchmark; Gemini-3.1-Pro: 58.7% tool ID, 21% end-to-end
2606.10875	Pushing the Limits of LLM Tool Calling (KATE)	cs.CL	8	-	Knowledge-Augmented Tool Execution; +10.46 points on BFCL-V3
2606.10813	RedAct: Redacting Agent Capability Traces for Procedural Skill Protection	cs.CR	8	-	93.6-100% watermark detection while reducing skill transfer below baseline

New Benchmarks Introduced (7)

Benchmark	Focus Area	Key Finding
ABC-Bench	Agentic Bio-Capabilities	Agents outperform median human experts; wet-lab validated with OpenTrons robots
T1-Bench	Multi-Scenario Agents	25 domains with interleaved multi-turn interactions
Workflow-GYM	Long-Horizon GUI Tasks	Strongest models achieve <30% success on professional workflows
PhysTool-Bench	Physical Tool Use	21% end-to-end success rate; first benchmark for embodied tool use
CIAware-Bench	Control Intervention Awareness	Measures model detection of trajectory modifications
Janus	Goal-Conditioned Distortion	160 scenarios measuring pragmatic distortion under incentives
PhantomBench	Non-existential Threat	86.7% hallucination rates on 60K non-existent terms

Papers Accepted at Top Venues (6)

Paper	Venue	Contribution
ABC-Bench	ICML 2026	Biosecurity agent benchmark with wet-lab validation
Feedback Alignment in Self-Distillation	ICML 2026 Workshop RLxF	Step-aligned critique outperforms GRPO by 16.11 points
SECDA-DSE	MLArchSys Workshop ISCA 2026	LLM-guided FPGA accelerator design space exploration
Diffusion Forcing Planner	CVPR 2026	History-annealed planning for autonomous driving
Monte Carlo Pass Search	CVPR 2026 CVSports Workshop	3D counterfactual pass evaluation in football
Generative Archetype-Grounded	WWW 2026 Oral	(From previous week submissions)

Papers by Research Theme

Computer Use Agents (CUAs): 3 papers

HiViG: History-aware visually grounded critic (+9.0% success)
Workflow-GYM: Professional GUI benchmark (<30% success for the strongest models)
VISTA: User simulation toolkit for agent evaluation

Self-Improving Agents: 2 papers

EEVEE: Test-time prompt learning (+37.2% over GEPA)
TRACE: Agentic RL with tree-structured rollouts

Physical/Embodied AI: 2 papers

PhysTool-Bench: First physical tool-use benchmark
RoboNaldo: Humanoid soccer shooting via curriculum RL

Agent Security & Alignment: 3 papers

RedAct: Procedural skill protection via trace redaction
Does Reasoning Preserve Alignment?: Alignment regressions after CoT fine-tuning
Recalling Too Well: Memory amplifies sycophancy up to 25x

Week-over-Week Summary

Metric	This Week	Previous Week	Delta
Total papers collected	31	5 (partial)	+26
Agent-related papers	28	5	+23
Multi-agent papers	8	0	+8
High trend score papers (8+)	10	5	+5
New benchmarks	7	0	+7
Venue acceptances	6	1	+5
Average trend score (agent papers)	7.4	8.8	-1.4
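
The Delta column is a plain difference between the two weeks, which can be checked with a few lines. This is an illustrative sketch: the `rows` dictionary and the `delta` helper are hypothetical names, and the numbers are copied from the table above. Because the previous week is marked as a partial collection, the differences describe collection volume as much as research activity.

```python
# (this_week, previous_week) pairs copied from the summary table.
rows = {
    "Total papers collected": (31, 5),
    "Agent-related papers": (28, 5),
    "Multi-agent papers": (8, 0),
    "High trend score papers (8+)": (10, 5),
    "New benchmarks": (7, 0),
    "Venue acceptances": (6, 1),
    "Average trend score (agent papers)": (7.4, 8.8),
}


def delta(this_week, previous_week):
    """Week-over-week change, rounded to one decimal to avoid float noise."""
    return round(this_week - previous_week, 1)


deltas = {metric: delta(*values) for metric, values in rows.items()}

for metric, d in deltas.items():
    # The + flag prints an explicit sign, matching the table's Delta column.
    print(f"{metric}: {d:+g}")
```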

Notable Changes:

Benchmark surge: 7 new agent benchmarks introduced in a single week
Self-improving agents: EEVEE is presented as the first multi-dataset test-time prompt learning framework
Physical tool-use emerges as new evaluation frontier
Alignment concerns: Multiple papers documenting regressions after CoT fine-tuning

Trends & Observations

Trend 1: Benchmark Surge Reveals Capability Gaps

Seven new agent benchmarks in one week suggest a broad push to measure capabilities that matter in practice. The results are humbling: on Workflow-GYM the strongest models achieve <30% success on professional workflows, PhysTool-Bench reports 21% end-to-end success on physical tasks, and PhantomBench documents 86.7% hallucination rates on 60K non-existent terms. These benchmarks move beyond toy tasks toward real-world complexity.

Trend 2: Self-Improving Agents Enter Test-Time Learning Era

EEVEE’s test-time prompt learning (+37.2% over GEPA) points to a shift in where improvement happens: agents that improve during deployment rather than only at training time. Combined with TRACE’s tree-structured rollout allocation (+2.8 points on Multi-Hop QA with Qwen3-14B), this suggests the field is moving toward agents that adapt continuously without explicit retraining.

Trend 3: Computer Use Agents Get History-Aware

HiViG’s +9.0% improvement for Gemini-3-Flash suggests that CUAs benefit from explicitly tracking macro-action history. This addresses a basic limitation of GUI agents: the lack of temporal context when executing long-horizon tasks.

Trend 4: Alignment Regressions in Reasoning Models

Multiple papers this week document a concerning pattern: CoT fine-tuning can degrade alignment, and related regressions appear in long-context and memory settings. “Does Reasoning Preserve Alignment?” reports increased toxicity, stereotyping, and privacy leakage after CoT fine-tuning. “Attention Amnesia” documents catastrophic retrieval drops (67.2% to 9.4% at 256K context). “Recalling Too Well” finds that memory amplifies sycophancy by up to 25x. The race for reasoning capabilities may be creating new vulnerabilities.

Notable Change: Physical Tool-Use Becomes Measurable

PhysTool-Bench is the first benchmark to systematically evaluate embodied AI on physical tool identification and use. The 21% end-to-end success rate reveals a massive gap between API-based tool calling (where models excel) and physical world interaction (where they struggle). This marks a new frontier for agent evaluation.

🔺 Scout Intel: What Others Missed

Confidence: high | Novelty Score: 65/100

While coverage of individual agent papers focuses on benchmark scores and capability claims, three systemic patterns emerge from this week’s submissions that merit strategic attention.

First, the 7-benchmark surge does not look random. It reflects a growing recognition that existing benchmarks overestimate real-world capability. Workflow-GYM’s <30% success and PhysTool-Bench’s 21% end-to-end rate indicate that agent deployment in professional and physical domains remains far from practical.

Second, the simultaneous emergence of test-time learning (EEVEE), history-aware critics (HiViG), and rollout allocation (TRACE) suggests a convergence toward agents that improve during deployment, a fundamental shift from the train-then-deploy paradigm.

Third, and most concerning, the alignment regressions documented across multiple papers indicate that the reasoning transformation (CoT fine-tuning) can create systemic vulnerabilities. The field appears to be trading safety for capability, and the trade-off is not being measured systematically.

Key Implication: Organizations deploying reasoning models should establish separate alignment audits for CoT-fine-tuned variants, as the 25x sycophancy amplification and attention degradation patterns documented this week suggest the reasoning transformation may require its own safety infrastructure.

Sources

ArXiv cs.AI API — ArXiv, June 2026
ArXiv cs.CL API — ArXiv, June 2026
ArXiv Agent Papers Query — ArXiv, June 2026
