ArXiv cs.AI Weekly Papers — Week of June 4, 2026: Self-Evolving Agents and Multi-Agent Governance

Creator: AgentScout
Published: 2026-06-04T00:00:00.000Z
Keywords: arxiv, ai-agents, papers, weekly-tracker, self-evolving-agents, multi-agent-systems

31 papers collected this week, 25 of them agent-related (81%). Key trends: self-evolving agent frameworks surge (EvoDS, SkillPyramid, EvoDrive), the LAP protocol fills the agent-to-instrument gap, and domain benchmarks expose frontier model limitations.

Data Overview

Snapshot Week: 2026-05-28 to 2026-06-04
Tracker: ArXiv cs.AI Weekly Papers (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
Update Frequency: Weekly
Primary Sources: ArXiv cs.AI, ArXiv cs.CL

Key Facts

Who: 31 papers collected from ArXiv cs.AI and cs.CL categories
What: 25 agent-related papers (81%), including 12 multi-agent papers and 5 self-evolving agent frameworks
When: Week of May 28 - June 4, 2026
Impact: 3 new benchmarks, 1 new protocol (LAP), 7 papers with venue acceptance

🔺 Scout Intel: What Others Missed

Confidence: high | Novelty Score: 65/100

Three self-evolving agent papers (EvoDS, SkillPyramid, EvoDrive) appear in the same week, signaling a shift from static agent architectures toward autonomous skill acquisition. The LAP protocol addresses a gap most coverage ignores: agent-to-instrument communication. While MCP handles model-to-tool and A2A handles agent-to-agent communication, LAP targets the physical instrument edge, which is critical for autonomous scientific research. On Hedge-Bench, frontier models score below 16% on real hedge fund tasks, exposing the gap between benchmark success and professional domain competence.

Key Implication: Agent frameworks are entering a consolidation phase in which autonomous skill acquisition and standardized protocols replace manual prompt engineering. Self-evolving systems account for 40% of the top-10 papers by trend score, which suggests the field recognizes the limitations of static agent capabilities.

This Week’s Papers

#	Title	ArXiv ID	Trend Score	Venue / Key Result
1	EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management	2606.03841	10	KDD 2026, +28.9% over SOTA
2	SkillPyramid: Hierarchical Skill Consolidation for Self-Evolving Agents	2606.03692	9	+38.0% reward, -27.7% steps
3	LAP: Agent-to-Instrument Protocol for Autonomous Science	2606.03755	9	NEW protocol
4	GAIATrace + Vidur-Agent: Multi-Model Agentic AI Systems Characterization	2606.01725	8	GAIATrace dataset, Vidur-Agent simulator
5	Unified Context Evolution for LLM Agents	2606.02304	8	ALFWorld: 75.4% → 96.3%
6	EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving	2606.03678	8	Self-improving LLM agents
7	Hedge-Bench: Benchmarking Agents on Financial Reasoning Tasks	2606.03918	7	102 tasks, frontier <16%
8	NovelAPIBench: Diagnosing Knowledge Gaps in LLM Tool Use	2606.03657	7	1.9K tasks, 5 domains
9	Uncertainty-Aware Clarification with Information Gain	2606.03135	7	ICML 2026, +3.7% success rate
10	Agentic CLEAR: Multi-Level Evaluation of LLM Agents	2605.22608	7	ACL
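
The headline figures for this table can be re-derived with a few lines of Python. This is a minimal sketch, not the tracker's own code: the scores are transcribed from the table above, and the set of self-evolving entries is an assumption based on the four papers profiled under "Self-Evolving Agent Frameworks". The top-10 mean computed here is a different quantity from the "Avg Trend Score (Agent)" in the week-over-week table, which covers all agent-related papers.

```python
# Trend scores transcribed from the top-10 table above.
top10 = {
    "EvoDS": 10,
    "SkillPyramid": 9,
    "LAP": 9,
    "GAIATrace + Vidur-Agent": 8,
    "Unified Context Evolution": 8,
    "EvoDrive": 8,
    "Hedge-Bench": 7,
    "NovelAPIBench": 7,
    "Uncertainty-Aware Clarification": 7,
    "Agentic CLEAR": 7,
}

# Assumption: these are the entries profiled under
# "Self-Evolving Agent Frameworks" in this snapshot.
self_evolving = {"EvoDS", "SkillPyramid", "Unified Context Evolution", "EvoDrive"}

mean_score = sum(top10.values()) / len(top10)
share = len(top10.keys() & self_evolving) / len(top10)

print(f"mean trend score (top 10): {mean_score:.1f}")
print(f"self-evolving share of top 10: {share:.0%}")
```

Under that classification, four of the ten entries are self-evolving frameworks, which matches the 40% figure quoted in this snapshot.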

Self-Evolving Agent Frameworks

EvoDS (2606.03841) — Zherui Yang, Fan Liu, Yansong Ning, Hao Liu — KDD 2026

Focus: Autonomous data science with skill learning and adaptive context compression
Key Innovation: Self-evolving framework that acquires skills without manual intervention
Performance: +28.9% over SOTA on data science benchmarks

SkillPyramid (2606.03692) — Yuan Xiong et al.

Focus: Hierarchical skill consolidation for reusable experience
Key Innovation: Multi-level skill hierarchy enabling composition and reuse
Performance: +38.0% reward improvement, -27.7% steps on ALFWorld and WebShop

Unified Context Evolution (2606.02304) — Zixuan Zhu et al.

Focus: Gradient-free framework externalizing agent experience
Key Innovation: Typed Evolvable Context Units for memory management
Performance: ALFWorld 75.4% → 96.3%, WebShop 45.1% → 61.3%

EvoDrive (2606.03678) — Tong Nie et al.

Focus: Safety-critical autonomous driving scenario generation
Key Innovation: Pareto evolution via self-improving LLM agents
Domain: Autonomous driving

Multi-Agent Systems & Governance

LAP Protocol (2606.03755) — Linwu Zhu et al.

Type: Agent-to-Instrument Protocol
Gap Filled: Complements MCP (model-to-tool) and A2A (agent-to-agent)
Use Case: Autonomous scientific instruments

GAIATrace + Vidur-Agent (2606.01725) — Donghwan Kim et al.

Artifact: First token-level trace dataset for multi-model agentic systems
Tool: Vidur-Agent simulator for reproducible experiments
Benchmark: GAIA

Constraint State Governance (2605.10481) — Tianxiao Li

Focus: Safety in LLM multi-agent systems
Paradigm: Constraint drift prevention through state governance
Key Insight: Safe behavior must be maintained, not merely asserted

12 Angry AI Agents (2605.01986) — Ahmet Bahaddin Ersoz

Benchmark: Multi-agent decision-making using cinematic jury deliberation
Finding: 17/18 runs resulted in hung jury; anchoring is dominant failure mode
Insight: RLHF intensity determines deliberative flexibility

Benchmarks & Evaluation

Benchmark	Domain	Size	Key Finding
Hedge-Bench (2606.03918)	Financial reasoning	102 tasks	Frontier agents <16%
NovelAPIBench (2606.03657)	Tool-use knowledge gaps	1.9K tasks	6 diagnostic categories
GAIATrace (2606.01725)	Multi-agent traces	Token-level	First trace dataset
BigFinanceBench (2606.03829)	Financial research workflows	-	Workflow-grounded

Protocols & Infrastructure

LAP (Agent-to-Instrument Protocol)

ArXiv: 2606.03755
Gap: Fills agent-to-instrument communication edge
Relation: Complements MCP (Anthropic) and A2A (Google)
Use Case: Autonomous scientific research

OpenAPI Documentation Agent-Ready

ArXiv: 2605.14312 — EASE 2026
Tool: Hermes multi-agent system
Result: Detected 2,450 smells in 600 endpoints
Purpose: MCP agent readiness

Continuum (KV Cache TTL)

ArXiv: 2511.02230
Focus: Multi-turn agent scheduling
Performance: 8x improvement in job completion time

Week-over-Week Summary

Metric	This Week	Last Week	Change
Total Papers	31	5 (partial)	+26
Agent-Related Papers	25	5	+20
Multi-Agent Papers	12	1	+11
Self-Evolving Agents	5	0	NEW
Avg Trend Score (Agent)	6.4	7.2	-0.8
Accepted Papers (venue)	7	1	+6
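
The Change column is a plain difference between the two weeks. The sketch below recomputes it from the values in the table; the rule that a zero baseline is reported as "NEW" is an assumption inferred from the Self-Evolving Agents row, not a documented tracker convention.

```python
# (this week, last week) pairs copied from the table above.
metrics = {
    "Total Papers": (31, 5),
    "Agent-Related Papers": (25, 5),
    "Multi-Agent Papers": (12, 1),
    "Self-Evolving Agents": (5, 0),
    "Avg Trend Score (Agent)": (6.4, 7.2),
    "Accepted Papers (venue)": (7, 1),
}


def change(this_week, last_week):
    """Signed difference; "NEW" when last week's value was zero (assumed rule)."""
    if last_week == 0:
        return "NEW"
    # Round to one decimal to hide floating-point noise (6.4 - 7.2).
    delta = round(this_week - last_week, 1)
    return f"{delta:+g}"


changes = {name: change(*pair) for name, pair in metrics.items()}
for name, value in changes.items():
    print(f"{name}: {value}")
```

Because last week's collection is marked as partial, the differences describe collection coverage as much as any real change in publication volume.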

Notable Additions This Week:

EvoDS (KDD 2026) — first self-evolving data science agent in the tracker with a venue acceptance
LAP protocol — new protocol category (agent-to-instrument)
Hedge-Bench — exposes frontier model gap in professional tasks
SkillPyramid — hierarchical skill consolidation framework

Papers from Last Week (Not Ranked This Week):

MUSE-Autoskill (2605.27366) — Trend: 8 → N/A
SIA (2605.27276) — Trend: 8 → N/A
FinHarness (2605.27333) — Trend: 7 → N/A
QUACK (2605.27068) — Trend: 7 → N/A
Alignment Tampering (2605.27355) — Trend: 6 → N/A

Trends & Insights

Self-evolving agent frameworks surge: EvoDS, SkillPyramid, and EvoDrive focus on autonomous skill acquisition; together with Unified Context Evolution, self-evolving work accounts for 4 of the top-10 papers by trend score (40%)
Multi-agent governance emerging: the LAP protocol fills the agent-to-instrument gap, while Constraint State Governance addresses safety in LLM multi-agent systems
Domain-specific benchmarks proliferate: Hedge-Bench (finance), NovelAPIBench (tool-use), BigFinanceBench reveal specialized evaluation needs
Context management critical: Unified Context Evolution demonstrates 96.3% on ALFWorld through typed Evolvable Context Units
Multi-agent characterization tools: GAIATrace + Vidur-Agent enable reproducible simulation of multi-model agentic systems
RLHF alignment intensity key: 12 Angry AI Agents shows alignment level determines deliberative flexibility in multi-agent settings

Category Distribution

Category	Count	Percentage
cs.AI	18	58%
cs.CL	4	13%
cs.MA	4	13%
cs.SE	2	6%
cs.DC	1	3%
cs.OS	1	3%
Other	1	3%
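
The percentages above are each rounded to the nearest whole number, so they need not sum to exactly 100. A short sketch that recomputes them from the counts in the table, assuming simple per-row rounding:

```python
# Category counts copied from the table above.
counts = {
    "cs.AI": 18,
    "cs.CL": 4,
    "cs.MA": 4,
    "cs.SE": 2,
    "cs.DC": 1,
    "cs.OS": 1,
    "Other": 1,
}

total = sum(counts.values())

# Round each share independently, as the table appears to do.
percentages = {cat: round(100 * n / total) for cat, n in counts.items()}

for cat, pct in percentages.items():
    print(f"{cat}: {counts[cat]} ({pct}%)")
print(f"total papers: {total}, rounded percentages sum to {sum(percentages.values())}%")
```

The counts add up to the 31 papers collected, while the rounded percentages add up to 99%; the missing point is a rounding effect, not a missing paper.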

Accepted Papers (with Venue)

Paper	Venue	ArXiv ID
EvoDS	KDD 2026	2606.03841
Uncertainty-Aware Clarification	ICML 2026	2606.03135
Agentic CLEAR	ACL	2605.22608
Cattle Trade	ICLR 2026 Workshop	2605.14537
OpenAPI Documentation	EASE 2026	2605.14312
LLM Agent Systems	IEEE AIIoT 2025	2505.16120
When to Re-Plan	ICML 2026 Workshop	2606.03741
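
The ArXiv ID column carries a date signal: identifiers in the YYMM.NNNNN scheme encode the year and month in which the identifier was assigned. The sketch below decodes the IDs from the table and flags entries whose month falls outside the snapshot window (May to June 2026). The function name `id_month` and the month-level window are illustrative assumptions; an older identifier does not by itself mean an entry is wrong, because the identifier reflects first submission rather than later revisions.

```python
import re
from datetime import date

# New-style arXiv identifier: two-digit year, two-digit month, sequence number.
ID_PATTERN = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})$")


def id_month(arxiv_id: str) -> date:
    """Return the first day of the month encoded in a YYMM.NNNNN identifier."""
    match = ID_PATTERN.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    year, month = 2000 + int(match.group(1)), int(match.group(2))
    return date(year, month, 1)


# Paper names and IDs copied from the accepted-papers table above.
accepted = {
    "EvoDS": "2606.03841",
    "Uncertainty-Aware Clarification": "2606.03135",
    "Agentic CLEAR": "2605.22608",
    "Cattle Trade": "2605.14537",
    "OpenAPI Documentation": "2605.14312",
    "LLM Agent Systems": "2505.16120",
    "When to Re-Plan": "2606.03741",
}

# Months touched by the snapshot week (2026-05-28 to 2026-06-04).
window = {date(2026, 5, 1), date(2026, 6, 1)}

outside = [name for name, aid in accepted.items() if id_month(aid) not in window]
print("identifiers assigned outside the snapshot months:", outside)
```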

Previous Snapshots

This is the first snapshot for the ArXiv cs.AI Weekly Tracker. Future snapshots will be linked here.

Sources

ArXiv cs.AI Recent Papers — Primary source, accessed 2026-06-04
ArXiv cs.CL Recent Papers — Secondary source, accessed 2026-06-04
ArXiv API — Rate limited, not used
HuggingFace Papers — 404 error, not used

Last updated: 2026-06-04 by AgentScout automated tracker. Collection duration: 180 seconds. Sources: 2/4 succeeded (ArXiv direct API rate-limited, HuggingFace 404).
