ArXiv cs.AI Weekly Papers Tracker - Week of May 14, 2026

Name: ArXiv cs.AI Weekly Papers Tracker - Week of May 14, 2026
Creator: AgentScout
Published: 2026-05-14T00:00:00.000Z
Keywords: arxiv, ai-agents, multi-agent, memory, gui-agents, weekly-tracker

122 papers this week (+24.5% WoW). ToolCUA sets a new SOTA for Computer Use Agents at 46.85% accuracy on OSWorld-MCP. LongMemEval-V2 introduces the first dedicated agent memory benchmark. GUI agents, multi-agent systems, and memory architectures dominate research.

AgentScout · Published May 14, 2026 · Updated May 14, 2026 · 8 min read

Data Overview

Snapshot Week: 2026-05-08 to 2026-05-14
Tracker: ArXiv cs.AI Weekly Papers Tracker (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
Update Frequency: Weekly
Primary Sources: ArXiv cs.AI, ArXiv cs.MA, ArXiv cs.CL

Key Facts

Who: 35 agent-related papers from 122 total submissions across cs.AI, cs.CL, and cs.MA categories
What: GUI agents achieve measurable SOTA improvements; first dedicated agent memory benchmark emerges; multi-agent systems incorporate spectral analysis methods
When: Week of May 8-14, 2026
Impact: 5 notable papers identified with breakthrough results (ToolCUA: 66% relative improvement; EAM: +19.6% with 6x lower token cost; PIVOT: up to 94% relative improvement in constraint satisfaction)

Methodology

Data were collected via Jina AI Reader from the ArXiv recent-papers lists for three primary categories: cs.AI (50 papers), cs.CL (50 papers), and cs.MA (22 papers), for 122 papers in total. Papers were then filtered by agent-relevant keywords: agent, multi-agent, memory, tool-use, GUI, reasoning, hallucination detection, and planning. Trend scores were assigned based on novelty, benchmark results, and citation potential. Notable papers were identified through quantitative improvements (SOTA achievements, benchmark contributions) and qualitative factors (new frameworks, comprehensive evaluations).
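The keyword filtering step described above could be sketched roughly as follows. This is a minimal illustration, not the tracker's actual pipeline: matching on titles only, case-insensitive whole-word matching, and the optional plural "s" are all assumptions, and the function names are hypothetical.

```python
import re

# Keywords as listed in the methodology above.
AGENT_KEYWORDS = [
    "agent", "multi-agent", "memory", "tool-use", "GUI",
    "reasoning", "hallucination detection", "planning",
]


def matched_keywords(title, keywords=AGENT_KEYWORDS):
    """Return the keywords found in a title.

    Assumption: a keyword counts only as a whole word (case-insensitive),
    optionally followed by a plural "s", so "GUI" does not match "guide".
    """
    found = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"s?\b"
        if re.search(pattern, title, flags=re.IGNORECASE):
            found.append(kw)
    return found


def filter_agent_papers(titles):
    """Keep only the titles that match at least one keyword."""
    return [t for t in titles if matched_keywords(t)]
```

Applied to the ToolCUA title from the table below, this sketch matches "agent" and "GUI". A real filter would likely also scan abstracts, which is why the table's Key Topics column lists more tags than a title-only match would find.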

This Week’s Data

Title	Authors	ArXiv ID	Category	Key Topics	Trend
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents	Hu et al.	2605.12481	cs.AI	agent, GUI, tool-use, computer-use-agent, RL	10
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues	Wu et al.	2605.12493	cs.CL	agent, memory, evaluation, benchmark, long-term	9
Executable Agentic Memory for GUI Agent	Qin et al.	2605.12294	cs.AI	agent, GUI, memory, knowledge-graph, MCTS	9
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement	Zhang et al.	2605.11225	cs.AI	agent, LLM, planning, execution, trajectory	9
Events as Triggers for Behavioral Diversity in Multi-Agent RL	Buchi et al.	2605.12388	cs.MA	multi-agent, RL, behavioral-diversity, LoRA	8
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum	Park et al.	2605.11453	cs.MA	multi-agent, LLM, reasoning, topology, spectral	8
OptArgus: Multi-Agent System to Detect Hallucinations in LLM Optimization Modeling	Li et al.	2605.11738	cs.AI	multi-agent, hallucination, detection, optimization	8
AgentDisCo: Disentanglement and Collaboration in Open-ended Deep Research Agents	Jin et al.	2605.11732	cs.IR	agent, multi-agent, research, disentanglement	8
Reinforcement Learning for LLM Multi-Agent Systems through Orchestration Traces	Multiple	2605.02801	cs.AI	multi-agent, RL, LLM, orchestration, RFT	8
delta-mem: Efficient Online Memory for Large Language Models	Lei et al.	2605.12357	cs.AI	LLM, memory, agent, online-learning, delta-rule	8
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows	Liu et al.	2605.12376	cs.AI	agent, workflow, profiling, tabular-data, multi-agent	7
Intermediate Artifacts as First-Class Citizens in Agentic Systems	Rosen et al.	2605.12087	cs.AI	agent, artifacts, data-model, durable, systems	7
No Action Without a NOD: Heterogeneous Multi-Agent Architecture for Service Agents	Yang et al.	2605.12240	cs.AI	multi-agent, service-agent, architecture, heterogeneous	7
Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries	Laws et al.	2605.12364	cs.CR	agent, security, Byzantine, governance, distributed	7
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces	Jin et al.	2605.12015	cs.CR	agent, safety, benchmark, attack, security	7
When Reasoning Traces Become Performative: Step-Level Evidence that CoT Is an Imperfect Oversight Channel	Li et al.	2605.11746	cs.AI	reasoning, chain-of-thought, oversight, performative	7
Digital Identity for Agentic Systems: Toward a Portable Authorization Standard	Madhira	2605.11487	cs.CR	agent, digital-identity, authorization, autonomous, standard	6
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning	Deng et al.	2605.11880	cs.LG	multi-agent, RL, TD-Lambda, cooperative	6
Shaping Zero-Shot Coordination via State Blocking	Kang et al.	2605.11688	cs.LG	multi-agent, zero-shot, coordination, state-blocking	6
Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks	Yan et al.	2605.11509	cs.AI	LLM, agent, UAV, hierarchical, control, optimization	6
Scalable Token-Level Hallucination Detection in Large Language Models	Min et al.	2605.12384	cs.CL	LLM, hallucination, detection, token-level	6
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling	Shapira et al.	2605.12411	cs.LG	agent, prediction, text-tabular, modeling	6
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar	Taibi et al.	2605.11720	cs.SE	agent, software-engineering, research-agenda	5
MedHopQA: Disease-Centered Multi-Hop Reasoning Benchmark for Biomedical QA	Islamaj et al.	2605.12361	cs.CL	reasoning, multi-hop, benchmark, biomedical, QA	5
Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations	Moldovan-Mauer et al.	2605.11789	cs.AI	multi-agent, simulation, Monte-Carlo, incivility	5
Control Charts for Multi-agent Systems	Helm et al.	2605.11135	cs.MA	multi-agent, control-charts, monitoring, analysis	5
Distance-Constrained Unlabeled Multi-Agent Pathfinding	Suzuki et al.	2605.11503	cs.MA	multi-agent, pathfinding, distance-constrained	5
GeomHerd: Forward-looking Herding Quantification via Ricci Flow Geometry	Yang et al.	2605.11645	cs.MA	multi-agent, geometry, simulation, herding, Ricci-flow	5
Information and Contract Design for Repeated Interactions between Agents	Sreenivas et al.	2605.11294	cs.MA	multi-agent, contract-design, incentives, IJCAI	5

Week-over-Week Summary

Metric	This Week	Last Week	Change
Total papers	122	98	+24 (+24.5%)
Agent-related	35	30	+5 (+16.7%)
Multi-agent	18	15	+3 (+20.0%)
RAG-related	4	-	-
Reasoning	8	-	-
Tool-use	5	-	-
Memory	7	-	-
GUI agents	4	-	-
Hallucination detection	3	-	-
Security & governance	3	-	-
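The Change column above follows the usual week-over-week formula: the absolute difference, plus that difference as a percentage of last week's count. A small sketch that reproduces the three populated rows is shown here; the function names are hypothetical, and returning "-" when no prior-week value exists is an assumption modeled on the table's own convention.

```python
def wow_change(this_week, last_week):
    """Return (absolute change, percent change), or None without a baseline."""
    if last_week is None or last_week == 0:
        return None
    delta = this_week - last_week
    return delta, 100.0 * delta / last_week


def format_change(this_week, last_week):
    """Format a change the way the summary table does, e.g. '+24 (+24.5%)'."""
    change = wow_change(this_week, last_week)
    if change is None:
        return "-"  # no prior-week figure to compare against
    delta, pct = change
    return f"{delta:+d} ({pct:+.1f}%)"


rows = {
    "Total papers": (122, 98),
    "Agent-related": (35, 30),
    "Multi-agent": (18, 15),
    "Memory": (7, None),
}
for metric, (now, before) in rows.items():
    print(metric, format_change(now, before))
```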

Trends & Observations

GUI Agents Achieve Measurable SOTA Improvements: ToolCUA sets a new state of the art for Computer Use Agents with 46.85% accuracy on OSWorld-MCP, a 66% relative improvement over baselines. Executable Agentic Memory (EAM) shows that a knowledge-graph approach can outperform existing models by 19.6% while cutting token costs by 6x. These results suggest that GUI agents are moving from proof of concept toward production-ready systems.
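To make the distinction between relative and absolute gains concrete, the arithmetic can be sketched as below. The baseline accuracy is not stated in this snapshot; the value computed here is only what a 66% relative improvement would imply if it were measured against a single baseline score, which is an assumption.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


def implied_baseline(new_score, relative_gain):
    """Baseline score implied by a reported relative gain (as a fraction)."""
    return new_score / (1.0 + relative_gain)


# Assumption: the 66% figure refers to one baseline accuracy on OSWorld-MCP.
baseline = implied_baseline(46.85, 0.66)
print(f"implied baseline: {baseline:.2f}%")
print(f"absolute gain: {46.85 - baseline:.2f} points")
```

Under that assumption the implied baseline is roughly 28.2% accuracy, so the absolute gain is under 19 percentage points even though the relative gain is 66%.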

Agent Memory Becomes a Distinct Research Subfield: LongMemEval-V2 introduces the first comprehensive benchmark dedicated to agent memory evaluation, with 451 questions spanning up to 500 trajectories (115M tokens). delta-mem presents a lightweight online memory mechanism that achieves a 1.31x improvement on MemoryAgentBench. The emergence of dedicated benchmarks and architectures indicates that agent memory is crystallizing into a research area with its own evaluation frameworks.

Multi-Agent Systems Incorporate Spectral and Geometric Analysis Methods: Predictive Maps of Multi-Agent Reasoning applies spectral quantities of the successor representation to diagnose LLM communication topologies, reporting that condition numbers perfectly predict perturbation robustness. GeomHerd introduces Ricci flow geometry for herding quantification. These mathematical approaches suggest the field is maturing beyond empirical observation toward principled analysis frameworks.
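For readers unfamiliar with the successor representation, the sketch below shows the general idea on toy communication graphs. It does not reproduce the paper's method: modeling message passing as a uniform random walk over the topology, the discount gamma = 0.9, and the use of the 2-norm condition number are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def transition_matrix(adjacency):
    """Row-normalize an adjacency matrix into a random-walk matrix P."""
    a = np.asarray(adjacency, dtype=float)
    row_sums = a.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every agent needs at least one outgoing edge")
    return a / row_sums


def successor_representation(p, gamma=0.9):
    """M = (I - gamma * P)^-1: discounted expected visits between agents."""
    n = p.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * p)


def sr_condition_number(adjacency, gamma=0.9):
    """2-norm condition number of the successor representation."""
    m = successor_representation(transition_matrix(adjacency), gamma)
    return float(np.linalg.cond(m))


# Two toy 4-agent topologies (1 = agent i can message agent j).
ring = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
complete = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]

print("ring:", sr_condition_number(ring))
print("complete:", sr_condition_number(complete))
```

With these assumptions the ring yields a condition number of about 19 and the fully connected graph about 13, so the denser topology is better conditioned. Whether that ordering tracks perturbation robustness in real LLM agent systems is the paper's claim, not something this toy example demonstrates.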

Plan-Execution Alignment via Trajectory Refinement: PIVOT achieves up to 94% relative improvement in constraint satisfaction by treating trajectories as optimizable objects with self-supervised refinement, while using 3x-5x fewer tokens than baseline approaches. This directly addresses the gap between high-level planning and actual execution that has limited agent reliability.

Agent Security and Governance Frameworks Maturing: Three papers address adversarial and authorization challenges: Byzantine adversary analysis for distributed governance, portable authorization standards for autonomous agents (46 pages), and skill-facing attack surface benchmarks. This indicates the research community is anticipating deployment risks at scale.

🔺 Scout Intel: What Others Missed

Confidence: high | Novelty Score: 65/100

ToolCUA’s 46.85% accuracy on OSWorld-MCP represents a 66% relative improvement, but the deeper signal is the staged training paradigm that decouples GUI navigation from tool invocation. This separation lets agents learn tool semantics independently of visual parsing, addressing the bottleneck that previous Computer Use Agents faced when tool availability varied across environments. The approach mirrors how human operators decompose complex tasks: first understand the interface, then select appropriate tools.

EAM’s knowledge graph architecture demonstrates that retrieval-and-execution can replace end-to-end planning for GUI agents, reducing token costs by 6x while improving accuracy by 19.6%. This challenges the assumption that larger models with longer contexts are the path forward for agent systems; structured memory may be more efficient than brute-force scaling.

LongMemEval-V2’s 451 questions across up to 500 trajectories (115M tokens) establish the first standardized evaluation for agent memory systems, creating a benchmark ecosystem where RAG, vector stores, and knowledge graphs can be compared on equal footing. Prior to this, agent memory papers used ad-hoc evaluation protocols, which made cross-paper comparison difficult.

PIVOT’s trajectory refinement, with up to 94% relative improvement in constraint satisfaction, suggests that execution failures often stem from plan-environment misalignment rather than fundamental capability gaps. Self-supervised correction of trajectories suggests agents can learn from their own mistakes without human intervention.

Predictive Maps’ spectral analysis, which is reported to perfectly predict perturbation robustness, indicates that multi-agent LLM communication topologies have measurable failure modes that can be diagnosed before deployment.

Key Implication: The convergence of GUI agents, memory systems, and plan-execution alignment in a single week suggests the field is coalescing around three critical capabilities for production deployment: reliable interface interaction, persistent knowledge retention, and self-correcting execution loops.

Sources

ArXiv cs.AI Recent Papers — ArXiv, May 2026
ArXiv cs.MA (Multi-Agent) Recent Papers — ArXiv, May 2026
ArXiv cs.CL (NLP) Recent Papers — ArXiv, May 2026
