ArXiv cs.AI Weekly Papers Tracker - Week of May 14, 2026
122 papers this week (+24.5% WoW). ToolCUA sets new SOTA for Computer Use Agents at 46.85% accuracy. LongMemEval-V2 introduces first dedicated agent memory benchmark. GUI agents, multi-agent systems, and memory architectures dominate research.
Data Overview
- Snapshot Week: 2026-05-08 to 2026-05-14
- Tracker: ArXiv cs.AI Weekly Papers Tracker (view all snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
- Update Frequency: Weekly
- Primary Sources: ArXiv cs.AI, ArXiv cs.MA, ArXiv cs.CL
Key Facts
- Who: 35 agent-related papers from 122 total submissions across cs.AI, cs.CL, and cs.MA categories
- What: GUI agents achieve measurable SOTA improvements; first dedicated agent memory benchmark emerges; multi-agent systems incorporate spectral analysis methods
- When: Week of May 8-14, 2026
- Impact: 5 notable papers identified with breakthrough results (ToolCUA +66% improvement, EAM +19.6% with 6x efficiency, PIVOT +94% constraint satisfaction)
Methodology
Data were collected via Jina AI Reader from the ArXiv recent-papers lists for three primary categories: cs.AI (50 papers), cs.CL (50 papers), and cs.MA (22 papers). Papers were filtered by agent-relevant keywords, including agent, multi-agent, memory, tool-use, GUI, reasoning, hallucination detection, and planning. Trend scores were assigned based on novelty, benchmark results, and citation potential. Notable papers were identified through quantitative improvements (SOTA achievements, benchmark contributions) and qualitative factors (new frameworks, comprehensive evaluations).
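The keyword filter described above could look roughly like the following sketch. The keyword list comes from the methodology text. The matching rule (case-insensitive, whole-word, optional plural "s") and the function names `matched_keywords` and `is_agent_related` are assumptions made for illustration, not the tracker's actual implementation.

```python
import re

# Keywords as listed in the methodology; everything else here is hypothetical.
AGENT_KEYWORDS = [
    "agent", "multi-agent", "memory", "tool-use", "GUI",
    "reasoning", "hallucination detection", "planning",
]


def matched_keywords(title, abstract=""):
    """Return the keywords found in the title or abstract, in list order."""
    text = f"{title} {abstract}".lower()
    hits = []
    for keyword in AGENT_KEYWORDS:
        # Whole-word match with an optional trailing "s" (assumed rule).
        pattern = r"\b" + re.escape(keyword.lower()) + r"s?\b"
        if re.search(pattern, text):
            hits.append(keyword)
    return hits


def is_agent_related(title, abstract=""):
    """A paper counts as agent-related if at least one keyword matches."""
    return bool(matched_keywords(title, abstract))


if __name__ == "__main__":
    title = ("ToolCUA: Towards Optimal GUI-Tool Path Orchestration "
             "for Computer Use Agents")
    print(matched_keywords(title))
```

A whole-word rule keeps a keyword like "agent" from matching inside unrelated longer words, at the cost of missing inflections other than the plural.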
This Week's Data
| Title | Authors | ArXiv ID | Category | Key Topics | Trend |
|---|---|---|---|---|---|
| ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents | Hu et al. | 2605.12481 | cs.AI | agent, GUI, tool-use, computer-use-agent, RL | 10 |
| LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues | Wu et al. | 2605.12493 | cs.CL | agent, memory, evaluation, benchmark, long-term | 9 |
| Executable Agentic Memory for GUI Agent | Qin et al. | 2605.12294 | cs.AI | agent, GUI, memory, knowledge-graph, MCTS | 9 |
| PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement | Zhang et al. | 2605.11225 | cs.AI | agent, LLM, planning, execution, trajectory | 9 |
| Events as Triggers for Behavioral Diversity in Multi-Agent RL | Buchi et al. | 2605.12388 | cs.MA | multi-agent, RL, behavioral-diversity, LoRA | 8 |
| Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum | Park et al. | 2605.11453 | cs.MA | multi-agent, LLM, reasoning, topology, spectral | 8 |
| OptArgus: Multi-Agent System to Detect Hallucinations in LLM Optimization Modeling | Li et al. | 2605.11738 | cs.AI | multi-agent, hallucination, detection, optimization | 8 |
| AgentDisCo: Disentanglement and Collaboration in Open-ended Deep Research Agents | Jin et al. | 2605.11732 | cs.IR | agent, multi-agent, research, disentanglement | 8 |
| Reinforcement Learning for LLM Multi-Agent Systems through Orchestration Traces | Multiple | 2605.02801 | cs.AI | multi-agent, RL, LLM, orchestration, RFT | 8 |
| delta-mem: Efficient Online Memory for Large Language Models | Lei et al. | 2605.12357 | cs.AI | LLM, memory, agent, online-learning, delta-rule | 8 |
| ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows | Liu et al. | 2605.12376 | cs.AI | agent, workflow, profiling, tabular-data, multi-agent | 7 |
| Intermediate Artifacts as First-Class Citizens in Agentic Systems | Rosen et al. | 2605.12087 | cs.AI | agent, artifacts, data-model, durable, systems | 7 |
| No Action Without a NOD: Heterogeneous Multi-Agent Architecture for Service Agents | Yang et al. | 2605.12240 | cs.AI | multi-agent, service-agent, architecture, heterogeneous | 7 |
| Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries | Laws et al. | 2605.12364 | cs.CR | agent, security, Byzantine, governance, distributed | 7 |
| SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces | Jin et al. | 2605.12015 | cs.CR | agent, safety, benchmark, attack, security | 7 |
| When Reasoning Traces Become Performative: Step-Level Evidence that CoT Is an Imperfect Oversight Channel | Li et al. | 2605.11746 | cs.AI | reasoning, chain-of-thought, oversight, performative | 7 |
| Digital Identity for Agentic Systems: Toward a Portable Authorization Standard | Madhira | 2605.11487 | cs.CR | agent, digital-identity, authorization, autonomous, standard | 6 |
| Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning | Deng et al. | 2605.11880 | cs.LG | multi-agent, RL, TD-Lambda, cooperative | 6 |
| Shaping Zero-Shot Coordination via State Blocking | Kang et al. | 2605.11688 | cs.LG | multi-agent, zero-shot, coordination, state-blocking | 6 |
| Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks | Yan et al. | 2605.11509 | cs.AI | LLM, agent, UAV, hierarchical, control, optimization | 6 |
| Scalable Token-Level Hallucination Detection in Large Language Models | Min et al. | 2605.12384 | cs.CL | LLM, hallucination, detection, token-level | 6 |
| Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling | Shapira et al. | 2605.12411 | cs.LG | agent, prediction, text-tabular, modeling | 6 |
| A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar | Taibi et al. | 2605.11720 | cs.SE | agent, software-engineering, research-agenda | 5 |
| MedHopQA: Disease-Centered Multi-Hop Reasoning Benchmark for Biomedical QA | Islamaj et al. | 2605.12361 | cs.CL | reasoning, multi-hop, benchmark, biomedical, QA | 5 |
| Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations | Moldovan-Mauer et al. | 2605.11789 | cs.AI | multi-agent, simulation, Monte-Carlo, incivility | 5 |
| Control Charts for Multi-agent Systems | Helm et al. | 2605.11135 | cs.MA | multi-agent, control-charts, monitoring, analysis | 5 |
| Distance-Constrained Unlabeled Multi-Agent Pathfinding | Suzuki et al. | 2605.11503 | cs.MA | multi-agent, pathfinding, distance-constrained | 5 |
| GeomHerd: Forward-looking Herding Quantification via Ricci Flow Geometry | Yang et al. | 2605.11645 | cs.MA | multi-agent, geometry, simulation, herding, Ricci-flow | 5 |
| Information and Contract Design for Repeated Interactions between Agents | Sreenivas et al. | 2605.11294 | cs.MA | multi-agent, contract-design, incentives, IJCAI | 5 |
Week-over-Week Summary
| Metric | This Week | Last Week | Change |
|---|---|---|---|
| Total papers | 122 | 98 | +24 (+24.5%) |
| Agent-related | 35 | 30 | +5 (+16.7%) |
| Multi-agent | 18 | 15 | +3 (+20.0%) |
| RAG-related | 4 | - | - |
| Reasoning | 8 | - | - |
| Tool-use | 5 | - | - |
| Memory | 7 | - | - |
| GUI agents | 4 | - | - |
| Hallucination detection | 3 | - | - |
| Security & governance | 3 | - | - |
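The Change column can be reproduced from the two count columns. The sketch below assumes the percentage is the difference divided by last week's count, rounded to one decimal, and that a missing prior value is shown as "-". The helper name `wow_change` is hypothetical.

```python
def wow_change(this_week, last_week):
    """Format a week-over-week change as 'delta (percent)'.

    Returns "-" when there is no prior count to compare against
    (None or zero), mirroring the placeholder used in the table.
    """
    if not last_week:
        return "-"
    delta = this_week - last_week
    percent = delta / last_week * 100
    return f"{delta:+d} ({percent:+.1f}%)"


if __name__ == "__main__":
    print(wow_change(122, 98))   # total papers
    print(wow_change(35, 30))    # agent-related
    print(wow_change(18, 15))    # multi-agent
    print(wow_change(4, None))   # RAG-related, no prior count
```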
Trends & Observations
GUI Agents Achieve Measurable SOTA Improvements: ToolCUA sets a new state of the art for Computer Use Agents, reaching 46.85% accuracy on OSWorld-MCP, a 66% relative improvement over baselines. Executable Agentic Memory (EAM) demonstrates that knowledge graph approaches can outperform existing models by 19.6% while reducing token costs by 6x. These results suggest that GUI agents are moving from proof-of-concept toward production-ready systems.
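To make the difference between relative and absolute gains concrete, the sketch below back-computes the baseline implied by the ToolCUA figures. It assumes "66% relative improvement" means (new - baseline) / baseline = 0.66 against a single baseline score. The result is an approximation derived from rounded numbers, not a value reported in the paper, and `implied_baseline` is a hypothetical helper.

```python
def implied_baseline(score, relative_gain):
    """Baseline score implied by a final score and a relative gain.

    Assumes relative_gain = (score - baseline) / baseline.
    """
    return score / (1.0 + relative_gain)


if __name__ == "__main__":
    new_score = 46.85  # accuracy in percent, from the summary above
    baseline = implied_baseline(new_score, 0.66)
    print(f"implied baseline: {baseline:.1f}%")
    print(f"absolute gain: {new_score - baseline:.1f} points")
```

Under that assumption, a 66% relative improvement corresponds to an absolute gain of under 20 percentage points, so the two figures should not be read interchangeably.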
Agent Memory Becomes a Distinct Research Subfield: LongMemEval-V2 introduces the first comprehensive benchmark dedicated to agent memory evaluation, with 451 questions spanning up to 500 trajectories (115M tokens). delta-mem presents a lightweight online memory mechanism achieving 1.31x improvement on MemoryAgentBench. The emergence of dedicated benchmarks and architectures for agent memory indicates this is crystallizing into a research area with its own evaluation frameworks.
Multi-Agent Systems Incorporate Spectral and Geometric Analysis Methods: Predictive Maps of Multi-Agent Reasoning applies successor representation spectral quantities to diagnose LLM communication topologies, finding that condition numbers perfectly predict perturbation robustness. GeomHerd introduces Ricci flow geometry for herding quantification. These mathematical approaches suggest the field is maturing beyond empirical observations toward principled analysis frameworks.
Plan-Execution Alignment via Trajectory Refinement: PIVOT achieves up to 94% relative improvement in constraint satisfaction by treating trajectories as optimizable objects with self-supervised refinement, using 3x-5x fewer tokens than baseline approaches. This directly addresses the gap between high-level planning and ground-truth execution that has limited agent reliability.
Agent Security and Governance Frameworks Maturing: Three papers address adversarial and authorization challenges: Byzantine adversary analysis for distributed governance, portable authorization standards for autonomous agents (46 pages), and skill-facing attack surface benchmarks. This indicates the research community is anticipating deployment risks at scale.
Scout Intel: What Others Missed
Confidence: high | Novelty Score: 65/100
ToolCUA's 46.85% accuracy on OSWorld-MCP represents a 66% relative improvement, but the deeper signal is the staged training paradigm that decouples GUI navigation from tool invocation. This separation allows agents to learn tool semantics independently of visual parsing, addressing the fundamental bottleneck that previous Computer Use Agents faced when tool availability varied across environments. The approach mirrors how human operators decompose complex tasks: first understand the interface, then select appropriate tools.
EAM's knowledge graph architecture demonstrates that retrieval-and-execution can replace end-to-end planning for GUI agents, reducing token costs by 6x while improving accuracy by 19.6%. This challenges the assumption that larger models with longer contexts are the path forward for agent systems: structured memory may be more efficient than brute-force scaling.
LongMemEval-V2's 451 questions across 500 trajectories (115M tokens) establish the first standardized evaluation for agent memory systems, creating a benchmark ecosystem where RAG, vector stores, and knowledge graphs can be compared on equal footing. Prior to this, agent memory papers used ad-hoc evaluation protocols, making cross-paper comparison impossible.
PIVOT's trajectory refinement achieving 94% constraint satisfaction improvement reveals that execution failures often stem from plan-environment misalignment, not fundamental capability gaps. Self-supervised correction of trajectories suggests agents can learn from their own mistakes without human intervention.
Predictive Maps' spectral analysis providing perfect prediction of perturbation robustness indicates that multi-agent LLM communication topologies have measurable failure modes that can be diagnosed before deployment.
Key Implication: The convergence of GUI agents, memory systems, and plan-execution alignment in a single week suggests the field is coalescing around three critical capabilities for production deployment: reliable interface interaction, persistent knowledge retention, and self-correcting execution loops.
Sources
- ArXiv cs.AI Recent Papers - ArXiv, May 2026
- ArXiv cs.MA (Multi-Agent) Recent Papers - ArXiv, May 2026
- ArXiv cs.CL (NLP) Recent Papers - ArXiv, May 2026