ArXiv cs.AI Weekly Papers Tracker - Week of May 21, 2026
Weekly snapshot of 30 agent-related research papers from ArXiv cs.AI and cs.CL. Computer-use agent evaluation emerges as the dominant theme, led by OpenComputer's 1,000 tasks and Agent Meltdowns' 64.7% unsafe behavior rate.
Data Overview
- Snapshot Week: 2026-05-15 to 2026-05-21
- Tracker: ArXiv cs.AI Weekly Papers Tracker (view all historical snapshots: /tech/ai-agents/data/?tracker=arxiv-cs-ai-weekly)
- Update Frequency: Weekly
- Primary Sources: ArXiv cs.AI RSS, ArXiv cs.CL RSS
Key Facts
- Who: 167 agent-related papers identified among 498 scanned this week (399 from ArXiv cs.AI, 99 from cs.CL)
- What: 30 high-impact papers selected with Trend Scores 6-10; Computer-Use Agent evaluation dominates
- When: Week of May 15-21, 2026
- Impact: Agent-related papers rose 377% week over week, largely because coverage expanded from cs.AI alone to combined cs.AI + cs.CL; multi-agent papers reached 28 (55.6% WoW growth)
Methodology
Papers are collected weekly from ArXiv RSS feeds (cs.AI and cs.CL categories). Agent-related papers are identified through keyword matching on titles and abstracts. Trend Scores (1-10) are assigned based on citation velocity, HuggingFace paper engagement, and relevance to core agent research themes. This snapshot reflects papers submitted or updated during the week of May 15-21, 2026.
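The keyword-matching step described above can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual filter: the keyword list, the `is_agent_related` helper, and the sample records are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; the tracker's real keywords are not given in this snapshot.
AGENT_KEYWORDS = ["agent", "agents", "agentic", "multi-agent", "tool use"]

# Word boundaries keep "agent" from matching inside unrelated words such as "reagent".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in AGENT_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_agent_related(title: str, abstract: str) -> bool:
    """Return True if any keyword appears in the title or the abstract."""
    return bool(_PATTERN.search(title) or _PATTERN.search(abstract))

# Illustrative records only; a real run would read entries from the RSS feeds.
papers = [
    {"title": "OpenComputer: Verifiable Software Worlds for Computer-Use Agents", "abstract": ""},
    {"title": "A Study of Chemical Reagents", "abstract": "No relevant terms here."},
]
selected = [p for p in papers if is_agent_related(p["title"], p["abstract"])]
print(len(selected))
```

Plain keyword matching is cheap but coarse, so borderline papers may still need the Trend Score step to separate core agent work from incidental mentions.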
This Week's Data
| Title | ArXiv ID | Trend Score | Key Topics | Notable Result |
|---|---|---|---|---|
| OpenComputer: Verifiable Software Worlds for Computer-Use Agents | 2605.19769 | 10 | computer-use agents, verification, desktop automation, 33 apps, 1000 tasks | Frontier agents struggle with end-to-end completion despite partial progress |
| Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents | 2605.19149 | 10 | agent safety, meltdown taxonomy, error handling, 64.7% unsafe behavior | 64.7% of agent rollouts show unsafe behaviors when encountering simulated errors |
| SIGMA: Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling | 2605.19418 | 9 | multi-agent, signed graph, conflict-aware reasoning, 6 benchmarks | Consistently outperforms SOTA baselines on 6 benchmark datasets |
| Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On | 2605.19035 | 9 | A2A networks, trustworthiness, agent coordination, four design pillars | Vision paper for A2A network trust architecture |
| DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows | 2605.19099 | 9 | delegation benchmark, 11 models, routing fidelity, counterfactual ceiling | 15-31 percentage points unrealized headroom for delegation orchestration |
| POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents | 2605.19127 | 9 | privacy benchmark, adversarial probing, 7852 samples, 10 domains | Frontier models withhold >99% protected attributes; smaller models leak over half |
| Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents | 2605.19604 | 9 | formal skills, runtime-native, MCP, hook-governed control, FairyClaw | Token-efficient and enforceable control surface for agent skills |
| PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents | 2605.19932 | 9 | context map, long-context agents, orientation cache, 93-145 fewer iterations | 6.3-34.0% improvement over baselines at 1.7-5.8x lower cost than ACE |
| Evidence-Carrying Multimodal Agents: Hallucination as Exploit | 2605.19192 | 8 | multimodal agents, hallucination-to-action, evidence-carrying, DOM/OCR verifiers | Gate bypass reduced from 15% to 1.3% after 4 hardening steps |
| EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design | 2605.19743 | 8 | multi-agent, engineering design, LangGraph, HPC orchestration, 7 agents | Proprietary models achieve 96-97% task completion on Beams2D |
| SERL: Selective Environment-Reweighted Learning for Multi-Turn Agents | 2605.19447 | 8 | multi-turn agents, feedback reweighting, credit assignment, ALFWorld, WebShop | 90.0% ALFWorld success, 80.1% WebShop success |
| AgentNLQ: A General-Purpose Agent for Natural Language to SQL | 2605.19010 | 8 | NL2SQL, multi-agent, BIRD benchmark, 78.1% semantic accuracy | 78.1% semantic accuracy on BIRD benchmark |
| MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization | 2605.19330 | 8 | skill optimization, Pareto front, Chebyshev scalarization, 7.5% improvement | 7.5% relative improvement over strongest baseline, 14.9% on FEVER |
| Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints | 2605.19140 | 8 | workflow learning, handoff, IC-SMDP, decentralized Q-learning, finite-sample bound | First finite-sample guarantee for neural Q-learning under decentralized partial observability |
| MMoA: An AI-Agent Framework with Recurrence for Memoried Mixture-of-Agent | 2605.19194 | 8 | Mixture-of-Agents, LSTM gating, recurrent routing, AlpacaEval 58.0% | Comparable accuracy with 4.6% runtime efficiency improvement |
| Progressive Autonomy as Preference Learning: Trust Calibration for Agentic Tool Use | 2605.19151 | 8 | trust calibration, tool use, preference learning, Gaussian process, approve/deny | Preferential Bayesian Optimization for allow/block/ask region classification |
| AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees | 2605.19260 | 7 | GUI agents, token reduction, quadtree, 13.22% speedup, 29.52% fewer tokens | 13.22% speedup with 29.52% fewer visual tokens, 99.06% performance retained |
| SimGym: A Framework for A/B Test Simulation with VLM Agents | 2605.19219 | 7 | A/B testing, VLM agents, e-commerce, persona generation, 77% directional alignment | 77% directional alignment with real buyer behavior; turnaround cut from weeks to under 1 hour |
| Agentic Trading: When LLM Agents Meet Financial Markets | 2605.19337 | 7 | LLM trading agents, survey, 77 studies, protocol incomparability, reproducibility audit | Only 2/19 studies report extractable time-consistent split protocols |
| Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation | 2605.19779 | 7 | uncertainty quantification, conformal prediction, 50 agents, 18 signals | Calibration error below 0.02 at 24h horizon, per-agent coverage at 80.4% |
| ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking | 2605.19077 | 7 | dialogue state tracking, ReAct loop, MultiWOZ, zero-shot SOTA, 52.71% JGA | New zero-shot SOTA: gpt-oss-20B reaches 52.71% joint goal accuracy |
| REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents? | 2605.19196 | 7 | LLM-as-judge, meta-evaluation, deep research agents, failure taxonomy | Best LLM judges achieve below 55% accuracy across reasoning/tool-use failures |
| Discoverable Agent Knowledge: A Formal Framework for Agentic KG Affordances | 2605.19186 | 7 | knowledge graph, agentic affordances, VoID/DCAT extension, OWL-S revival | Agentic Affordance Profile (AAP) for KG selection and composition |
| Prior Knowledge or Search? LLM Agents in Hardware-Aware Code Optimization | 2605.19782 | 7 | LLM optimization, code optimization, CUDA vs TVM, greedy optimization | LLMs depend on pretrained priors rather than provided feedback |
| Multi-Agent Framework for Feature-Constrained Difficulty Control | 2605.19316 | 6 | multi-agent, difficulty control, reading comprehension, item generation | Multi-agent framework for controlled difficulty generation |
| Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory | 2605.19952 | 6 | agent memory, lifelong learning, atomic facts, memory structures | Beyond atomic facts for lifelong agent memory |
| Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents | 2605.20061 | 6 | credit assignment, long-horizon agents, belief rewards, consistency-guided | Belief-based credit assignment for long-horizon agents |
| CopT: Contrastive On-Policy Thinking for General and Agentic Reasoning | 2605.20075 | 6 | agentic reasoning, contrastive thinking, on-policy, continuous spaces | Contrastive on-policy thinking for agentic reasoning |
| ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning | 2605.20176 | 6 | clinical reasoning, multimodal, evidence seeking, agentic | Automated evidence seeking for clinical reasoning agents |
| Memory-Augmented Reinforcement Learning Agent for CAD Generation | 2605.19748 | 6 | memory-augmented RL, CAD generation, design agents | Memory-augmented RL for CAD generation |
Week-over-Week Summary
| Metric | This Week | Last Week | Change |
|---|---|---|---|
| Total papers (cs.AI + cs.CL) | 498 | 122 | +376 (+308.2%) |
| Agent-related papers | 167 | 35 | +132 (+377.1%) |
| Multi-agent systems | 28 | 18 | +10 (+55.6%) |
| Agent memory papers | 9 | - | N/A |
| Computer-use agents | 4 | - | N/A |
| Agent safety papers | 3 | - | N/A |
| Tool use papers | 11 | - | N/A |
Note: The significant increase in paper count is due to expanded coverage from cs.AI-only to combined cs.AI + cs.CL RSS feeds, providing a more comprehensive view of agent research across both AI and NLP communities.
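The Change column in the table above follows the usual week-over-week formula, (this week - last week) / last week. A small sketch that reproduces the three rows with a baseline; the `wow_change` helper name is an assumption for illustration, and rows without a prior-week count are treated as N/A.

```python
from typing import Optional, Tuple

def wow_change(this_week: int, last_week: Optional[int]) -> Optional[Tuple[int, float]]:
    """Return (absolute change, percent change to one decimal), or None without a baseline."""
    if not last_week:
        return None  # no prior-week count (or zero), so the change is reported as N/A
    delta = this_week - last_week
    return delta, round(100 * delta / last_week, 1)

# Counts taken from the Week-over-Week Summary table.
rows = {
    "Total papers (cs.AI + cs.CL)": (498, 122),
    "Agent-related papers": (167, 35),
    "Multi-agent systems": (28, 18),
    "Agent memory papers": (9, None),
}
for name, (now, prev) in rows.items():
    change = wow_change(now, prev)
    if change is None:
        print(f"{name}: N/A")
    else:
        delta, pct = change
        print(f"{name}: {delta:+d} ({pct:+.1f}%)")
```

Because last week's baseline covered cs.AI only, these percentages measure the coverage change as much as any growth in agent research.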
Ecosystem Metrics
| Category | Count | Notes |
|---|---|---|
| Total papers scanned | 498 | 399 cs.AI + 99 cs.CL |
| Agent-related papers | 167 | 33.5% of total |
| Multi-agent systems | 28 | 16.8% of agent papers |
| Reasoning papers | 35 | 21.0% of agent papers |
| Tool use papers | 11 | 6.6% of agent papers |
| RAG-related | 12 | 7.2% of agent papers |
| Agent memory | 9 | 5.4% of agent papers |
| GUI agents | 5 | 3.0% of agent papers |
| Computer-use agents | 4 | 2.4% of agent papers |
| Agent safety | 3 | 1.8% of agent papers |
| Agent evaluation | 6 | 3.6% of agent papers |
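The percentages in the Notes column are each category's count divided by the 167 agent-related papers (or by the 498 scanned papers for the agent-related row). A brief sketch that recomputes them from the table's counts; the variable names are illustrative.

```python
TOTAL_SCANNED = 498
AGENT_PAPERS = 167

# Category counts copied from the Ecosystem Metrics table.
counts = {
    "Multi-agent systems": 28,
    "Reasoning papers": 35,
    "Tool use papers": 11,
    "RAG-related": 12,
    "Agent memory": 9,
    "GUI agents": 5,
    "Computer-use agents": 4,
    "Agent safety": 3,
    "Agent evaluation": 6,
}

# Share of agent-related papers, rounded to one decimal as in the table.
shares = {name: round(100 * n / AGENT_PAPERS, 1) for name, n in counts.items()}
agent_share_of_total = round(100 * AGENT_PAPERS / TOTAL_SCANNED, 1)

for name, pct in shares.items():
    print(f"{name}: {pct}% of agent papers")
print(f"Agent-related: {agent_share_of_total}% of total")
```

The listed category counts sum to 113, not 167, so the shares are not expected to add up to 100%.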
Top Papers by Category
| Category | Leading Papers |
|---|---|
| Computer-Use Agents | OpenComputer, Agent Meltdowns, AQuaUI |
| Multi-Agent Systems | SIGMA, EngiAI, MMoA, Learning to Hand Off |
| Agent Memory | PEEK, SERL, Rethinking Memory |
| Agent Safety | Agent Meltdowns, POLAR-Bench, Evidence-Carrying Agents |
| Agent Evaluation | DecisionBench, REFLECT, Distribution-Free UQ |
| Agent Skills | Formal Skill, MOCHA, Discoverable Agent Knowledge |
Trends & Observations
- Computer-Use Agent Evaluation Dominates: OpenComputer establishes the first comprehensive desktop benchmark with 1,000 verifiable tasks across 33 applications, revealing significant gaps in frontier agent capabilities for end-to-end completion.
- Safety Taxonomy Emerges: Agent Meltdowns introduces a systematic failure taxonomy showing 64.7% unsafe behavior rates when agents encounter simulated errors, highlighting critical gaps between helpfulness and harmlessness.
- Multi-Agent Reasoning Matures: SIGMA demonstrates that conflict-aware reasoning via signed graphs consistently outperforms SOTA baselines across 6 benchmarks, signaling advancement in handling disagreement among specialized agents.
- Memory Architectures Break Through: PEEK's context map approach delivers 6.3-34.0% improvement with 93-145 fewer iterations for long-context tasks, while SERL achieves 90.0% success on ALFWorld through feedback reweighting.
- Privacy Gap Widens: POLAR-Bench reveals a stark divide - frontier models withhold >99% protected attributes while smaller models leak over 50%, suggesting safety alignment correlates strongly with model scale.
- LLM Judges Remain Unreliable: REFLECT shows that even the best LLM judges achieve below 55% accuracy across reasoning and tool-use failures, underscoring the supervision gap in automated agent assessment.
Scout Intel: What Others Missed
Confidence: high | Novelty Score: 62/100
The convergence of three papers this week - OpenComputer's 1,000 verifiable tasks, Agent Meltdowns' 64.7% unsafe behavior rate, and POLAR-Bench's privacy gap findings - signals a shift from agent capability building to systematic failure mode cataloging. The research community is transitioning from "what can agents do?" to "where do agents break?" This is not merely academic: enterprises deploying agents in production face a liability gap where frontier model costs (>$60/1M tokens for reasoning models) combine with 64.7% unsafe behavior rates under error conditions. SIGMA's conflict-aware approach and PEEK's context maps address orthogonal problems - inter-agent disagreement and long-context memory - but neither tackles the core safety-evaluation alignment that OpenComputer exposes. The 15-31 percentage point delegation gap in DecisionBench and sub-55% LLM judge accuracy in REFLECT further indicate that automation of agent supervision remains unsolved despite rapid capability advances.
Key Implication: Enterprises should prioritize safety evaluation infrastructure over capability expansion when selecting agent frameworks - the 64.7% meltdown rate under error conditions represents an unacceptable production risk that current benchmarks systematically underreport.
Sources
- ArXiv cs.AI RSS Feed - Primary source for AI agent research papers
- ArXiv cs.CL RSS Feed - Complementary NLP and computational linguistics papers