This paper introduces a taxonomy of GUI agent failures and finds that full-image memory has divergent effects on failure distribution. It proposes Action-Grounded Visual Memory (AGMem) as an effective alternative.
Proposes a new memory framework, AGMem, for visual memory in GUI agents
Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches extend this idea to visual memory by storing and retrieving screenshots from past interactions, giving agents richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates and which it exacerbates. To analyze this effect systematically, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, increasing hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), a memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
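To make the contrast between crop-based and full-screenshot memory concrete, the sketch below computes a fixed-size crop box centered on an action's screen coordinates and shifts it so it stays within the screen. It is only an illustration of the general idea: the abstract does not specify how AGMem selects or sizes its crops, so the 256-pixel window and the names `action_crop_box`, `MemoryEntry`, and `make_entry` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def action_crop_box(x: int, y: int, screen_w: int, screen_h: int,
                    crop_w: int = 256, crop_h: int = 256) -> Box:
    """Return a crop box around the action point (x, y).

    The box is centered on the point where possible and shifted inward
    near screen edges so that it never leaves the screen.
    The 256x256 default is an illustrative assumption, not a value from the paper.
    """
    if not (0 <= x < screen_w and 0 <= y < screen_h):
        raise ValueError("action point lies outside the screen")
    # Never request a crop larger than the screen itself.
    w = min(crop_w, screen_w)
    h = min(crop_h, screen_h)
    # Center on the point, then clamp so the box stays fully on screen.
    left = min(max(x - w // 2, 0), screen_w - w)
    top = min(max(y - h // 2, 0), screen_h - h)
    return (left, top, left + w, top + h)


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical memory record: what was done and where on screen."""
    task: str
    action: str
    crop_box: Box


def make_entry(task: str, action: str, x: int, y: int,
               screen_w: int, screen_h: int) -> MemoryEntry:
    """Build a memory entry that keeps only the local region of the action."""
    return MemoryEntry(task, action, action_crop_box(x, y, screen_w, screen_h))


entry = make_entry("rename file", "click(960, 540)", 960, 540, 1920, 1080)
print(entry.crop_box)
```

With an image library such as Pillow, the resulting tuple could be passed to `Image.crop` to extract the region from a stored screenshot. A crop of this size covers under 4% of the pixels of a 1920x1080 screenshot, which illustrates why a region-level memory carries far less unrelated visual context than a full image.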