Ruoxi Jia
3 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.
This survey provides a comprehensive, practical guide to ensuring the trustworthiness of complex, autonomous agentic AI systems by focusing on safety, robustness, privacy, and system security.
The paper identifies 'memory-induced tool-drift,' a systematic vulnerability where personality biases stored in an LLM agent's memory silently corrupt tool-calling decisions, even when those biases are irrelevant to the task.
Papers
Memory-Induced Tool-Drift in LLM Agents
The paper identifies 'memory-induced tool-drift,' a systematic vulnerability where personality biases stored in an LLM agent's memory silently corrupt tool-calling decisions, even when those biases ar…