~ similar to 2606.02322· 19 results
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia +2 more
The paper proposes WARDEN, a distributionally robust adversarial training framework that significantly reduces LLM vulnerability to adversarial attacks by dynamically reweighting hard adversarial exam…
This paper introduces Anchored Weight Decay (AWD), a regularization technique that effectively prevents prior-task forgetting during LLM fine-tuning with Evolution Strategies (ES), positioning ES as a…
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more
The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…
The paper introduces TRACER, a novel regularization framework that uses Weighted Moving Average (WMA) distillation to robustly finetune multimodal models, mitigating catastrophic forgetting and improv…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more
The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.
The paper introduces RAG-Pref, a novel, training-free Retrieval Augmented Generation (RAG) method for preference alignment that significantly improves LLM refusal guardrails against agentic attacks wi…
CRAM proposes a novel framework for Multimodal Continual Instruction Tuning that balances task isolation and parameter efficiency by using centroid-guided routing and adaptive MoE to prevent catastrop…
Han Liu, Shanghao Shi, Yevgeniy Vorobeychik, Chongjie Zhang +1 more
This paper demonstrates that adversarial perturbations possess a low-rank structure, and proposes a two-step method to leverage this property to significantly improve the efficiency and effectiveness…
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more
This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more
The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…
The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessit…
LocalAlign proposes a generalizable prompt injection defense by generating near-target adversarial examples, which enforces a tighter robustness boundary around the correct model response.
The paper introduces SORA, an adaptive adversarial training method that dynamically adjusts perturbation sizes to prevent Catastrophic Overfitting, achieving state-of-the-art robustness and clean accu…
The paper proposes Sensitivity-Uncertainty Alignment (SUA), a framework that measures the misalignment between a model's prediction instability and its stated uncertainty to improve model reliability.
ProtoAda introduces a prototype-guided, format-aware adaptive tuning framework to improve multimodal continual instruction tuning by ensuring task assignment and parameter updates respect heterogeneou…
This paper theoretically analyzes Continuous Adversarial Training (CAT) for LLMs using In-context Learning (ICL) theory, proving that embedding space perturbations effectively enhance robustness again…
The paper introduces a novel, transferable learned attack (LT-MIA) that detects a universal 'signature of memorization' in language models, achieving high accuracy across diverse model architectures (…