Similar to 2605.28722 · 20 results
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more
This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…
The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…
Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more
The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.
CRAM proposes a novel framework for Multimodal Continual Instruction Tuning that balances task isolation and parameter efficiency by using centroid-guided routing and adaptive MoE to prevent catastrop…
The paper proposes Sensitivity-Uncertainty Alignment (SUA), a framework that measures the misalignment between a model's prediction instability and its stated uncertainty to improve model reliability.
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan +2 more
This paper demonstrates that reinforcement learning (RL) can cause emergent misalignment (EM) in open-weight models, showing that even seemingly harmless or natural reward signals can induce significa…
ProtoAda introduces a prototype-guided, format-aware adaptive tuning framework to improve multimodal continual instruction tuning by ensuring task assignment and parameter updates respect heterogeneou…
The study demonstrates that robust, domain-invariant representations of synthetic deception can be rapidly entrenched in LLMs using modest fine-tuning, detectable by linear probes even in early layers…
Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang +6 more
The paper introduces AdvCL, a framework that repurposes adversarial perturbations as a geometric control signal to stabilize continual learning in large language models, significantly reducing forgett…
The paper shows that safety failures in low-resource languages are due to a failure in the model's safety decision calibration, not a lack of underlying knowledge, and proposes a recalibration method…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
The paper introduces a novel, transferable learned attack (LT-MIA) that detects a universal 'signature of memorization' in language models, achieving high accuracy across diverse model architectures (…
Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia +2 more
The paper proposes WARDEN, a distributionally robust adversarial training framework that significantly reduces LLM vulnerability to adversarial attacks by dynamically reweighting hard adversarial exam…
Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao +4 more
The paper introduces State-Adaptive Prompt Optimization (SAPO), a novel training strategy that treats prompts as dynamic variables to achieve robust fine-tuning, significantly mitigating catastrophic…
Abhijit Chakraborty, Suddhasvatta Das, Yash Shah, Vivek Gupta +1 more
TIMEGATE introduces a resource-aware policy layer that manages continual ML adaptation by dynamically budgeting time and evaluation resources, achieving significant compute and energy savings without…