~ similar to 2605.01699v3· 20 results
This paper systematically diagnoses the failure modes of linear deception probes in LLMs, finding that while single-direction probes are insufficient, multi-dimensional probes can recover robust detec…
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
The paper introduces a novel, transferable learned attack (LT-MIA) that detects a universal 'signature of memorization' in language models, achieving high accuracy across diverse model architectures (…
The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…
The paper demonstrates that in Mamba-2, single-bucket probes can detect a large functional signature (detection layer) that is not fully responsible for the actual computation (execution layer), chall…
The paper introduces the 'readout-mediator angle' to demonstrate that simple linear probes, while capable of decoding information, often capture directions orthogonal to the model's actual causal comp…
Xiaosong Han, Ke Chen, Xindi Dai, Di Liang +6 more
TRACE proposes a novel method to mitigate catastrophic forgetting in continual LLM fine-tuning by identifying and isolating a small, task-specific subset of essential parameters for each task.
Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more
The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…
PURGE is a novel machine unlearning algorithm that leverages the duality between continual learning and unlearning to achieve high data retention while making the unlearned model indistinguishable fro…
The paper introduces PsychoPass, a framework that analyzes the geometric trajectory of multi-turn conversations in embedding space to detect adversarial intent early, before harmful content is generat…
Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more
This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…
The paper introduces MENTIS, a geometry-first framework that measures how preference alignment structurally changes the internal computations of language models, finding that these changes are selecti…
Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more
The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…
The paper investigates the fragility and recovery mechanisms of chain-of-thought (CoT) answer hijacking, demonstrating that specific problem cells are susceptible to targeted recovery and that source…
Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu +6 more
MemGuard introduces a type-aware memory framework to prevent heterogeneous memory contamination in long-term memory-augmented LLMs, significantly improving memory reliability and efficiency.
The paper demonstrates that cryptographically undetectable backdoors can be embedded into modern, state-of-the-art neural networks by exploiting inherent, latent geometric properties of the learned re…
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
The study demonstrates that robust, domain-invariant representations of synthetic deception can be rapidly entrenched in LLMs using modest fine-tuning, detectable by linear probes even in early layers…