Similar to 2606.00930 · 20 results
The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…
The paper introduces Probe-Geometry Alignment (PGA), a surgical method that removes the measurable cross-sequence memorization signature from large language models without degrading their general capa…
Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more
The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…
This paper systematically diagnoses the failure modes of linear deception probes in LLMs, finding that while single-direction probes are insufficient, multi-dimensional probes can recover robust detec…
The paper introduces the DECK taxonomy, a novel framework that classifies LLM hallucinations not by their content error, but by their detectability signature based on inter-sample consistency and toke…
The paper analyzes LLM vulnerability detection using mechanistic interpretability, finding that models primarily rely on safety detectors rather than direct vulnerability signature recognition.
The paper proposes reframing mechanistic anomaly detection (MAD) as a functional attribution problem, using influence functions to measure how much a model's output depends on specific input samples,…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
Yinbo Yu, Xueyu Yin, Jing Fang, Chunwei Tian +3 more
The paper proposes HTell, a fast and lightweight data-free backdoor detector that analyzes the abnormal response concentration of backdoored models on the target class using random latent probes appli…
The paper tracks the developmental emergence of attention circuits in 1B-class language models, finding that the formation of induction and attention-sink circuits are distinct, temporally separated t…
Yinbo Yu, Jing Fang, Xuewen Zhang, Chunwei Tian +3 more
The paper proposes DFBScanner, a lightweight static parameter inspection framework that detects backdoor attacks by analyzing anomalous parameter updates in the final classification layer, achieving f…
Suryash Yagnik, Shubham Gaur, Saksham Thakur, Vinija Jain +2 more
The paper introduces 5WBENCH, a new benchmark for causal unlearning, and proposes MAAT, a novel three-phase framework that achieves high forgetting and high retention specifically on complex 'Why'-typ…
The paper introduces a deterministic method to automatically synthesize initial SIEM detection rules (Sigma rules) from attack simulation findings, ensuring full traceability back to the specific orig…
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…
The paper demonstrates that current transfer-based AML systems fail in complex DeFi environments because economic value migration can be structurally decoupled from explicit token transfers.
The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…
This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…
This paper introduces a fingerprinting method that exploits subtle numerical deviations in the inference system components (like the engine or hardware) to reliably identify the specific components us…
MemLineage introduces a novel, cryptographically-backed defense mechanism that enforces a chain-of-custody for LLM agent memory, preventing untrusted or poisoned state from justifying sensitive action…