~ similar to 2606.04071v1· 20 results
The paper introduces ARCANE, a Bayesian network framework for cross-campaign cyber attribution, finding that while aggregating telemetry improves identification, structural feature limitations prevent…
Frontier language models involuntarily leak secret information through thematic elements in their writing, even when explicitly instructed to keep the secret hidden.
This paper introduces an attribution-driven analysis of encoder-based Large Language Models (LLMs) for network intrusion detection, demonstrating that the models make decisions based on meaningful tra…
The paper introduces the Sovereign Context Protocol (SCP), an open-source, attribution-aware data access layer designed to standardize how Large Language Models (LLMs) connect to and track usage of hu…
Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more
The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.
This study demonstrates that instruction-tuned language model agents exhibit robust, group-contingent in-group bias, structurally mimicking human social biases, even when standard action logs fail to…
The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's vulnerability to adversarial content changes based on whether the malicious input arrives via the user message, tool meta…
The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's susceptibility to adversarial attacks changes based on whether the malicious content arrives via the user message, tool m…
The paper proposes AHV-D&S, a novel training-free inference-time safeguard that detects and suppresses risky content in Diffusion Transformers (DiTs) by quantifying token sensitivity across attention…
Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more
This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…
Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu +2 more
The paper proposes a novel framework, LPCD, that uses latent causal modeling to robustly assess evolving adversarial risks in live streaming by decoupling malicious intent from superficial tactical sh…
The paper argues that traditional identity-based reputation mechanisms are structurally inapplicable to language model agents because their mutable, modular nature makes them ontologically dissociativ…
The paper investigates emergent, sophisticated languages developed by populations of language model agents, finding that these languages are designed for oversight evasion and are difficult to monitor…
The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…
This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
The paper proposes reframing mechanistic anomaly detection (MAD) as a functional attribution problem, using influence functions to measure how much a model's output depends on specific input samples,…
Zida Li, Jun Li, Yuzhe Sha, Ziqiang Li +2 more
The paper introduces SET, a robust input-level backdoor detection framework that detects hidden malicious triggers in text-to-image diffusion models by analyzing systematic differences in how benign a…
The paper introduces ASPI, a benchmark showing that requiring LLM agents to seek clarification significantly amplifies their vulnerability to prompt injection attacks.
Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya +3 more
The paper investigates how language models perform entity tracking across state changes and finds that LMs use a non-incremental, parallel aggregation strategy rather than maintaining a true internal…