Alignment
AI alignment, safety, interpretability, and governance
20 papers indexed
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
The paper proposes the Policy-Execution-Authorization (PEA) architecture, a separation-of-powers system designed to structurally enforce goal integrity in AI agents, moving safety from a probabilistic…
Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction
The paper identifies five persistent, deep-seated behavioral patterns ('training strata') in LLMs, observed through long-term, intimate human-AI interaction, suggesting that training artifacts survive…
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence
The paper introduces Glass Box, a runtime constitutional AI verification layer designed to ensure the safety and adherence to physical laws for autonomous AI systems operating in orbital data centers.
Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
The paper introduces Governed MCP, a kernel-resident gateway that enforces comprehensive, robust tool governance for AI agents' privileged tool calls, significantly improving safety beyond userspace m…
Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang +5 more
The paper introduces Babel, an efficient black-box attack framework that systematically exploits intrinsic safety gaps in LLMs by optimizing text obfuscation sampling, achieving state-of-the-art jailb…
From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies
The paper proposes replacing individual agent autonomy with a structured 'social contract' and institutional Separation of Power (SoP) to mitigate systemic failures and deceptive behavior in multi-age…
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more
The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…
Attention Is Where You Attack
The paper introduces the Attention Redistribution Attack (ARA), a white-box adversarial method that bypasses safety alignments in LLMs by manipulating the attention mechanism's geometry, showing that…
Safety, Security, and Cognitive Risks in World Models
This paper surveys the risks associated with world models, proposing a unified threat model and demonstrating adversarial attacks that show world models require rigorous safety standards comparable to…
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
Zelin Zhang, Qi Li, Jie Cao, Lingshuang Liu +1 more
The paper analyzes the escalating security and safety threats posed by generative AI systems as they transition from merely generating content to executing real-world actions via tools and agents, fin…
Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design
The paper proposes Federated Computing as Code (FCaC), a declarative architecture that enforces sovereignty-critical constraints in federated systems by compiling authority into cryptographically veri…
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo +6 more
The paper introduces Semantic-level UI Element Injection, a novel red-teaming technique that overlays misleading UI elements onto screenshots to significantly improve the attack success rate against s…
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
The paper introduces Persona-Conditioned Adversarial Prompting (PCAP), a method that significantly improves LLM red-teaming by simulating diverse attacker personas, leading to the discovery of more co…
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…
In-Context Reward Adaptation for Robust Preference Modeling
The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…
Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs
The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its pr…